0. Principal component analysis (PCA)

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (or sometimes, principal modes of variation). The number of principal components is less than or equal to the smaller of the number of original variables or the number of observations. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
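The sensitivity to scaling can be seen in a small sketch. The code below is not a general PCA implementation: it assumes two variables only and uses the closed-form eigen-decomposition of a 2x2 covariance matrix. The function name `pca_2d` and the sample points are made up for illustration.

```python
import math

def pca_2d(points):
    """Return (variances, first_axis) for 2-D points via the covariance matrix."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix entries [[a, b], [b, c]]
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first
    half_trace = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    variances = (half_trace + radius, half_trace - radius)
    # Direction of the first principal component (unit vector)
    theta = 0.5 * math.atan2(2 * b, a - c)
    first_axis = (math.cos(theta), math.sin(theta))
    return variances, first_axis

# Hypothetical sample: two positively correlated variables on similar scales
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
variances, axis = pca_2d(points)

# Multiply the first variable by 100 (e.g. a change of units) and repeat
scaled = [(x * 100, y) for x, y in points]
variances_scaled, axis_scaled = pca_2d(scaled)

print("original:", variances, axis)
print("rescaled:", variances_scaled, axis_scaled)
```

On the original points the first component weights both variables roughly equally. After one variable is multiplied by 100, the first component points almost entirely along that variable, which is why rescaling or standardizing usually comes before PCA.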

1. Rescale Data

When your data consists of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes so that they all share the same scale.

This is often referred to as normalization, and attributes are typically rescaled into the range between 0 and 1. Rescaling is useful for the optimization algorithms at the core of many machine learning methods, such as gradient descent. It is also useful for algorithms that weight inputs, like regression and neural networks, and for algorithms that use distance measures, like K-Nearest Neighbors.

You can rescale your data in scikit-learn using the MinMaxScaler class.

# Rescale data (between 0 and 1)
import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

After rescaling you can see that all of the values are in the range between 0 and 1.

[[ 0.353  0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.     0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]
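Min-max rescaling comes down to x' = (x - min) / (max - min), computed per column. The sketch below spells that out in plain Python. It is an illustration rather than scikit-learn's implementation; the function name `min_max_scale`, the toy sample, and the handling of constant columns are assumptions.

```python
def min_max_scale(rows, feature_range=(0.0, 1.0)):
    """Rescale each column of a list-of-rows matrix into feature_range."""
    low, high = feature_range
    columns = list(zip(*rows))
    col_min = [min(col) for col in columns]
    col_max = [max(col) for col in columns]
    scaled = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, col_min, col_max):
            span = hi - lo
            # Assumption: a constant column maps to the lower bound
            unit = (value - lo) / span if span else 0.0
            new_row.append(low + unit * (high - low))
        scaled.append(new_row)
    return scaled

# Hypothetical toy sample: three attributes on very different scales
sample = [[6, 148, 72], [1, 85, 66], [8, 183, 64], [0, 137, 40]]
scaled_sample = min_max_scale(sample)
for row in scaled_sample:
    print([round(v, 3) for v in row])
```

Each column's smallest value becomes 0 and its largest becomes 1, so a single extreme value compresses all the others toward one end of the range.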

2. Standardize Data

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.

It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminant analysis.

You can standardize data using scikit-learn with the StandardScaler class.

# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

The values for each attribute now have a mean value of 0 and a standard deviation of 1.

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
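Standardization is the z-score z = (x - mean) / std, computed per column. The following plain-Python sketch shows the arithmetic. It uses the population standard deviation (dividing by n), which is what StandardScaler uses; the function name `standardize` and the toy sample are made up for illustration.

```python
import statistics

def standardize(rows):
    """Return rows with each column shifted to mean 0 and scaled to std 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Population standard deviation (divide by n, not n - 1)
    stds = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column has std 0; its centered values are all 0 anyway
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical toy sample: three attributes on very different scales
sample = [[6, 148, 72], [1, 85, 66], [8, 183, 64], [0, 137, 40]]
standardized = standardize(sample)
for row in standardized:
    print([round(v, 3) for v in row])
```

Unlike min-max rescaling, the result is not bounded to a fixed range: values far from the column mean, such as the 5.485 in the output above, stay far from 0.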

3. Normalize Data

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).

This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as K-Nearest Neighbors.

You can normalize data in Python with scikit-learn using the Normalizer class.

# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])

The rows are normalized to length 1.

[[ 0.034  0.828  0.403  0.196  0.     0.188  0.004  0.28 ]
 [ 0.008  0.716  0.556  0.244  0.     0.224  0.003  0.261]
 [ 0.04   0.924  0.323  0.     0.     0.118  0.003  0.162]
 [ 0.007  0.588  0.436  0.152  0.622  0.186  0.001  0.139]
 [ 0.     0.596  0.174  0.152  0.731  0.188  0.01   0.144]]
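Note that this operation works on rows, not columns: each observation is divided by its own Euclidean (L2) length, which is the default norm in Normalizer. A minimal plain-Python sketch follows; the function name `normalize_rows`, the toy rows, and the treatment of all-zero rows are assumptions.

```python
import math

def normalize_rows(rows):
    """Scale each row so that its Euclidean (L2) length is 1."""
    normalized = []
    for row in rows:
        length = math.sqrt(sum(v * v for v in row))
        # Assumption: an all-zero row is left unchanged to avoid dividing by zero
        normalized.append([v / length for v in row] if length else list(row))
    return normalized

# Hypothetical toy rows
sample = [[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]]
unit_rows = normalize_rows(sample)
for row in unit_rows:
    print(row, "length:", math.sqrt(sum(v * v for v in row)))
```

Because every row is scaled independently, the relative sizes of values within a row are preserved, but differences in overall magnitude between rows are discarded.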

4. Binarize Data (Make Binary)

You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.

This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering, when you want to add new features that indicate something meaningful.

You can create new binary attributes in Python using scikit-learn with the Binarizer class.

# binarization
from sklearn.preprocessing import Binarizer
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])

You can see that all values less than or equal to 0 are marked 0 and all of those above 0 are marked 1.

[[ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  0.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]
 [ 0.  1.  1.  1.  1.  1.  1.  1.]]
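The thresholding rule itself is a strict comparison: a value becomes 1 only if it is greater than the threshold, and a value exactly equal to the threshold becomes 0. The sketch below applies that rule to probabilities with a threshold of 0.5. The function name `binarize` and the sample probabilities are made up for illustration.

```python
def binarize(rows, threshold=0.0):
    """Map values strictly above threshold to 1.0 and all others to 0.0."""
    return [[1.0 if v > threshold else 0.0 for v in row] for row in rows]

# Hypothetical predicted probabilities turned into crisp 0/1 values
probabilities = [[0.91, 0.50], [0.12, 0.51]]
crisp = binarize(probabilities, threshold=0.5)
print(crisp)
```

The choice of threshold carries the meaning of the new feature, so it is worth picking a value that corresponds to something interpretable in the data.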

Summary

In this post you discovered how you can prepare your data for machine learning in Python using scikit-learn.

You now have recipes to:

  • Rescale data.
  • Standardize data.
  • Normalize data.
  • Binarize data.

Your action step for this post is to type or copy-and-paste each recipe and get familiar with data preprocessing in scikit-learn.
