神经网络中的数据预处理方法 Data Preprocessing
0、Principal component analysis (PCA)
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (or sometimes, principal modes of variation). The number of principal components is less than or equal to the smaller of the number of original variables or the number of observations. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
1. Rescale Data
When your data is comprised of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes to all have the same scale.
Often this is referred to as normalization and attributes are often rescaled into the range between 0 and 1. This is useful for optimization algorithms in used in the core of machine learning algorithms like gradient descent. It is also useful for algorithms that weight inputs like regression and neural networks and algorithms that use distance measures like K-Nearest Neighbors.
You can rescale your data using scikit-learn using the MinMaxScaler class.
# Rescale data (between 0 and 1)
import pandas
import scipy
import numpy
from sklearn.preprocessing import MinMaxScaler
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])
After rescaling you can see that all of the values are in the range between 0 and 1.
[[ 0.353 0.744 0.59 0.354 0. 0.501 0.234 0.483]
[ 0.059 0.427 0.541 0.293 0. 0.396 0.117 0.167]
[ 0.471 0.92 0.525 0. 0. 0.347 0.254 0.183]
[ 0.059 0.447 0.541 0.232 0.111 0.419 0.038 0. ]
[ 0. 0.688 0.328 0.354 0.199 0.642 0.944 0.2 ]]
2. Standardize Data
Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminate analysis.
You can standardize data using scikit-learn with the StandardScaler class.
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])
The values for each attribute now have a mean value of 0 and a standard deviation of 1.
[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426]
[-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191]
[ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106]
[-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042]
[-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]
3. Normalize Data
Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).
This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as K-Nearest Neighbors.
You can normalize data in Python with scikit-learn using the Normalizer class.
# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])
The rows are normalized to length 1.
[[ 0.034 0.828 0.403 0.196 0. 0.188 0.004 0.28 ]
[ 0.008 0.716 0.556 0.244 0. 0.224 0.003 0.261]
[ 0.04 0.924 0.323 0. 0. 0.118 0.003 0.162]
[ 0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]
[ 0. 0.596 0.174 0.152 0.731 0.188 0.01 0.144]]
4. Binarize Data (Make Binary)
You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.
This is called binarizing your data or threshold your data. It can be useful when you have probabilities that you want to make crisp values. It is also useful when feature engineering and you want to add new features that indicate something meaningful.
You can create new binary attributes in Python using scikit-learn with the Binarizer class.
# binarization
from sklearn.preprocessing import Binarizer
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])
You can see that all values equal or less than 0 are marked 0 and all of those above 0 are marked 1.
[[ 1. 1. 1. 1. 0. 1. 1. 1.]
[ 1. 1. 1. 1. 0. 1. 1. 1.]
[ 1. 1. 1. 0. 0. 1. 1. 1.]
[ 1. 1. 1. 1. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1. 1. 1. 1.]]
Summary
In this post you discovered how you can prepare your data for machine learning in Python using scikit-learn.
You now have recipes to:
- Rescale data.
- Standardize data.
- Normalize data.
- Binarize data.
Your action step for this post is to type or copy-and-paste each recipe and get familiar with data preprocesing in scikit-learn.
神经网络中的数据预处理方法 Data Preprocessing的更多相关文章
- sklearn中常用数据预处理方法
1. 标准化(Standardization or Mean Removal and Variance Scaling) 变换后各维特征有0均值,单位方差.也叫z-score规范化(零均值规范化).计 ...
- python中常用的九种数据预处理方法分享
Spyder Ctrl + 4/5: 块注释/块反注释 本文总结的是我们大家在python中常见的数据预处理方法,以下通过sklearn的preprocessing模块来介绍; 1. 标准化(St ...
- sklearn中的数据预处理和特征工程
小伙伴们大家好~o( ̄▽ ̄)ブ,沉寂了这么久我又出来啦,这次先不翻译优质的文章了,这次我们回到Python中的机器学习,看一下Sklearn中的数据预处理和特征工程,老规矩还是先强调一下我的开发环境是 ...
- 机器学习实战基础(八):sklearn中的数据预处理和特征工程(一)简介
1 简介 数据挖掘的五大流程: 1. 获取数据 2. 数据预处理 数据预处理是从数据中检测,纠正或删除损坏,不准确或不适用于模型的记录的过程 可能面对的问题有:数据类型不同,比如有的是文字,有的是数字 ...
- 返回数据中提取数据的方法(JSON数据取其中某一个值的方法)
返回数据中提取数据的方法 比如下面的案例是,取店铺名称 接口返回数据如下: {"Code":0,"Msg":"ok","Data& ...
- 最简单删除SQL Server中所有数据的方法
最简单删除SQL Server中所有数据的方法 编写人:CC阿爸 2014-3-14 其实删除数据库中数据的方法并不复杂,为什么我还要多此一举呢,一是我这里介绍的是删除数据库的所有数据,因为数据之间 ...
- 关于VUE调用父实例($parent) 根实例 中的数据和方法
this.$parent或者 this.$root 在子组件中判断this.$parent获取的实例是不是父组件的实例 在子组件中console.log(this.$parent) 在父组件中con ...
- 使用JDBC从数据库中查询数据的方法
* ResultSet 结果集:封装了使用JDBC 进行查询的结果 * 1. 调用Statement 对象的 executeQuery(sql) 方法可以得到结果集 * 2. ResultSet 返回 ...
- 机器学习实战基础(九):sklearn中的数据预处理和特征工程(二) 数据预处理 Preprocessing & Impute 之 数据无量纲化
1 数据无量纲化 在机器学习算法实践中,我们往往有着将不同规格的数据转换到同一规格,或不同分布的数据转换到某个特定分布的需求,这种需求统称为将数据“无量纲化”.譬如梯度和矩阵为核心的算法中,譬如逻辑回 ...
随机推荐
- js 正则表达式 验证小数点后几位
function IsFloatByBit (value, state, bit) { if (state == false) { var re ...
- java性能监控工具:jmap命令详解
.命令基本概述 Jmap是一个可以输出所有内存中对象的工具,甚至可以将VM 中的heap,以二进制输出成文本.打印出某个java进程(使用pid)内存内的,所有‘对象’的情况(如:产生那些对象,及其数 ...
- iOS 给tableView设置contentInset不生效?
给tableView设置contentInset的时候如果tableView中内容比较多,超过一个屏幕,设置的contentInset是生效的,但是呢,如果页面内容比较少,我们会发现设置content ...
- Ubuntu 12.04 server 如何安装 OpenERP 7(转)
不经意的一次看到OpenERP这个开源ERP,就被其丰富的功能,简洁的画面,熟悉的语言所吸引.迫不及待的多方查询资料,自己架设一个测试环境来进行了解.以下为测试安装时候的步骤说明,以备查询,并供有需要 ...
- JVM Specification 9th Edition (2) Chapter 1. Introduction
Chapter 1. Introduction 翻译太累了,我就这样的看英文吧. 内容列表 1.1. A Bit of History 1.2. The Java Virtual Machine 1. ...
- PAT001 一元多项式求导
题目: 设计函数求一元多项式的导数.(注:xn(n为整数)的一阶导数为n*xn-1.) 输入格式:以指数递降方式输入多项式非零项系数和指数(绝对值均为不超过1000的整数).数字间以空格分隔. 输出格 ...
- 第二百二十五节,jQuery EasyUI,PropertyGird(属性表格)组件
jQuery EasyUI,PropertyGird(属性表格)组件 学习要点: 1.加载方式 2.属性列表 3.方法列表 本节课重点了解 EasyUI 中 PropertyGird(属性表格)组件的 ...
- mysql DBA 指南
Mysql目录 数据库介绍.常见分类 Mysql入门 Mysql安装配置 Mysql多实例安装配置 Mysql常用基本命令 Mysql权限体系 Mysql数据库备份和恢复 Mysql日志 Mysql逻 ...
- TortoiseGit上传项目代码到github方法(超简单)
Github是咱广大开发者用的非常多的项目代码版本管理网站,项目托管可以是私人的(private)或者公开的(public),私人的收费,一个月7美金.咱这里就只说我们个人使用的,一般都是代码对外开放 ...
- Sublime Text 加入右键菜单
Sublime Text 2 是现在很受大家欢迎的编辑器了,不仅是在web前端,在书定简单的php.Js等代码时,也是相当的好用,再配合多种的插件和新颖的界面,更是让人欲罢不能. 在使用时,我们通过喜 ...