[Feature] Compare the effect of different scalers
Ref: Compare the effect of different scalers on data with outliers
主要是对该代码的学习研究。
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer from sklearn.datasets import fetch_california_housing print(__doc__) dataset = fetch_california_housing()
X_full, y_full = dataset.data, dataset.target # Take only 2 features to make visualization easier
# Feature of 0 has a long tail distribution.
# Feature 5 has a few but very large outliers. X = X_full[:, [0, 5]]
################################################################
distributions = [
('Unscaled data', X),
('Data after standard scaling',
StandardScaler().fit_transform(X)),
('Data after min-max scaling',
MinMaxScaler().fit_transform(X)),
('Data after max-abs scaling',
MaxAbsScaler().fit_transform(X)),
('Data after robust scaling',
RobustScaler(quantile_range=(25, 75)).fit_transform(X)),
('Data after power transformation (Yeo-Johnson)',
PowerTransformer(method='yeo-johnson').fit_transform(X)),
('Data after power transformation (Box-Cox)',
PowerTransformer(method='box-cox').fit_transform(X)),
('Data after quantile transformation (gaussian pdf)',
QuantileTransformer(output_distribution='normal').fit_transform(X)),
('Data after quantile transformation (uniform pdf)',
QuantileTransformer(output_distribution='uniform').fit_transform(X)),
('Data after sample-wise L2 normalizing',
Normalizer().fit_transform(X)),
] # scale the output between 0 and 1 for the colorbar
y = minmax_scale(y_full)
Original data
Each transformation is plotted showing two transformed features, with the left plot showing the entire dataset, and the right zoomed-in to show the dataset without the marginal outliers. A large majority of the samples are compacted to a specific range, [0, 10] for the median income and [0, 6] for the number of households. Note that there are some marginal outliers (some blocks have more than 1200 households). Therefore, a specific pre-processing can be very beneficial depending of the application. In the following, we present some insights and behaviors of those pre-processing methods in the presence of marginal outliers.
make_plot(0)
StandardScaler
StandardScaler
removes the mean and scales the data to unit variance. However, the outliers have an influence when computing the empirical mean and standard deviation which shrink the range of the feature values as shown in the left figure below. Note in particular that because the outliers on each feature have different magnitudes, the spread of the transformed data on each feature is very different: most of the data lie in the [-2, 4] range for the transformed median income feature while the same data is squeezed in the smaller [-0.2, 0.2] range for the transformed number of households.
StandardScaler
therefore cannot guarantee balanced feature scales in the presence of outliers.
- 收敛速度
- 不同属性列的数据可比性
- 不太适用outliers情况
- 不适用稀疏数据
make_plot(1)
MinMaxScaler
MinMaxScaler
rescales the data set such that all feature values are in the range [0, 1] as shown in the right panel below. However, this scaling compress all inliers in the narrow range [0, 0.005] for the transformed number of households.
As StandardScaler
, MinMaxScaler
is very sensitive to the presence of outliers.
- 保留了结构,可用于稀疏数据
make_plot(2)
MaxAbsScaler
MaxAbsScaler
differs from the previous scaler such that the absolute values are mapped in the range [0, 1]. On positive only data, this scaler behaves similarly to MinMaxScaler
and therefore also suffers from the presence of large outliers.
- 保留了结构,可用于稀疏数据
make_plot(3)
RobustScaler
Unlike the previous scalers, the centering and scaling statistics of this scaler are based on percentiles and are therefore not influenced by a few number of very large marginal outliers. Consequently, the resulting range of the transformed feature values is larger than for the previous scalers and, more importantly, are approximately similar: for both features most of the transformed values lie in a [-2, 3] range as seen in the zoomed-in figure. Note that the outliers themselves are still present in the transformed data. If a separate outlier clipping is desirable, a non-linear transformation is required (see below).
- 使outlier点保留了离群特征。
make_plot(4)
PowerTransformer
PowerTransformer
applies a power transformation to each feature to make the data more Gaussian-like. Currently, PowerTransformer
implements the Yeo-Johnson and Box-Cox transforms. The power transform finds the optimal scaling factor to stabilize variance and mimimize skewness through maximum likelihood estimation. By default, PowerTransformer
also applies zero-mean, unit variance normalization to the transformed output. Note that Box-Cox can only be applied to strictly positive data. Income and number of households happen to be strictly positive, but if negative values are present the Yeo-Johnson transformed is to be preferred.
make_plot(5)
make_plot(6)
QuantileTransformer (Gaussian output)
QuantileTransformer
has an additional output_distribution
parameter allowing to match a Gaussian distribution instead of a uniform distribution. Note that this non-parametetric transformer introduces saturation artifacts for extreme values.
make_plot(7)
QuantileTransformer (uniform output)
QuantileTransformer
applies a non-linear transformation such that the probability density function of each feature will be mapped to a uniform distribution. In this case, all the data will be mapped in the range [0, 1], even the outliers which cannot be distinguished anymore from the inliers.
As RobustScaler
, QuantileTransformer
is robust to outliers in the sense that adding or removing outliers in the training set will yield approximately the same transformation on held out data. But contrary to RobustScaler
, QuantileTransformer
will also automatically collapse any outlier by setting them to the a priori defined range boundaries (0 and 1).
make_plot(8)
Normalizer
The Normalizer
rescales the vector for each sample to have unit norm, independently of the distribution of the samples. It can be seen on both figures below where all samples are mapped onto the unit circle. In our example the two selected features have only positive values; therefore the transformed data only lie in the positive quadrant. This would not be the case if some original features had a mix of positive and negative values.
- 经常在文本分类和聚类当中使用
make_plot(9) plt.show()
[Feature] Compare the effect of different scalers的更多相关文章
- [ML] Feature Transformers
方案选择可参考:[Scikit-learn] 4.3 Preprocessing data 代码示范可参考:[ML] Pyspark ML tutorial for beginners 本篇涉及:Fe ...
- sklearn中的数据预处理----good!! 标准化 归一化 在何时使用
RESCALING attribute data to values to scale the range in [0, 1] or [−1, 1] is useful for the optimiz ...
- 改变BootStrap主题颜色
摘自:http://www.asp.net/visual-studio/overview/2013/creating-web-projects-in-visual-studio#bootstrap Y ...
- aspNet各种模块介绍
For browsers that do not support HTML5, you can use Modernizr. Modernizr is an open-source JavaScrip ...
- Get Started with the Google Fonts API
Get Started with the Google Fonts API This guide explains how to use the Google Fonts API to add fon ...
- A successful Git branching model——经典篇
A successful Git branching model In this post I present the development model that I’ve introduced f ...
- Jackson 工具类使用及配置指南
目录 前言 Jackson使用工具类 Jackson配置属性 Jackson解析JSON数据 Jackson序列化Java对象 前言 Json数据格式这两年发展的很快,其声称相对XML格式有很对好处: ...
- Jackson工具类使用及配置指南、高性能配置(转)
Jackson使用工具类 通常,我们对JSON格式的数据,只会进行解析和封装两种,也就是JSON字符串--->Java对象以及Java对象--->JSON字符串. public class ...
- Create the Project
https://docs.microsoft.com/en-us/aspnet/web-forms/overview/getting-started/getting-started-with-aspn ...
随机推荐
- asp.net 设计音乐网站
第一步 收集资料 http://www.logoko.com.cn/ --设计logo网站 设计音乐文档 https://wenku.baidu.com/view/3d957617f18583 ...
- 蓝牙RSSI转距离计算工具
RSSI是无线接收的信号强度指示,如WIFI.BLE.ZigBee.接收到的RSSI的强弱与发射点与接收点的距离有一定的关系,故可以依据RSSI进行粗略的定位计算,如苹果的iBeacon. 其中用到最 ...
- Linux上jdk,mysql,tomcat安装
一:RPM(红帽软件包管理器):相当于windows的添加/卸载程序(控制面板),进行程序的安装.更新.卸载.查看: 本地程序安装:rpm -ivh 程序名 本地程序查看:rpm -qa 本地程序卸载 ...
- 团队第三次作业:Alpha版本第二周小结
姓名 学号 周前计划安排 每周实际工作记录 自我打分 XXX 061109 1.对原型设计与编码任务进行进一步的规划与任务分配 2.协调与统一已完成的部分原型设计页面风格并针对部分页面提出了改进建议 ...
- Java字节流read函数
问题引入 做Java作业从标准输入流获取用户输入,用到了System.in.read(),然后出现了bug. //随机生成一个小写字母,用户猜5次,读取用户输入,并判断是否猜对 import java ...
- redis cluster 介绍
介绍 1. cluster的作用 (1)自动将数据进行分片,每个master上放一部分数据 (2)提供内置的高可用支持,部分master不可用时,还是可以继续工作的 2. redis集群实现方案 关于 ...
- elementUI 上传文件图片大小加了限制后 仍然上传了
https://blog.csdn.net/chanlingmai5374/article/details/80558444 看了这位老哥的说法 在看看文档 才发现自己没认真看文档 <el-u ...
- SDL 小例子
以下利用SDL播放网络流,需要自己配置运行环境,包括SDL和FFmpeg // ConsoleApplication2.cpp : 定义控制台应用程序的入口点. // /* #include &quo ...
- 用CSS实现梯形图标
遇到需要实现如下图标 由图形分析,梯形,平行四边形等都可以由矩形变形而来. 而想要实现梯形,需要进行3D变换,需要使用css3的 perspective属性. 属性 perspective指定了观察者 ...
- 小米oj 反向位整数(简单位运算)
反向位整数 序号:#30难度:一般时间限制:1000ms内存限制:10M 描述 输入32位无符号整数,输出它的反向位. 例,输入4626149(以二进制表示为00000000010001101001 ...