Feature selection
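This article describes how to perform feature selection with sklearn.
The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators' accuracy scores or to boost their performance on very high-dimensional datasets.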

Contents

1 Basic usage of SelectFromModel
2 Different feature selection methods with SelectFromModel
- 2.1 L1-based feature selection
- 2.2 Tree-based feature selection
3 References

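SelectFromModel is a meta-transformer that selects features based on importance weights. It can be used with any estimator that exposes a coef_ or feature_importances_ attribute after fitting. Features whose corresponding coef_ or feature_importances_ values fall below the provided threshold parameter are considered unimportant and removed. Besides a numeric value, the threshold can be given as a string: "mean", "median", or either of these scaled by a float, such as "0.1*mean". In combination with the threshold criterion, the max_features parameter limits the number of features selected.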
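The string thresholds described above can be tried side by side. The following is a minimal sketch, not part of the original article: it assumes the iris dataset and a LogisticRegression estimator with max_iter=1000 purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Illustrative data choice (assumption): iris, 150 samples x 4 features
X, y = load_iris(return_X_y=True)

# Try the string forms of `threshold`
thresholds = {}
n_selected = {}
for threshold in ("mean", "median", "0.1*mean"):
    selector = SelectFromModel(
        LogisticRegression(max_iter=1000), threshold=threshold
    ).fit(X, y)
    thresholds[threshold] = selector.threshold_
    n_selected[threshold] = int(selector.get_support().sum())
    print(threshold, selector.threshold_, n_selected[threshold])

# Combine a threshold with a cap on the number of selected features
capped = SelectFromModel(
    LogisticRegression(max_iter=1000), threshold="0.1*mean", max_features=2
).fit(X, y)
print(capped.transform(X).shape)
```

A lower threshold can only keep the same features or more, while max_features caps the count regardless of how many features pass the threshold.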

# Show the output of every expression in a notebook cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import warnings
warnings.filterwarnings("ignore")

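1 Basic usage of SelectFromModel

The main parameters are:

threshold: the threshold used for feature selection. Features whose importance is greater than or equal to it are kept; the others are discarded. If set to "mean" or "median", the threshold is the mean or median of the feature importances. If None and the estimator's penalty parameter is set to l1, explicitly or implicitly (e.g. Lasso), the threshold is 1e-5. Otherwise "mean" is used by default.
prefit: whether an already fitted model is expected to be passed directly to the constructor.
norm_order: the order of the norm used to filter the coefficient vectors below threshold when the estimator's coef_ attribute is 2-dimensional.
max_features: the maximum number of features to select.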
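To make norm_order concrete, here is a small sketch under stated assumptions (iris data and a multiclass LogisticRegression, neither of which appears in the original). With three classes coef_ is 2-dimensional, and the per-feature importance should be the norm of each column of coef_; the sketch recomputes that with NumPy and compares it with the fitted threshold_.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Illustrative data choice (assumption): 3 classes, so coef_ has shape (3, 4)
X, y = load_iris(return_X_y=True)

selector = SelectFromModel(
    LogisticRegression(max_iter=1000), threshold="mean", norm_order=1
).fit(X, y)

coef = selector.estimator_.coef_
# Per-feature importance: L1 norm of each column of coef_
importance = np.linalg.norm(coef, ord=1, axis=0)
print("importance per feature:", importance)
print("threshold_ (mean of importances):", selector.threshold_)
print("kept:", selector.get_support())
```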

Basic usage is as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]

# Build the selector; fit() also fits the wrapped estimator
selector = SelectFromModel(estimator=LogisticRegression()).fit(X, y)
# Coefficients of the fitted estimator
print("Estimator coefficients", selector.estimator_.coef_)
# Threshold derived from the mean of the estimator's feature importances
print("Threshold used for feature selection:", selector.threshold_)
# Which features are kept; True means selected
print("Feature kept", selector.get_support())
# Final result with only the selected features
print("Selected features", selector.transform(X))

Estimator coefficients [[-0.32857694  0.83411609  0.46668853]]
Threshold used for feature selection: 0.5431271870420733
Feature kept [False  True False]
Selected features [[-1.34]
 [-0.02]
 [-0.48]
 [ 1.48]]

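The next example selects the two most important features from the diabetes dataset without knowing the threshold in advance.

SelectFromModel combined with a Lasso regression model can pick the best features from the diabetes dataset. Since the L1 norm promotes sparsity of features, we may be interested in keeping only a subset of the most important features. This example shows how to select the two most important features from the diabetes dataset.

The diabetes dataset consists of 10 variables (features) collected from 442 diabetes patients. This example shows how to use SelectFromModel and LassoCV to find the best two features for predicting disease progression one year after baseline.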

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

First, let us load the diabetes dataset available in sklearn and look at which features were collected for the diabetes patients:

diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
feature_names = diabetes.feature_names
print(feature_names)

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

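To determine feature importance, we use the LassoCV estimator. The features with the highest absolute coef_ value are considered the most important. For an explanation of coef_ in sklearn, see: https://www.jianshu.com/p/6a818b53a37e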

clf = LassoCV().fit(X, y)
# Absolute values of the model coefficients
importance = np.abs(clf.coef_)
print("Importance of the ten features:", importance)

Importance of the ten features: [  0.         226.2375274  526.85738059 314.44026013 196.92164002
   1.48742026 151.78054083 106.52846989 530.58541123  64.50588257]

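Next we filter the features down to those with the highest scores. We want to select the two most important features. SelectFromModel() allows setting a threshold; only the features whose coef_ is higher than the threshold are kept. Here, we set the threshold slightly above the third-highest coef_ computed by LassoCV() from our data.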

# Rank the importances and set the threshold just above the third highest
idx_third = importance.argsort()[-3]
threshold = importance[idx_third] + 0.01
print('Threshold:', threshold)

# Names of the two most important features, in descending order of importance
idx_features = (-importance).argsort()[:2]
name_features = np.array(feature_names)[idx_features]
print('Two most important features: {}'.format(name_features))

sfm = SelectFromModel(clf, threshold=threshold)
sfm.fit(X, y)
# X_transform holds the selected features
X_transform = sfm.transform(X)
# Number of selected features
n_features = X_transform.shape[1]

Threshold: 314.4502601292063
Two most important features: ['s5' 'bmi']
SelectFromModel(estimator=LassoCV(alphas=None, copy_X=True, cv='warn', eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=None, normalize=False,
    positive=False, precompute='auto', random_state=None,
    selection='cyclic', tol=0.0001, verbose=False),
        max_features=None, norm_order=1, prefit=False,
        threshold=314.4502601292063)
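As an alternative to computing a threshold by hand, the max_features parameter can do the ranking. The sketch below is an addition, not the article's method: it sets threshold=-np.inf so that the threshold never filters anything and max_features=2 alone decides which features are kept.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# threshold=-np.inf disables the threshold; only the top 2 features are kept
sfm = SelectFromModel(LassoCV(), threshold=-np.inf, max_features=2).fit(X, y)

# Names of the kept features, in original column order
selected = np.array(diabetes.feature_names)[sfm.get_support()]
X_top2 = sfm.transform(X)
print(selected, X_top2.shape)
```

The exact coefficients depend on the sklearn version and the cross-validation defaults, so the selected names are printed rather than assumed.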

Finally, we plot the two features selected from the data.

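# transform() keeps columns in their original order, so take the axis labels
# from the support mask rather than from the importance ranking
selected_names = np.array(feature_names)[sfm.get_support()]

plt.title(
    "Features from diabetes using SelectFromModel with "
    "threshold %0.3f." % sfm.threshold)
feature1 = X_transform[:, 0]
feature2 = X_transform[:, 1]
plt.plot(feature1, feature2, 'r.')
plt.xlabel("First feature: {}".format(selected_names[0]))
plt.ylabel("Second feature: {}".format(selected_names[1]))
plt.ylim([np.min(feature2), np.max(feature2)])
plt.show()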

2 Different feature selection methods with SelectFromModel

2.1 L1-based feature selection

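Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When the goal is to reduce the dimensionality of the data for use with another classifier, they can be used together with SelectFromModel to keep only the features with non-zero coefficients. L1 regularization adds the L1 norm of the coefficients to the loss function as a penalty term. Because the penalty grows with the absolute value of every coefficient, minimizing the loss pushes the coefficients of weak features to exactly 0. As a result, L1 regularization tends to produce sparse models (many coefficients w are 0), which makes it a commonly used feature selection method. The sparse estimators particularly useful for this purpose are linear_model.Lasso for regression, and linear_model.LogisticRegression and svm.LinearSVC for classification.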

A simple example:

from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
X.shape
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
X_new.shape

(150, 4)
(150, 3)
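Since the stated goal is to reduce dimensionality before handing the data to another classifier, the selector can also be placed in a Pipeline. This is a hedged sketch rather than part of the original article: the RandomForestClassifier as the downstream model and the step names are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Step 1: L1-penalized LinearSVC picks the features (not prefit, so the
#         pipeline fits it); step 2: a second model trains on those features.
clf = Pipeline([
    ("feature_selection",
     SelectFromModel(LinearSVC(C=0.01, penalty="l1", dual=False))),
    ("classification", RandomForestClassifier(random_state=0)),
])
clf.fit(X, y)

n_kept = int(clf.named_steps["feature_selection"].get_support().sum())
train_score = clf.score(X, y)
print("features kept:", n_kept, "training accuracy:", train_score)
```

Wrapping the selector in the pipeline means that feature selection is refit on each training split when the pipeline is cross-validated, which avoids leaking information from held-out data.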

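With SVMs and logistic regression, the parameter C controls the sparsity: the smaller C, the fewer features are selected. With Lasso, the higher the alpha parameter, the fewer features are selected.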
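The effect of C can be observed by refitting the same model over a few values. This is a minimal sketch; the grid of C values and max_iter=10000 are arbitrary choices for illustration, and the counts are printed rather than assumed because they depend on the solver and the sklearn version.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Number of features kept for each value of C (hypothetical grid)
counts = {}
for C in (0.001, 0.01, 0.1, 1.0):
    lsvc = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000).fit(X, y)
    # With an l1 penalty the default threshold is 1e-5, so only features
    # with (near) non-zero coefficients are kept
    selector = SelectFromModel(lsvc, prefit=True)
    counts[C] = int(selector.get_support().sum())

for C, n in counts.items():
    print("C=%g -> %d features kept" % (C, n))
```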

2.2 Tree-based feature selection

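Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features (when coupled with the sklearn.feature_selection.SelectFromModel meta-transformer):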

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
X.shape
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
print("Feature importances", clf.feature_importances_)
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
# Number of features kept
X_new.shape

(150, 4)
Feature importances [0.07293638 0.06131363 0.44503667 0.42071333]
(150, 2)
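The tree-based example above has no random_state, so the importances and possibly the number of kept features can change from run to run. The sketch below is an illustrative variation, not the article's code: it fixes random_state=0 and uses threshold="median" instead of the default mean.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Fixed seed for repeatable importances; keep features at or above the median
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    threshold="median",
).fit(X, y)

X_new = selector.transform(X)
print("importances:", selector.estimator_.feature_importances_)
print("threshold_:", selector.threshold_, "shape:", X_new.shape)
```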

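The following example shows how to use a forest of trees (here extremely randomized trees, ExtraTreesClassifier) to evaluate feature importance on an artificial classification task. The red bars show the feature importances of the forest, and the error bars show their standard deviation across trees. As expected, the plot suggests that 3 features are informative while the remaining ones are not.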

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task with 10 features, only 3 of them informative
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Build a forest and compute the impurity-based feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)
forest.fit(X, y)
# Importance of each feature
importances = forest.feature_importances_
# Standard deviation of the importances across the individual trees
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the impurity-based feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

Feature ranking:
1. feature 1 (0.295902)
2. feature 2 (0.208351)
3. feature 0 (0.177632)
4. feature 3 (0.047121)
5. feature 6 (0.046303)
6. feature 8 (0.046013)
7. feature 7 (0.045575)
8. feature 4 (0.044614)
9. feature 9 (0.044577)
10. feature 5 (0.043912)
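The ranking can be turned into an actual selection by wrapping the fitted forest in SelectFromModel. This is a sketch added for illustration, reusing the same synthetic data and seeds. Because the importances sum to 1 over 10 features, the "mean" threshold is 0.1, which should separate the three informative columns from the rest given the ranking printed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Same synthetic task as above: with shuffle=False the 3 informative
# features are the first 3 columns
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           random_state=0, shuffle=False)
forest = ExtraTreesClassifier(n_estimators=250, random_state=0).fit(X, y)

# Reuse the fitted forest; keep features whose importance is >= the mean
selector = SelectFromModel(forest, prefit=True, threshold="mean")
selected = np.flatnonzero(selector.get_support())
print("selected column indices:", selected)
```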

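The following example uses a forest of trees to evaluate the importance of each pixel in an image classification task (faces). The hotter a pixel, the more important it is for classifying the faces.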

from time import time
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import ExtraTreesClassifier

# Number of cores to use to perform parallel fitting of the forest model
n_jobs = 1

# Load the faces dataset
data = fetch_olivetti_faces()
X, y = data.data, data.target
mask = y < 5  # Limit to 5 classes
X = X[mask]
y = y[mask]

# Build a forest and compute the pixel importances
print("Fitting ExtraTreesClassifier on faces data with %d cores..." % n_jobs)
t0 = time()
forest = ExtraTreesClassifier(n_estimators=1000,
                              max_features=128,
                              n_jobs=n_jobs,
                              random_state=0)
forest.fit(X, y)
print("done in %0.3fs" % (time() - t0))
# Importance of each pixel, reshaped back to the image dimensions
importances = forest.feature_importances_
importances = importances.reshape(data.images[0].shape)

# Plot pixel importances
plt.matshow(importances, cmap=plt.cm.hot)
plt.title("Pixel importances with forests of trees")
plt.show()

Fitting ExtraTreesClassifier on faces data with 1 cores...
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features=128, max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
done in 1.018s
<matplotlib.image.AxesImage at 0x7feaf2f1f4d0>
Text(0.5,1.05,'Pixel importances with forests of trees')

3 References

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel

https://scikit-learn.org/stable/auto_examples/feature_selection/plot_select_from_model_diabetes.html#sphx-glr-auto-examples-feature-selection-plot-select-from-model-diabetes-py

https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py

https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances_faces.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-faces-py
