Obtaining feature importance in GBDT and XGBoost
(From a Stack Overflow answer.) The idea is to measure how much each feature contributes to reducing node impurity across the trees: the more a feature's splits reduce impurity, the more important the feature.
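Before walking through the implementation, here is a minimal sketch (toy data; the parameter values are arbitrary) of reading the impurity-based importances off a fitted sklearn model:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy binary-classification problem, purely for illustration.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# One importance score per feature; the normalized scores sum to 1.
print(clf.feature_importances_)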
I'll use the sklearn code, as it is generally much cleaner than the R code.
Here's the implementation of the feature_importances_ property of GradientBoostingClassifier (I removed some lines of code that get in the way of the conceptual stuff):
def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for stage in self.estimators_:
        stage_sum = sum(tree.feature_importances_
                        for tree in stage) / len(stage)
        total_sum += stage_sum
    importances = total_sum / len(self.estimators_)
    return importances
This is pretty easy to understand. self.estimators_ is an array containing the individual boosting stages, so the for loop is iterating over the stages, each of which holds one or more trees. There's one hiccup with the
stage_sum = sum(tree.feature_importances_
                for tree in stage) / len(stage)
This is taking care of the non-binary response case. Here we fit multiple trees in each stage in a one-vs-all way. It's simplest conceptually to focus on the binary case, where the sum has one summand, and this is just tree.feature_importances_.
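To make the one-vs-all point concrete, a quick sketch (toy data; shapes are for sklearn's GradientBoostingClassifier) showing that each stage holds one tree in the binary case and one tree per class otherwise:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Binary target: estimators_ has shape (n_estimators, 1).
Xb, yb = make_classification(n_classes=2, random_state=0)
print(GradientBoostingClassifier(n_estimators=10).fit(Xb, yb).estimators_.shape)  # (10, 1)

# Three-class target: estimators_ has shape (n_estimators, 3).
Xm, ym = make_classification(n_classes=3, n_informative=4, random_state=0)
print(GradientBoostingClassifier(n_estimators=10).fit(Xm, ym).estimators_.shape)  # (10, 3)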
So in the binary case, we can rewrite this all as:
def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for tree in self.estimators_:
        total_sum += tree.feature_importances_
    importances = total_sum / len(self.estimators_)
    return importances
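As a sanity check, here is a sketch that reproduces this averaging by hand on a fitted binary classifier (toy data; exact equality holds when the property is implemented as quoted — newer sklearn releases normalize the average slightly differently, so treat it as illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

total = np.zeros(X.shape[1])
for stage in clf.estimators_:  # binary case: each stage holds a single tree
    total += sum(tree.feature_importances_ for tree in stage) / len(stage)
manual = total / len(clf.estimators_)

# True when the property is implemented exactly as quoted above.
print(np.allclose(manual, clf.feature_importances_))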
So, in words, sum up the feature importances of the individual trees, then divide by the total number of trees. It remains to see how to calculate the feature importances for a single tree.
The importance calculation for a single tree is implemented at the Cython level, but it's still followable. Here's a cleaned-up version of the code:
cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    # (declarations and setup elided in this cleaned-up version; node
    # starts at the root and end_node is one past the last node in the
    # tree's flat node array)
    while node != end_node:
        if node.left_child != _TREE_LEAF:
            # ... and node.right_child != _TREE_LEAF:
            left = &nodes[node.left_child]
            right = &nodes[node.right_child]

            importance_data[node.feature] += (
                node.weighted_n_node_samples * node.impurity -
                left.weighted_n_node_samples * left.impurity -
                right.weighted_n_node_samples * right.impurity)
        node += 1

    importances /= nodes[0].weighted_n_node_samples

    return importances
This is pretty simple. Iterate through the nodes of the tree. As long as you are not at a leaf node, calculate the weighted reduction in node impurity from the split at this node, and attribute it to the feature that was split on:
importance_data[node.feature] += (
    node.weighted_n_node_samples * node.impurity -
    left.weighted_n_node_samples * left.impurity -
    right.weighted_n_node_samples * right.impurity)
Then, when done, divide it all by the total weight of the data (in most cases, the number of observations):
importances /= nodes[0].weighted_n_node_samples
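The same arithmetic can be reproduced from Python using a fitted tree's public tree_ arrays. This is a sketch under that assumption (the helper name is mine, not sklearn's); normalize=True mirrors the Cython default, which rescales the importances to sum to one:

import numpy as np

def tree_feature_importances(fitted_tree, normalize=True):
    # Sketch: re-derive a single tree's impurity-based importances
    # from the flat node arrays exposed by sklearn's tree_ attribute.
    t = fitted_tree.tree_
    importances = np.zeros(t.n_features, dtype=np.float64)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # _TREE_LEAF is -1: leaves contribute nothing
            continue
        importances[t.feature[node]] += (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right])
    importances /= t.weighted_n_node_samples[0]  # total weight at the root
    if normalize and importances.sum() > 0.0:
        importances /= importances.sum()
    return importances

For a tree from the booster fitted earlier, tree_feature_importances(clf.estimators_[0, 0]) should agree with clf.estimators_[0, 0].feature_importances_.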
It's worth recalling that the impurity is a common metric to use when determining what split to make when growing a tree. In that light, we are simply summing up how much splitting on each feature allowed us to reduce the impurity across all the splits in the tree.
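XGBoost exposes the same idea through its importance_type options; 'gain' is the closest analogue of the impurity-reduction importance above, while 'weight' simply counts how often a feature is split on. A sketch, assuming the xgboost Python package and toy data:

import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=50)

# Keys are feature names (f0, f1, ...); features never split on are omitted.
print(booster.get_score(importance_type="gain"))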