gbdt和xgboost中feature importance的获取
来源于stack overflow,其实就是计算每个特征对于降低特征不纯度的贡献了多少,降低越多的,说明feature越重要
I'll use the sklearn code, as it is generally much cleaner than the R
code.
Here's the implementation of the feature_importances property of the GradientBoostingClassifier (I removed some lines of code that get in the way of the conceptual stuff)
def feature_importances_(self):
total_sum = np.zeros((self.n_features, ), dtype=np.float64)
for stage in self.estimators_:
stage_sum = sum(tree.feature_importances_
for tree in stage) / len(stage)
total_sum += stage_sum
importances = total_sum / len(self.estimators_)
return importances
This is pretty easy to understand. self.estimators_
is an array containing the individual trees in the booster, so the for loop is iterating over the individual trees. There's one hickup with the
stage_sum = sum(tree.feature_importances_
for tree in stage) / len(stage)
this is taking care of the non-binary response case. Here we fit multiple trees in each stage in a one-vs-all way. Its simplest conceptually to focus on the binary case, where the sum has one summand, and this is just tree.feature_importances_
. So in the binary case, we can rewrite this all as
def feature_importances_(self):
total_sum = np.zeros((self.n_features, ), dtype=np.float64)
for tree in self.estimators_:
total_sum += tree.feature_importances_
importances = total_sum / len(self.estimators_)
return importances
So, in words, sum up the feature importances of the individual trees, then divide by the total number of trees. It remains to see how to calculate the feature importances for a single tree.
The importance calculation of a tree is implemented at the cython level, but it's still followable. Here's a cleaned up version of the code
cpdef compute_feature_importances(self, normalize=True):
"""Computes the importance of each feature (aka variable)."""
while node != end_node:
if node.left_child != _TREE_LEAF:
# ... and node.right_child != _TREE_LEAF:
left = &nodes[node.left_child]
right = &nodes[node.right_child]
importance_data[node.feature] += (
node.weighted_n_node_samples * node.impurity -
left.weighted_n_node_samples * left.impurity -
right.weighted_n_node_samples * right.impurity)
node += 1
importances /= nodes[0].weighted_n_node_samples
return importances
This is pretty simple. Iterate through the nodes of the tree. As long as you are not at a leaf node, calculate the weighted reduction in node purity from the split at this node, and attribute it to the feature that was split on
importance_data[node.feature] += (
node.weighted_n_node_samples * node.impurity -
left.weighted_n_node_samples * left.impurity -
right.weighted_n_node_samples * right.impurity)
Then, when done, divide it all by the total weight of the data (in most cases, the number of observations)
importances /= nodes[0].weighted_n_node_samples
It's worth recalling that the impurity is a common metric to use when determining what split to make when growing a tree. In that light, we are simply summing up how much splitting on each feature allowed us to reduce the impurity across all the splits in the tree.
gbdt和xgboost中feature importance的获取的更多相关文章
- arcgisJs之featureLayer中feature的获取
arcgisJs之featureLayer中feature的获取 在featureLayer中source可以获取到一个Graphic数组,但是这个数组属于原数据数组.当使用 applyEdits修改 ...
- XGBoost中参数调整的完整指南(包含Python中的代码)
(搬运)XGBoost中参数调整的完整指南(包含Python中的代码) AARSHAY JAIN, 2016年3月1日 介绍 如果事情不适合预测建模,请使用XGboost.XGBoost算法已 ...
- 一步一步理解GB、GBDT、xgboost
GBDT和xgboost在竞赛和工业界使用都非常频繁,能有效的应用到分类.回归.排序问题,虽然使用起来不难,但是要能完整的理解还是有一点麻烦的.本文尝试一步一步梳理GB.GBDT.xgboost,它们 ...
- GBDT,Adaboosting概念区分 GBDT与xgboost区别
http://blog.csdn.net/w28971023/article/details/8240756 ============================================= ...
- 机器学习(八)—GBDT 与 XGBOOST
RF.GBDT和XGBoost都属于集成学习(Ensemble Learning),集成学习的目的是通过结合多个基学习器的预测结果来改善单个学习器的泛化能力和鲁棒性. 根据个体学习器的生成方式,目前 ...
- GB、GBDT、XGboost理解
GBDT和xgboost在竞赛和工业界使用都非常频繁,能有效的应用到分类.回归.排序问题,虽然使用起来不难,但是要能完整的理解还是有一点麻烦的.本文尝试一步一步梳理GB.GBDT.xgboost,它们 ...
- 提升学习算法简述:AdaBoost, GBDT和XGBoost
1. 历史及演进 提升学习算法,又常常被称为Boosting,其主要思想是集成多个弱分类器,然后线性组合成为强分类器.为什么弱分类算法可以通过线性组合形成强分类算法?其实这是有一定的理论基础的.198 ...
- 机器学习总结(一) Adaboost,GBDT和XGboost算法
一: 提升方法概述 提升方法是一种常用的统计学习方法,其实就是将多个弱学习器提升(boost)为一个强学习器的算法.其工作机制是通过一个弱学习算法,从初始训练集中训练出一个弱学习器,再根据弱学习器的表 ...
- 机器学习算法总结(四)——GBDT与XGBOOST
Boosting方法实际上是采用加法模型与前向分布算法.在上一篇提到的Adaboost算法也可以用加法模型和前向分布算法来表示.以决策树为基学习器的提升方法称为提升树(Boosting Tree).对 ...
随机推荐
- Eclipse开发Java代码,如何添加智能提示
选择:Window->Preferences->JAVA->Editor->Context Assist 在Auto activation triggers for Java处 ...
- 【CF666E】Forensic Examination(后缀自动机,线段树合并)
[CF666E]Forensic Examination(后缀自动机,线段树合并) 题面 洛谷 CF 翻译: 给定一个串\(S\)和若干个串\(T_i\) 每次询问\(S[pl..pr]\)在\(T_ ...
- P3932 浮游大陆的68号岛 【线段树】
P3932 浮游大陆的68号岛 有一天小妖精们又在做游戏.这个游戏是这样的. 妖精仓库的储物点可以看做在一个数轴上.每一个储物点会有一些东西,同时他们之间存在距离. 每次他们会选出一个小妖精,然后剩下 ...
- pycrypto 安装
https://www.dlitz.net/software/pycrypto/ 下载pycrypto-2.6.1.tar.gz,解压后 python setup.py build python se ...
- struts2初探(一)
首先需要了解Struts2框架的运行过程: request从发送到服务器,即tomcat,然后tomcat参考web.xml,发现所有的url都需要经过struts2的过滤, Struts2调用dof ...
- EA画时序图初试
1.步骤: 1. 新建一个项目: 2. Use Case Model右键-->添加图-->左边选择UML Behavioral,右边选择Sequence: 3. 选择工具栏中的工具,点击工 ...
- 手脱PE Pack v1.0
1.PEID查壳 PE Pack v1.0 2.载入OD,一上来就这架势,先F8走着 > / je ; //入口点 -\E9 C49D0000 jmp Pepack_1.0040D000 004 ...
- JavaScript 字符串操作:substring, substr, slice
在 JavaScript 中,对于字符串的操作有 substring, substr, slice 等好多个内置函数,这里给大家推荐一篇介绍 substring, substr, slice 三者区别 ...
- 使用CSS3创建文字颜色渐变(CSS3 Text Gradient)
考虑一下,如何在网页中达到类似以下文字渐变的效果? 传统的实现中,是用一副透明渐变的图片覆盖在文字上.具体实现方式可参考 http://www.qianduan.net/css-gradient-te ...
- TreeMap put 操作分析
public V put(K key, V value) { //t 表示当前节点,记住这个很重要!先把TreeMap 的根节点root 的引用赋值给当前节点 TreeMap.Entry<K,V ...