

Gradient boosting decision tree (GBDT)

1. Main idea

The main idea behind GBDT is to combine many simple models (also known as weak learners), such as shallow trees. Each tree can only provide good predictions on part of the data, so more and more trees are added to iteratively improve performance.
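
The idea of adding shallow trees one after another can be sketched by hand for a regression problem. This is only an illustration, not how scikit-learn implements GBDT internally: the synthetic sine data, the squared-error loss, and the names `learning_rate`, `n_trees` and `mse_history` are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# ASSUMPTION: synthetic 1-D regression data, chosen only for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # how strongly each tree corrects the current model
n_trees = 50         # how many shallow trees to add

# Start from a constant prediction: the mean of the targets.
prediction = np.full_like(y, y.mean())
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_trees):
    # For squared error, the residual is the negative gradient of the loss
    # (up to a constant factor), so each new tree is fit to the residuals.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)
    # Add a damped version of the new tree's correction.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE with 0 trees: {mse_history[0]:.4f}")
print(f"training MSE with {n_trees} trees: {mse_history[-1]:.4f}")
```

Each tree on its own is a poor model, but the training error never goes up as trees are added, because every tree is fit to whatever the previous trees still get wrong.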

2. Parameter settings

The algorithm is somewhat more sensitive to parameter settings than random forests, but can provide better accuracy if the parameters are set correctly.

  • number of trees

    Increasing n_estimators also increases the model complexity, as the model has more chances to correct mistakes on the training set.
  • learning rate

    Controls how strongly each tree tries to correct the mistakes of the previous trees. A higher learning rate means each tree can make stronger corrections, allowing for more complex models.
  • max_depth

    Or alternatively max_leaf_nodes. Usually max_depth is set very low for gradient-boosted models, often not deeper than five splits.
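
To get a feel for these parameters, one option is to fit the same model with a few different settings and compare training and test accuracy. The sketch below assumes the breast cancer dataset from scikit-learn and a default train/test split. The three settings and the names `settings` and `results` are arbitrary choices for illustration, not recommended values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# ASSUMPTION: breast cancer data with a default 75/25 split.
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Hypothetical settings to compare; each dict overrides the defaults.
settings = {
    "defaults": {},
    "max_depth=1": {"max_depth": 1},
    "learning_rate=0.01": {"learning_rate": 0.01},
}

results = {}
for name, params in settings.items():
    model = GradientBoostingClassifier(random_state=0, **params)
    model.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each setting.
    results[name] = (model.score(X_train, y_train),
                     model.score(X_test, y_test))
    print(f"{name:>20}: train={results[name][0]:.3f} test={results[name][1]:.3f}")
```

A large gap between training and test accuracy hints at overfitting. Lowering max_depth or learning_rate is a common way to reduce model complexity in that case.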


from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

In [261]: X_train,X_test,y_train,y_test=train_test_split(,,random_state=0)
...: gbrt=GradientBoostingClassifier(random_state=0)
...: gbrt.score(X_test,y_test)
Out[261]: 0.958041958041958

In [262]: gbrt.feature_importances_
array([0.01337291, 0.04201687, 0.0208666 , 0.01889077, 0.01028091,
0.03215986, 0.02074619, 0.11678956, 0.00820024, 0.00074312,
0.02042134, 0.00680047, 0.02023052, 0.03907398, 0.05406751,
0.04795741, 0.02358101, 0.00934718, 0.00593481, 0.0239241 ,
0.05354265, 0.06160083, 0.10961728, 0.07395201, 0.01867851,
0.03842953, 0.01915824, 0.07128703, 0.01773659, 0.00059199])

In [263]: gbrt.learning_rate
Out[263]: 0.1

In [264]: gbrt.max_depth
Out[264]: 3

In [265]: len(gbrt.estimators_)
Out[266]: 100

In [272]: gbrt.get_params()
{'criterion': 'friedman_mse',
'init': None,
'learning_rate': 0.1,
'loss': 'deviance',
'max_depth': 3,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'presort': 'auto',
'random_state': 0,
'subsample': 1.0,
'verbose': 0,
'warm_start': False}
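
The session above omits the data loading and the fit step, so here is a self-contained sketch of what it might have looked like. It assumes the split was made on `cancer.data` and `cancer.target` from `load_breast_cancer()` and that `gbrt.fit(X_train, y_train)` was called before scoring. Neither appears in the transcript, and the variable names `cancer` and `order` are introduced here. Parameter names such as `presort` and `min_impurity_split` in the listing above come from an older scikit-learn release, so `get_params()` on a newer version will likely show a different set.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# ASSUMPTION: the original session used the breast cancer dataset.
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)  # ASSUMPTION: this call is missing from the transcript
print("test accuracy:", gbrt.score(X_test, y_test))

# Pair each importance with its feature name and show the five largest.
order = np.argsort(gbrt.feature_importances_)[::-1]
for i in order[:5]:
    print(f"{cancer.feature_names[i]:>25}: {gbrt.feature_importances_[i]:.4f}")

# The defaults reported in the session: 100 trees, learning rate 0.1, depth 3.
print(len(gbrt.estimators_), gbrt.learning_rate, gbrt.max_depth)
```

The feature importances sum to 1, so each value can be read as the share of the model's splitting gain attributed to that feature.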

Random forest

In [230]: y
array([1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0,
0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0], dtype=int64)

In [231]: axes.ravel()
array([<matplotlib.axes._subplots.AxesSubplot object at 0x000001F46F3694A8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001F46C099F28>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001F46E6E3BE0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001F46BEB72E8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001F46ED67198>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001F46F292C88>],
dtype=object)

In [232]: from sklearn.model_selection import train_test_split

In [233]: X_trai,X_test,y_train,y_test=train_test_split(X,y,stratify=y,random_state=42)

In [234]: len(X_trai)
Out[234]: 75

In [235]: fores=RandomForestClassifier(n_estimators=5,random_state=2)

In [236]:,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
oob_score=False, random_state=2, verbose=0, warm_start=False)

In [237]: fores.score(X_test,y_test)
Out[237]: 0.92

In [238]: fores.estimators_
[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1872583848, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=794921487, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=111352301, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1853453896, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=213298710, splitter='best')]
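
The random forest session can be approximated with the self-contained sketch below. The transcript does not show where `X` and `y` come from, so the `make_moons` call (100 two-class samples) is an assumption, as are the variable names `forest` and `tree_scores`. The split and the forest settings follow the session.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# ASSUMPTION: synthetic two-class data; the original X and y are not shown.
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)
print("training samples:", len(X_train))

# Five trees, as in the session; each tree gets its own random_state.
forest = RandomForestClassifier(n_estimators=5, random_state=2)
forest.fit(X_train, y_train)
print("forest test accuracy:", forest.score(X_test, y_test))

# Compare every individual tree with the forest as a whole.
tree_scores = []
for i, tree in enumerate(forest.estimators_):
    acc = (tree.predict(X_test) == y_test).mean()
    tree_scores.append(acc)
    print(f"tree {i} test accuracy: {acc:.2f}")
```

Unlike the boosted trees above, these trees are grown to full depth (max_depth=None) and built independently on bootstrap samples. Their predictions are then averaged rather than added as successive corrections.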


