
Early Stopping

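Early stopping is a very common technique for preventing overfitting when a machine learning model is trained iteratively. Wikipedia describes it as follows: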

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent.


The XGBoost Python documentation for the early stopping parameters is very clear. The API looks like this:

# code snippets from xgboost python-package training.py
def train(..., evals=(), early_stopping_rounds=None):
    """Train a booster with given parameters.

    early_stopping_rounds: int
        Activates early stopping. Validation error needs to decrease at least
        every <early_stopping_rounds> round(s) to continue training.
    """
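The stopping rule in that docstring can be sketched in a few lines of plain Python. This is only an illustration of the rule, not XGBoost's code: the function name `early_stopping_round` and the sample error curve are made up for the example.

```python
from typing import Sequence, Tuple


def early_stopping_round(val_errors: Sequence[float],
                         early_stopping_rounds: int) -> Tuple[int, int]:
    """Return (best_round, stop_round) for a sequence of validation errors.

    Training stops at the first round where the validation error has not
    improved for `early_stopping_rounds` consecutive rounds. If that never
    happens, stop_round is the index of the last round.
    """
    best_round, best_error = 0, float("inf")
    stop_round = len(val_errors) - 1
    for i, err in enumerate(val_errors):
        if err < best_error:
            # New best validation error: remember where it happened.
            best_round, best_error = i, err
        elif i - best_round >= early_stopping_rounds:
            # No improvement for long enough: stop here.
            stop_round = i
            break
    return best_round, stop_round


# Hypothetical validation error curve, one value per boosting round.
errors = [0.50, 0.40, 0.35, 0.36, 0.37, 0.34, 0.38, 0.39, 0.40, 0.41]
print(early_stopping_round(errors, early_stopping_rounds=3))  # (5, 8)
```

A larger `early_stopping_rounds` tolerates longer plateaus: with a value of 2 the same curve would stop at round 4 and miss the better error at round 5.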

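Sklearn's GBDT implementation also supports early stopping, but it is more involved. There is no official documentation or code sample for it, so you have to read the source. You need to supply a monitor callback and understand the locals of the internal _fit_stages function. In short, it is not friendly to newcomers: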

# code snippets from sklearn.ensemble.gradient_boosting
class BaseGradientBoosting(six.with_metaclass(ABCMeta, BaseEnsemble, ...)):
    """Abstract base class for Gradient Boosting. """

    def fit(self, X, y, sample_weight=None, monitor=None):
        """Fit the gradient boosting model.

        monitor : callable, optional
            The monitor is called after each iteration with the current
            iteration, a reference to the estimator and the local variables of
            ``_fit_stages`` as keyword arguments ``callable(i, self,
            locals())``. If the callable returns ``True`` the fitting procedure
            is stopped. The monitor can be used for various things such as
            computing held-out estimates, early stopping, model introspect, and
            ...
        """

If you are interested in the Sklearn route, see the article "Using Gradient Boosting (with Early Stopping)", which contains a reference implementation of the monitor callback.
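As a rough idea of what such a callback could look like, here is a minimal sketch of a callable with the `monitor(i, estimator, local_vars)` shape described in the docstring. It is not the implementation from that article. `EarlyStoppingMonitor`, `validation_loss` and `patience` are hypothetical names, and how to compute the held-out loss for stage `i` is deliberately left to the user-supplied function.

```python
from typing import Any, Callable, Dict


class EarlyStoppingMonitor:
    """Callable matching the monitor(i, estimator, local_vars) signature.

    `validation_loss` is a hypothetical user-supplied function returning
    the held-out loss of `estimator` after boosting stage `i`.
    """

    def __init__(self, validation_loss: Callable[[Any, int], float],
                 patience: int) -> None:
        self.validation_loss = validation_loss
        self.patience = patience
        self.best_loss = float("inf")
        self.best_iteration = -1

    def __call__(self, i: int, estimator: Any,
                 local_vars: Dict[str, Any]) -> bool:
        loss = self.validation_loss(estimator, i)
        if loss < self.best_loss:
            self.best_loss, self.best_iteration = loss, i
            return False
        # Returning True asks the fitting procedure to stop.
        return i - self.best_iteration >= self.patience


# Stand-in for a training loop: feed the monitor a fixed loss curve.
curve = [0.9, 0.7, 0.6, 0.65, 0.66, 0.67, 0.5]
monitor = EarlyStoppingMonitor(lambda est, i: curve[i], patience=3)
stopped_at = next(i for i in range(len(curve)) if monitor(i, None, {}))
print(stopped_at, monitor.best_iteration)  # 5 2
```

With Sklearn the object would be passed as `fit(X, y, monitor=monitor)`. Keeping the loss computation behind a separate function means the stopping rule itself does not depend on the contents of the `_fit_stages` locals.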





The parameter ν can be regarded as controlling the learning rate of the boosting procedure.
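In the usual notation (an assumption here, since the symbols are not defined in this text), with $F_m$ the ensemble after $m$ rounds and $h_m$ the tree fitted in round $m$, shrinkage scales each tree's contribution by ν before adding it:

```latex
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad 0 < \nu \le 1
```

Setting ν = 1 recovers plain boosting; smaller values make each step more conservative, so more trees are typically needed.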



# code snippets from sklearn.ensemble.gradient_boosting
class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
    """Gradient Boosting for classification."""

    def __init__(self, ..., learning_rate=0.1, n_estimators=100, ...):
        """
        learning_rate : float, optional (default=0.1)
            learning rate shrinks the contribution of each tree by `learning_rate`.
            There is a trade-off between learning_rate and n_estimators.
        n_estimators : int (default=100)
            The number of boosting stages to perform. Gradient boosting
            is fairly robust to over-fitting so a large number usually
            results in better performance
        """
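The trade-off between `learning_rate` and `n_estimators` can be seen on a toy problem. The sketch below is a deliberately tiny boosting loop for squared loss with regression stumps on one feature, written from scratch for illustration. `fit_stump` and `boost` are hypothetical helpers, not Sklearn functions, and the data is synthetic.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, left_value, right_value)


def fit_stump(x: List[float], r: List[float]) -> Stump:
    """Least-squares regression stump fitted to the residuals r."""
    best = (float("inf"), 0.0, 0.0, 0.0)
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right) if right else 0.0
        sse = (sum((ri - lv) ** 2 for ri in left)
               + sum((ri - rv) ** 2 for ri in right))
        if sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]


def boost(x: List[float], y: List[float],
          n_estimators: int, learning_rate: float) -> List[float]:
    """Return the training MSE after each boosting round (squared loss)."""
    pred = [0.0] * len(y)
    history = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lv, rv = fit_stump(x, residuals)
        # Shrinkage: add only a fraction of the new stump's prediction.
        pred = [pi + learning_rate * (lv if xi <= t else rv)
                for xi, pi in zip(x, pred)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return history


x = [i / 10 for i in range(20)]
y = [xi ** 2 for xi in x]
fast = boost(x, y, n_estimators=10, learning_rate=1.0)
slow = boost(x, y, n_estimators=10, learning_rate=0.1)
print(fast[-1], slow[-1])
```

After the same 10 rounds the run with `learning_rate=0.1` still has a higher training error than the run with `learning_rate=1.0`, so a smaller learning rate has to be paired with more estimators. This toy only tracks training error; the regularizing benefit of shrinkage shows up on held-out data.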


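Subsampling comes from the idea of bootstrap averaging (bagging). In GBDT, each round's tree is built on a fraction η of the training set, drawn at random without replacement; a typical value of η is 0.5. This both regularizes the model and reduces computation time.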
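A minimal sketch of the per-round row sampling, using only the standard library. The helper name `subsample_indices` is made up for this example, and flooring the sample size with a minimum of one row is an assumption of the sketch rather than any library's rule.

```python
import random
from typing import List


def subsample_indices(n_samples: int, eta: float,
                      rng: random.Random) -> List[int]:
    """Draw a fraction eta of row indices uniformly, without replacement."""
    k = max(1, int(eta * n_samples))
    return rng.sample(range(n_samples), k)


rng = random.Random(0)
for round_id in range(3):
    rows = subsample_indices(n_samples=10, eta=0.5, rng=rng)
    # The tree for this round would be fitted on these rows only.
    print(round_id, sorted(rows))
```

Each round draws a fresh subset, so different trees see different halves of the data.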

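In fact, both the XGBoost and Sklearn implementations borrow from random forests: besides sampling at the instance level, they also sample features. That is, when a tree is built, the best split is searched for only among a randomly chosen subset of the feature columns. Below is a snippet of the relevant parameter settings in Sklearn: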

# code snippets from sklearn.ensemble.gradient_boosting
class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
    """Gradient Boosting for classification."""

    def __init__(self, ..., subsample=1.0, max_features=None, ...):
        """
        subsample : float, optional (default=1.0)
            The fraction of samples to be used for fitting the individual base
            learners. If smaller than 1.0 this results in Stochastic Gradient
            Boosting. `subsample` interacts with the parameter `n_estimators`.
            Choosing `subsample < 1.0` leads to a reduction of variance
            and an increase in bias.
        max_features : int, float, string or None, optional (default=None)
            The number of features to consider when looking for the best split:
            ...
        """
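Feature sampling during split search can be sketched as follows. This is an illustration of the idea only: `best_split_feature` and the per-feature `gain` function are hypothetical, and `max_features` is restricted to an integer count here, unlike the Sklearn parameter of the same name.

```python
import random
from typing import Callable, List, Tuple


def best_split_feature(n_features: int, max_features: int,
                       gain: Callable[[int], float],
                       rng: random.Random) -> Tuple[int, List[int]]:
    """Pick the highest-gain feature among a random subset of columns."""
    candidates = rng.sample(range(n_features), max_features)
    return max(candidates, key=gain), candidates


# Hypothetical split gains, one per feature column.
gains = [0.10, 0.80, 0.30, 0.05, 0.60]
rng = random.Random(0)
chosen, candidates = best_split_feature(
    n_features=5, max_features=2, gain=lambda j: gains[j], rng=rng)
print(chosen, candidates)
```

Because only the sampled columns are scored, the globally best feature is sometimes skipped, which decorrelates the trees and also saves the cost of scoring every column.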

Regularized Learning Objective
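For reference, the regularized objective in the XGBoost paper (Chen and Guestrin, 2016) adds a complexity penalty for every tree to the training loss. Here $l$ is the loss, $f_k$ the $k$-th tree, $T$ its number of leaves and $w$ its vector of leaf weights:

```latex
\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2
```

The term $\gamma T$ penalizes trees with many leaves, and the $\lambda$ term shrinks the leaf weights toward zero.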








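Dropout is a widely used regularization technique in deep learning, so it is natural to ask whether it can be applied to GBDT models as well. An AISTATS 2015 paper, "DART: Dropouts meet Multiple Additive Regression Trees", made some attempts in this direction.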


Trees added at later iterations tend to impact the prediction of only a few instances, and they make negligible contribution towards the prediction of all the remaining instances. We call this issue of subsequent trees affecting the prediction of only a small fraction of the training instances over-specialization.



DART diverges from MART at two places. First, when computing the gradient that the next tree will fit, only a random subset of the existing ensemble is considered. The second place at which DART diverges from MART is when adding the new tree to the ensemble where DART performs a normalization step.
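The two steps can be sketched as one boosting round. This is a simplified illustration, not the paper's or XGBoost's code: `dart_round` and `fit_tree` are hypothetical names, trees are plain functions, and the scaling factors (new tree by 1/(k+1), dropped trees by k/(k+1) when k trees are dropped) are my reading of the normalization step in the paper.

```python
import random
from typing import Callable, List

Tree = Callable[[float], float]


def dart_round(trees: List[Tree], weights: List[float], drop_rate: float,
               fit_tree: Callable[[Tree], Tree],
               rng: random.Random) -> int:
    """Run one DART round in place; return the number of dropped trees."""
    # Step 1: drop each existing tree independently with probability drop_rate.
    dropped = {j for j in range(len(trees)) if rng.random() < drop_rate}
    kept = [j for j in range(len(trees)) if j not in dropped]

    def partial_model(x: float) -> float:
        return sum(weights[j] * trees[j](x) for j in kept)

    # The new tree is fitted against the ensemble without the dropped trees.
    new_tree = fit_tree(partial_model)

    # Step 2: normalization (assumed factors, see the lead-in).
    k = len(dropped)
    for j in dropped:
        weights[j] *= k / (k + 1)
    trees.append(new_tree)
    weights.append(1.0 / (k + 1))
    return k


TARGET = 3.0  # toy regression target, the same for every input


def fit_constant_tree(model: Tree) -> Tree:
    """Hypothetical learner: a constant tree equal to the current residual."""
    residual = TARGET - model(0.0)
    return lambda x: residual


trees: List[Tree] = [lambda x: 1.0, lambda x: 2.0]
weights = [1.0, 1.0]
k = dart_round(trees, weights, drop_rate=1.0,
               fit_tree=fit_constant_tree, rng=random.Random(0))
prediction = sum(w * t(0.0) for w, t in zip(weights, trees))
print(k, weights, prediction)
```

The new tree is trained to make up for the dropped trees, so adding it at full weight next to them would overshoot; rescaling both keeps their combined contribution at roughly the level the dropped trees had before.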



