  1. General Parameters: Guide the overall functioning
  2. Booster Parameters: Guide the individual booster (tree/regression) at each step
  3. Learning Task Parameters: Guide the optimization performed

general parameters

  1. booster [default=gbtree] (type of base learner)

    • Select the type of model to run at each iteration. It has 2 options:

      • gbtree: tree-based models
      • gblinear: linear models
  2. silent [default=0]:
    • Silent mode is activated if set to 1, i.e. no running messages will be printed.
    • It’s generally good to keep it at 0, as the messages might help in understanding the model.
  3. nthread [defaults to the maximum number of threads available if not set]
    • This is used for parallel processing; enter the number of cores in the system that should be used.
    • If you wish to run on all cores, leave the value unset and the algorithm will detect the core count automatically.
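
As a minimal sketch, the general parameters above can be collected in a plain dictionary before being handed to XGBoost. The variable names and the "half of the cores" policy are assumptions made for illustration only; note also that recent XGBoost releases replace silent with verbosity.

```python
import os

# General parameters, named as in the list above.
general_params = {
    "booster": "gbtree",  # tree-based models; "gblinear" selects linear models
    "silent": 0,          # 0 keeps the running messages (newer releases use "verbosity")
}
# nthread is deliberately left out, so all available cores are detected automatically.

# Assumption for illustration: limit the run to half of the available cores.
available_cores = os.cpu_count() or 1
limited_params = dict(general_params, nthread=max(1, available_cores // 2))

print(limited_params)
```

Leaving nthread out of the first dictionary corresponds to the "run on all cores" case described above.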

booster parameters

Though there are 2 types of boosters, I’ll consider only the tree booster here because it usually outperforms the linear booster, and thus the latter is rarely used.

  1. eta [default=0.3] (learning rate; a very important parameter)

    • Analogous to learning rate in GBM
    • Makes the model more robust by shrinking the weights on each step
    • Typical final values to be used: 0.01-0.2
  2. min_child_weight [default=1] (controls overfitting; too large a value leads to underfitting)
    • Defines the minimum sum of weights of all observations required in a child. (In XGBoost, the weight of an observation here is its second-order gradient, i.e. its Hessian.)
    • This is similar to min_samples_leaf in GBM but not exactly. This refers to the minimum “sum of weights” of observations, while GBM has a minimum “number of observations”.
    • Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
    • Values that are too high can lead to under-fitting; hence, it should be tuned using CV.
  3. max_depth [default=6] (maximum depth; generally, when features are added, the depth should be smaller)
    • The maximum depth of a tree, same as GBM.
    • Used to control over-fitting, as a higher depth will allow the model to learn relations very specific to a particular sample.
    • Should be tuned using CV.
    • Typical values: 3-10
  4. max_leaf_nodes
    • The maximum number of terminal nodes or leaves in a tree.
    • Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
    • If this is defined, GBM will ignore max_depth.
  5. gamma [default=0] (minimum loss reduction required for a split)
    • A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
    • Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
  6. max_delta_step [default=0]
    • The maximum delta step we allow for each tree’s weight estimation. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative.
    • Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced.
    • This is generally not used but you can explore further if you wish.
  7. subsample [default=1] (fraction of rows sampled; guards against overfitting)
    • Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
    • Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
    • Typical values: 0.5-1
  8. colsample_bytree [default=1] (fraction of features sampled; guards against overfitting)
    • Similar to max_features in GBM. Denotes the fraction of columns to be randomly sampled for each tree.
    • Typical values: 0.5-1
  9. colsample_bylevel [default=1]
    • Denotes the subsample ratio of columns for each split, in each level.
    • I don’t use this often because subsample and colsample_bytree will do the job for you, but you can explore it further if you wish.
  10. lambda [default=1] (L2 regularization coefficient)
    • L2 regularization term on weights (analogous to Ridge regression)
    • This is used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.
  11. alpha [default=0] (L1 regularization coefficient)
    • L1 regularization term on weight (analogous to Lasso regression)
    • Can be used in case of very high dimensionality so that the algorithm runs faster when implemented
  12. scale_pos_weight [default=1] (weight of the positive class; with log loss, the default of 1 is fine)
    • A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence.
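
To make the roles of eta, lambda, gamma and min_child_weight more concrete, here is a minimal pure-Python sketch of the standard second-order formulas for logistic loss. It uses the per-observation gradient g = p - y and Hessian h = p(1 - p), the leaf weight -G / (H + lambda) scaled by eta, and the split gain with gamma subtracted. The function names are made up for illustration, alpha and max_delta_step are left out, and this is a simplification rather than XGBoost's actual implementation.

```python
def logistic_grad_hess(y_true, y_prob):
    """Per-observation gradient and Hessian of the logistic loss."""
    grad = [p - y for y, p in zip(y_true, y_prob)]
    hess = [p * (1.0 - p) for p in y_prob]
    return grad, hess

def leaf_weight(G, H, reg_lambda=1.0, eta=0.3):
    """Leaf value: the optimal weight -G / (H + lambda), shrunk by eta."""
    return -eta * G / (H + reg_lambda)

def split_gain(GL, HL, GR, HR, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split; it is only worth making when this is positive."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

def children_heavy_enough(HL, HR, min_child_weight=1.0):
    """min_child_weight compares the Hessian sum of each child, not the row count."""
    return HL >= min_child_weight and HR >= min_child_weight

# Four observations, all starting from a predicted probability of 0.5.
y = [0, 0, 1, 1]
p = [0.5, 0.5, 0.5, 0.5]
g, h = logistic_grad_hess(y, p)

# Candidate split: first two rows go left, last two go right.
GL, HL = sum(g[:2]), sum(h[:2])
GR, HR = sum(g[2:]), sum(h[2:])

gain = split_gain(GL, HL, GR, HR)                  # gamma = 0
gain_gamma1 = split_gain(GL, HL, GR, HR, gamma=1.0)

print("gain with gamma=0:", gain)
print("gain with gamma=1:", gain_gamma1)
print("left leaf value:", leaf_weight(GL, HL))
print("passes min_child_weight=1:", children_heavy_enough(HL, HR, 1.0))
print("passes min_child_weight=0.5:", children_heavy_enough(HL, HR, 0.5))
```

In this toy case each child holds two observations, yet its Hessian sum is only 0.5, so the default min_child_weight of 1 would block the split. That is the difference between a "sum of weights" and a "number of observations". Raising gamma to 1 makes the gain negative, which is how gamma makes the algorithm more conservative.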

learning task parameters

These parameters are used to define the optimization objective and the metric to be calculated at each step.

  1. objective [default=reg:linear] (loss function)

    • This defines the loss function to be minimized. Mostly used values are:

      • binary:logistic – logistic regression for binary classification, returns predicted probability (not class)
      • multi:softmax – multiclass classification using the softmax objective, returns predicted class (not probabilities)
        • You also need to set an additional num_class (number of classes) parameter defining the number of unique classes
      • multi:softprob – same as softmax, but returns the predicted probability of each data point belonging to each class.
  2. eval_metric [ default according to objective ]
    • The metric to be used for validation data.
    • The default values are rmse for regression and error for classification.
    • Typical values are:
      • rmse – root mean square error
      • mae – mean absolute error
      • logloss – negative log-likelihood
      • error – Binary classification error rate (0.5 threshold)
      • merror – Multiclass classification error rate
      • mlogloss – Multiclass logloss
      • auc – area under the curve
  3. seed [default=0] (random seed; you can specify several different seeds, train different models, and then ensemble them)
    • The random number seed.
    • Can be used for generating reproducible results and also for parameter tuning.
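
For reference, the three binary-classification metrics listed above can be written out in a few lines of plain Python. This is only a sketch of the textbook definitions (the function names are invented for illustration), not the code XGBoost runs internally.

```python
import math

def error_rate(y_true, y_prob, threshold=0.5):
    """error: share of rows whose thresholded prediction differs from the label."""
    wrong = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) != (y == 1))
    return wrong / len(y_true)

def logloss(y_true, y_prob, eps=1e-15):
    """logloss: mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def auc(y_true, y_prob):
    """auc: probability that a random positive is scored above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]

print("error  :", error_rate(labels, probs))
print("logloss:", logloss(labels, probs))
print("auc    :", auc(labels, probs))
```

The pairwise AUC above is quadratic in the number of rows, which is fine for a toy example but not for real data sets.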

General Approach for Parameter Tuning

  1. Choose a relatively high learning rate. Generally a learning rate of 0.1 works, but anywhere between 0.05 and 0.3 can work depending on the problem. Determine the optimum number of trees for this learning rate. XGBoost has a very useful function called “cv” which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required.
  2. Tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree) for the chosen learning rate and number of trees. Note that we can choose different parameters to define a tree, and I’ll take up an example here.
  3. Tune the regularization parameters (lambda, alpha) for xgboost, which can help reduce model complexity and enhance performance.
  4. Lower the learning rate and decide on the optimal parameters.

In short: start with a relatively large learning rate, then use xgboost's built-in cv to tune the number of trees, next tune the tree-related parameters (depth, minimum child weight and so on), and finally tune the learning rate.
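
The idea behind picking the number of trees from per-round cross-validation scores can be sketched in plain Python as an early-stopping rule: keep the best round seen so far and stop once the score has not improved for a fixed number of rounds. The function name and the score list are invented for illustration (the numbers are not from a real run), and the logic is a simplification rather than xgboost's own code.

```python
def best_num_rounds(scores, early_stopping_rounds, maximize=True):
    """Return (number of rounds to keep, best score) under an early-stopping rule."""
    best_round, best_score = 0, None
    for i, score in enumerate(scores):
        if best_score is None:
            improved = True
        elif maximize:
            improved = score > best_score
        else:
            improved = score < best_score

        if improved:
            best_round, best_score = i, score
        elif i - best_round >= early_stopping_rounds:
            break  # no improvement for early_stopping_rounds rounds in a row
    return best_round + 1, best_score

# Made-up mean validation AUC per boosting round, for illustration only.
cv_auc = [0.80, 0.82, 0.83, 0.835, 0.834, 0.833, 0.836, 0.835, 0.834, 0.833, 0.832]

n_trees, score = best_num_rounds(cv_auc, early_stopping_rounds=3)
print(n_trees, score)
```

With a lower learning rate the curve typically peaks later, which is why the number of trees has to be re-estimated whenever the learning rate is changed.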

import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# xgb is the direct xgboost library.
# XGBClassifier is an sklearn wrapper for XGBoost. This allows us to use
# sklearn's Grid Search with parallel processing.

train = pd.read_csv("train_modified.csv")
target = "Disbursed"
IDcol = "ID"

def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        # Use xgboost's built-in cv to find the number of trees for the current learning rate.
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='auc', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data.
    alg.fit(dtrain[predictors], dtrain[target])

    # Predict on the training set.
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:, 1]

    # Print the model report.
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain[target].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain[target], dtrain_predprob))

predictors = [x for x in train.columns if x not in [target, IDcol]]

xgb1 = XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=5, min_child_weight=1, gamma=0, subsample=0.8,
                     colsample_bytree=0.8, objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27)

modelfit(xgb1, train, predictors)

As the figure above shows, with learning_rate = 0.1, n_estimators = 120 is the optimal number of trees.

References:

https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

http://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters

https://github.com/dmlc/xgboost/tree/master/demo/guide-python

http://xgboost.readthedocs.io/en/latest/python/python_api.html
