XGBoost usage and parameter tuning


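XGBoost tuning approach

General steps for XGBoost parameter tuning
1. Choose initial values for the learning rate and the boosting parameters
2. Tune max_depth and min_child_weight
3. Tune gamma
4. Tune subsample and colsample_bytree
5. Tune the regularization parameter alpha
6. Lower the learning rate and use more trees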

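1. min_child_weight [default 1]
The minimum sum of instance weights required in a child node. It is similar to the min_samples_leaf parameter of GBM, but not identical: the XGBoost parameter is a minimum sum of instance weights, while the GBM parameter is a minimum number of samples. The parameter is used to avoid overfitting. Larger values keep the model from learning relations that are specific to a handful of samples, but values that are too high lead to underfitting, so it should be tuned with CV.

If a split would produce a leaf whose sum of instance weights is below min_child_weight, the split is abandoned and the node is not partitioned further. In a linear regression task this is simply the minimum number of instances required in each node. The larger the value, the more conservative the algorithm.
Range: [0,∞]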
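
To make "sum of instance weights" concrete: the weight of an instance here is its Hessian, the second derivative of the loss. The sketch below is plain Python, not XGBoost's implementation; it assumes the usual Hessians (1 per sample for squared error, p(1-p) for logistic loss), and the helper names are hypothetical.

```python
def logistic_hessians(probs):
    """Hessian of the logistic loss for each predicted probability p: p * (1 - p)."""
    return [p * (1.0 - p) for p in probs]


def child_allowed(hessians, min_child_weight=1.0):
    """A child node is kept only if its Hessian sum reaches min_child_weight."""
    return sum(hessians) >= min_child_weight


# Squared error: every sample has Hessian 1, so the sum is just the sample count.
three_samples = [1.0] * 3
print(child_allowed(three_samples, min_child_weight=5))

# Logistic loss with p near 0.01: each sample contributes about 0.0099,
# so roughly 100 samples are needed before the sum reaches 1.
rare_100 = logistic_hessians([0.01] * 100)
rare_102 = logistic_hessians([0.01] * 102)
print(child_allowed(rare_100), child_allowed(rare_102))
```

This is why the same min_child_weight value behaves very differently for regression and for an imbalanced classification task.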

2. eta [default 0.3]
Similar to the learning rate in GBM. Shrinking the weight of each step makes the model more robust. Typical values: 0.01-0.2

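3. colsample_bytree [default 1]
The column sampling rate, that is, the feature sampling rate.
Similar to max_features in GBM. It controls the fraction of columns randomly sampled for each tree (each column is a feature). Typical values: 0.5-1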

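4. subsample [default 1]
Same as the subsample parameter of GBM. It controls the fraction of training rows randomly sampled for each tree. Lower values make the algorithm more conservative and help avoid overfitting, but values that are too small may lead to underfitting. Typical values: 0.5-1
For example, with subsample set to 0.5, XGBoost randomly picks half of the samples as the training set for each tree.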
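
A small illustration of what the two sampling rates mean for a single tree. This is a sketch in plain Python under the assumption of sampling without replacement; sample_for_tree is a hypothetical helper, not an XGBoost function.

```python
import random


def sample_for_tree(n_rows, n_cols, subsample=1.0, colsample_bytree=1.0, seed=0):
    """Return the row and column indices one tree would be trained on."""
    rng = random.Random(seed)
    n_r = max(1, int(n_rows * subsample))         # number of rows kept
    n_c = max(1, int(n_cols * colsample_bytree))  # number of features kept
    rows = sorted(rng.sample(range(n_rows), n_r))
    cols = sorted(rng.sample(range(n_cols), n_c))
    return rows, cols


rows, cols = sample_for_tree(n_rows=10, n_cols=4,
                             subsample=0.5, colsample_bytree=0.5, seed=27)
print(len(rows), len(cols))  # 5 2
```

Each tree sees a different random subset, which is what decorrelates the trees and reduces overfitting.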

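5. alpha [default 0, alias: reg_alpha]
L1 regularization term on the weights (analogous to Lasso regression). It is useful with very high-dimensional data, where it can make the algorithm run faster.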

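6. lambda [default=1, alias: reg_lambda]
L2 regularization term on the weights. It controls the regularization part of XGBoost. Most data scientists rarely tune it, but it is worth exploring as a way to reduce overfitting.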
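
How the two regularization terms act on a single leaf can be sketched as follows. With G and H the sums of first and second derivatives in the leaf, the XGBoost paper gives the optimal leaf weight as -G / (H + lambda); the sketch assumes alpha enters through soft-thresholding of G, which is how an L1 penalty is usually handled. The function names are hypothetical.

```python
def soft_threshold(g, alpha):
    """Shrink g toward zero by alpha; values within [-alpha, alpha] become 0."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf weight given gradient sum G and Hessian sum H (assumed form)."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)


for lam, alpha in [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, 5.0)]:
    print(lam, alpha, leaf_weight(G=-4.0, H=2.0, reg_lambda=lam, reg_alpha=alpha))
```

A larger lambda shrinks every leaf weight smoothly, while a large enough alpha sets a leaf weight exactly to zero.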

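6. gamma [default 0, alias: min_split_loss]
A node is split only when the split reduces the loss. gamma specifies the minimum loss reduction required to make a split: the node is split only if the reduction is at least gamma. The larger the value, the more conservative the algorithm and the less likely it is to overfit, at a possible cost in accuracy, so the two need to be balanced. Suitable values depend on the loss function, so this parameter should be tuned.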
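
The loss reduction that gamma is compared against can be written out with the split gain formula from the XGBoost paper. A minimal sketch in plain Python; the function name and the example numbers are made up for illustration.

```python
def split_gain(G_L, H_L, G_R, H_R, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children, net of gamma.

    G_* and H_* are the sums of first and second derivatives of the loss
    over the samples falling into each child.
    """
    def score(G, H):
        return G * G / (H + reg_lambda)

    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma


# The same candidate split is kept with gamma = 0 and rejected with gamma = 6.
for g in (0.0, 6.0):
    gain = split_gain(G_L=-4.0, H_L=2.0, G_R=4.0, H_R=2.0, gamma=g)
    print(g, round(gain, 4), "split" if gain > 0 else "no split")
```

Because the gain is expressed in units of the loss, a sensible gamma for one objective is not necessarily sensible for another.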

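7. eval_metric [default depends on the objective]
The evaluation metric computed on the validation data. For regression the default is rmse; for classification the default is error (since XGBoost 1.3.0 the default for binary:logistic is logloss). Typical values:
rmse: root mean squared error
mae: mean absolute error
logloss: negative log-likelihood
error: binary classification error rate (threshold 0.5)
merror: multiclass classification error rate
mlogloss: multiclass logloss
auc: area under the curve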
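
For reference, several of these metrics are short enough to write out directly. The functions below are plain-Python sketches of the textbook definitions, not XGBoost code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def logloss(y_true, p_pred, eps=1e-15):
    """Negative log-likelihood for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


def error(y_true, p_pred, threshold=0.5):
    """Binary error rate: predictions above the threshold count as class 1."""
    wrong = sum(1 for t, p in zip(y_true, p_pred) if int(p > threshold) != t)
    return wrong / len(y_true)


labels = [0, 1, 1, 0]
probs = [0.2, 0.8, 0.4, 0.1]
print(error(labels, probs), logloss(labels, probs))
```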

8. scale_pos_weight [default 1]
When the classes are highly imbalanced, setting this parameter to a suitable positive value helps the algorithm converge faster.
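
A common starting point, suggested in the XGBoost parameter documentation, is the ratio of negative to positive instances. A minimal sketch, assuming 0/1 labels; the helper name is hypothetical.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) instances."""
    pos = sum(1 for y in labels if y == 1)
    neg = sum(1 for y in labels if y == 0)
    if pos == 0:
        raise ValueError("no positive instances")
    return neg / pos


labels = [0] * 90 + [1] * 10
print(suggested_scale_pos_weight(labels))  # 9.0
```

The value is only a starting point and is still worth checking with cross-validation.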

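9. num_boost_round
Learning rate and number of boosting rounds
XGBoost improves the model iteratively: each round adds a tree fitted to the gradient of the loss of the current model, so it concentrates on the samples the model still predicts poorly. You therefore need to set both the number of rounds and how much each round is allowed to change the model (the learning rate).
The number of rounds is set with num_boost_round; more rounds take more time. Besides a fixed number of rounds, XGBoost can also stop when the evaluation metric has not improved for n rounds (see early_stopping_rounds below).
The learning rate is set with eta and determines how much each round changes the model. A higher learning rate converges faster, but its steps may be too coarse and overshoot the optimum.
Tune coarse to fine: start with a larger learning rate, for example 0.1-0.3, and few rounds, for example 50, so the model improves quickly without taking long. For fine tuning, lower the learning rate and increase the number of rounds.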
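
The trade-off between eta and the number of rounds can be seen in a deliberately idealized model in which every round removes a fraction eta of the remaining error. This is a toy sketch, not how a real tree ensemble behaves, and rounds_needed is a hypothetical helper.

```python
def rounds_needed(eta, initial_residual=1.0, tol=0.01):
    """Rounds until the remaining error drops to tol, if each round removes a fraction eta."""
    residual = initial_residual
    rounds = 0
    while abs(residual) > tol:
        residual -= eta * residual  # shrunken update
        rounds += 1
    return rounds


for eta in (0.3, 0.1, 0.01):
    print(eta, rounds_needed(eta))
```

Dividing the learning rate by ten multiplies the number of rounds by roughly ten, which is why a lower eta is paired with a larger num_boost_round.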

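10. early_stopping_rounds: stopping training early
If you have a validation set, training can be stopped early, which also finds the best number of boosting rounds. Early stopping requires at least one evaluation set in evals. If there is more than one, the last one is used.

train(..., evals=evals, early_stopping_rounds=10)

With this setting the model trains until the validation score stops improving: the validation error has to decrease at least once every early_stopping_rounds rounds for training to continue.

If early stopping occurs, the model has two additional attributes: bst.best_score and bst.best_iteration. Note that train() returns the model from the last iteration, not the best one.

This works both with metrics to minimize (RMSE, log loss, etc.) and with metrics to maximize (MAP, NDCG, AUC).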
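
The stopping rule itself is simple to state in code. The sketch below replays a list of per-round validation scores and reports where training would stop. It illustrates the rule described above and is not XGBoost's callback implementation; the function name and the scores are made up.

```python
def early_stop(scores, patience, maximize=False):
    """Return (best_iteration, best_score, last_iteration_run).

    Training stops once `patience` rounds pass without a new best score.
    """
    best_iter, best_score = 0, scores[0]
    for i in range(1, len(scores)):
        improved = scores[i] > best_score if maximize else scores[i] < best_score
        if improved:
            best_iter, best_score = i, scores[i]
        elif i - best_iter >= patience:
            return best_iter, best_score, i
    return best_iter, best_score, len(scores) - 1


# Validation error per round: the best value is at round 2, training stops at round 4.
print(early_stop([0.5, 0.4, 0.35, 0.36, 0.37, 0.38], patience=2))
# (2, 0.35, 4)
```

The gap between the best iteration and the last iteration is exactly why the model returned after early stopping is not the best one.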

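11. max_depth [default=6]

The maximum depth of a tree. The larger the value, the bigger the tree and the more complex the model, which makes overfitting more likely. Limiting the depth is therefore a way to control overfitting. Typical values: 3-10.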

https://blog.csdn.net/q383700092/article/details/53763328

The results after tuning are very good

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn import metrics
from xgboost import XGBClassifier

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

# 'eta' is an alias of 'learning_rate', so only learning_rate is searched
parameters = [{'learning_rate': [0.01, 0.1, 0.3],
               'n_estimators': [1000, 1200, 1500, 2000, 2500],
               'max_depth': range(1, 10, 1),
               'gamma': [0.01, 0.1, 0.3, 0.5]}]

clf = GridSearchCV(XGBClassifier(
             min_child_weight=1,
             subsample=0.6,
             colsample_bytree=0.6,
             objective='binary:logistic',  # logistic loss for binary classification
             scale_pos_weight=1,
             reg_alpha=0,
             reg_lambda=1,
             seed=27
            ),
            param_grid=parameters, scoring='roc_auc')
clf.fit(train_x, train_y)

# with scoring='roc_auc', clf.score() returns ROC AUC, not accuracy
print("accuracy on the training subset:{:.3f}".format(clf.score(train_x, train_y)))
print("accuracy on the test subset:{:.3f}".format(clf.score(test_x, test_y)))
print(clf.best_params_)

y_pre = clf.predict(test_x)
y_pro = clf.predict_proba(test_x)[:, 1]
print("AUC Score : %f" % metrics.roc_auc_score(test_y, y_pro))
print("Accuracy : %.4g" % metrics.accuracy_score(test_y, y_pre))

'''

accuracy on the training subset:1.000

accuracy on the test subset:0.998

{'gamma': 0.5, 'learning_rate': 0.01, 'max_depth': 2, 'n_estimators': 1000}

AUC Score : 0.998089

Accuracy : 0.9708

'''

# refit with the best parameters found by the grid search
best_xgb = XGBClassifier(min_child_weight=1,
             subsample=0.6,
             colsample_bytree=0.6,
             objective='binary:logistic',  # logistic loss for binary classification
             scale_pos_weight=1,
             reg_alpha=0,
             reg_lambda=1,
             seed=27, gamma=0.5, learning_rate=0.01, max_depth=2, n_estimators=1000)
best_xgb.fit(train_x, train_y)

print("accuracy on the training subset:{:.3f}".format(best_xgb.score(train_x, train_y)))
print("accuracy on the test subset:{:.3f}".format(best_xgb.score(test_x, test_y)))

'''

accuracy on the training subset:0.995

accuracy on the test subset:0.979

'''

GitHub: https://github.com/dmlc/xgboost
Paper: http://www.kaggle.com/blobs/download/forum-message-attachment-files/4087/xgboost-paper.pdf

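Basic idea and advantages

http://blog.csdn.net/q383700092/article/details/60954996
Reference: http://dataunion.org/15787.html
http://blog.csdn.net/china1000/article/details/51106856
In supervised learning we usually define an objective function and a prediction function. The parameters are learned by minimizing the objective on the training samples, and the prediction function with the learned parameters is then used to classify unseen samples or predict their values.
1. A boosting tree builds trees to fit the residuals. XGBoost additionally uses second derivatives in the optimization and measures model complexity by the number of leaf nodes and the L2 norm of the leaf weights. Together these define XGBoost's prediction function and objective function.
2. Split points are also chosen to minimize the objective function.
Advantages:
1. The complexity of the tree model is added explicitly to the objective as a regularization term.
2. The derivation uses second derivatives through a second-order Taylor expansion. (GBDT with Newton's method also appears to use second-order information.)
3. It implements an approximate algorithm for finding split points.
4. It exploits feature sparsity.
5. The data is sorted in advance and stored in blocks, which helps parallel computation.
6. It is built on the distributed communication framework rabit and can run on MPI and YARN. (The latest versions are no longer based on rabit.)
7. The implementation is optimized for the hardware architecture, with performance tuning for cache and memory.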

Derivation and differences from GBDT

http://blog.csdn.net/q383700092/article/details/60954996
Reference: http://dataunion.org/15787.html
https://www.zhihu.com/question/41354392

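Parameter reference

Reference: http://blog.csdn.net/han_xiaoyang/article/details/52665396
Parameters
booster: default gbtree, which usually works well (the linear booster is rarely used)
gbtree: tree-based model
gblinear: linear model
silent [default 0] (replaced by verbosity in newer releases)
nthread [default: the maximum number of threads available]
eta [default 0.3] learning rate. Typical values: 0.01-0.2
min_child_weight [default 1] minimum sum of instance weights in a child node. Larger values avoid overfitting. Values that are too high lead to underfitting
max_depth [default 6]
gamma [default 0] minimum loss reduction required to split a node. The larger the value, the more conservative the algorithm
subsample [default 1] fraction of rows randomly sampled for each tree. Lower values make the algorithm more conservative and avoid overfitting. Values that are too small lead to underfitting. Typical values: 0.5-1
colsample_bytree [default 1] fraction of columns randomly sampled for each tree
colsample_bylevel [default 1] fraction of columns sampled at each level of the tree
lambda [default 1] L2 regularization term on the weights
alpha [default 0] L1 regularization term on the weights
scale_pos_weight [default 1] when the classes are highly imbalanced, a positive value helps the algorithm converge faster
objective [default reg:squarederror, named reg:linear in older releases] the loss function to minimize
binary:logistic: logistic regression for binary classification, returns predicted probabilities (not classes)
multi:softmax: multiclass classification with softmax, returns the predicted class (not probabilities). It also requires num_class (the number of classes)
multi:softprob: same as multi:softmax, but returns the probability of each sample belonging to each class
eval_metric [default depends on the objective]
For regression the default is rmse; for classification the default is error. Typical values:
rmse: root mean squared error, mae: mean absolute error, logloss: negative log-likelihood
error: binary classification error rate, merror: multiclass error rate, mlogloss: multiclass logloss, auc: area under the curve
seed [default 0] random seed. Setting it makes results on random data reproducible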

Parameter names that change in the scikit-learn wrapper (XGBClassifier):
eta -> learning_rate
lambda -> reg_lambda
alpha -> reg_alpha
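
When moving a parameter dictionary from the native API to the scikit-learn wrapper, the renaming can be done mechanically. A minimal sketch that covers only the three names listed above; to_sklearn_params is a hypothetical helper, not part of xgboost.

```python
NATIVE_TO_SKLEARN = {
    'eta': 'learning_rate',
    'lambda': 'reg_lambda',
    'alpha': 'reg_alpha',
}


def to_sklearn_params(native_params):
    """Rename native xgboost parameter names to their XGBClassifier equivalents."""
    return {NATIVE_TO_SKLEARN.get(k, k): v for k, v in native_params.items()}


print(to_sklearn_params({'eta': 0.1, 'lambda': 2, 'max_depth': 3}))
# {'learning_rate': 0.1, 'reg_lambda': 2, 'max_depth': 3}
```

Names that are identical in both interfaces, such as max_depth, pass through unchanged.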

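Commonly tuned parameters:

Reference
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

Step 1: fix the learning rate and determine the number of estimators for tuning the tree-based parameters

Maximum tree depth, usually 3-10
max_depth = 5
Minimum loss reduction required to split a node. A small value such as 0.1 to 0.2 is also a reasonable start
gamma = 0
Sampling
subsample = 0.8,
colsample_bytree = 0.8
A relatively small value, suited to a highly imbalanced classification problem
min_child_weight = 1
Because the classes are highly imbalanced
scale_pos_weight = 1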

from xgboost import XGBClassifier

xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)

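Step 2: tune max_depth and min_child_weight

Grid search references
http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html
http://blog.csdn.net/abcjennifer/article/details/23884761
Grid search with scoring='roc_auc' only supports binary classification. For multiclass problems change scoring (the default scoring supports multiclass).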

# coarse grid
param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}
# finer grid around the coarse optimum
#param_test2 = {
#    'max_depth': [4, 5, 6],
#    'min_child_weight': [4, 5, 6]
#}

# sklearn.grid_search was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import GridSearchCV

gsearch1 = GridSearchCV(
    estimator=XGBClassifier(
        learning_rate=0.1,
        n_estimators=140, max_depth=5,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27),
    param_grid=param_test1,
    scoring='roc_auc',
    n_jobs=4,
    cv=5)  # the iid argument was removed in scikit-learn 0.24

# train, predictors and target come from the referenced guide:
# a DataFrame, the list of feature columns, and the label column
gsearch1.fit(train[predictors], train[target])
# grid_scores_ was removed in scikit-learn 0.20; cv_results_ holds the per-candidate scores
gsearch1.cv_results_['mean_test_score'], gsearch1.best_params_, gsearch1.best_score_

Step 3: tune gamma

param_test3 = {
    'gamma': [i / 10.0 for i in range(0, 5)]
}

gsearch3 = GridSearchCV(
    estimator=XGBClassifier(
        learning_rate=0.1,
        n_estimators=140,
        max_depth=4,
        min_child_weight=6,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27),
    param_grid=param_test3,
    scoring='roc_auc',
    n_jobs=4,
    cv=5)  # the iid argument was removed in scikit-learn 0.24

gsearch3.fit(train[predictors], train[target])
# cv_results_ replaces the removed grid_scores_ attribute
gsearch3.cv_results_['mean_test_score'], gsearch3.best_params_, gsearch3.best_score_

Step 4: tune subsample and colsample_bytree

# start with 0.6, 0.7, 0.8 and 0.9
param_test4 = {
    'subsample': [i / 10.0 for i in range(6, 10)],
    'colsample_bytree': [i / 10.0 for i in range(6, 10)]
}

gsearch4 = GridSearchCV(
    estimator=XGBClassifier(
        learning_rate=0.1,
        n_estimators=177,
        max_depth=3,
        min_child_weight=4,
        gamma=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27),
    param_grid=param_test4,
    scoring='roc_auc',
    n_jobs=4,
    cv=5)  # the iid argument was removed in scikit-learn 0.24

gsearch4.fit(train[predictors], train[target])
# cv_results_ replaces the removed grid_scores_ attribute
gsearch4.cv_results_['mean_test_score'], gsearch4.best_params_, gsearch4.best_score_

Step 5: tune the regularization parameters

param_test6 = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]
}

gsearch6 = GridSearchCV(
    estimator=XGBClassifier(
        learning_rate=0.1,
        n_estimators=177,
        max_depth=4,
        min_child_weight=6,
        gamma=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27),
    param_grid=param_test6,
    scoring='roc_auc',
    n_jobs=4,
    cv=5)  # the iid argument was removed in scikit-learn 0.24

gsearch6.fit(train[predictors], train[target])
# cv_results_ replaces the removed grid_scores_ attribute
gsearch6.cv_results_['mean_test_score'], gsearch6.best_params_, gsearch6.best_score_

Step 6: lower the learning rate

xgb4 = XGBClassifier(
    learning_rate=0.01,
    n_estimators=5000,
    max_depth=4,
    min_child_weight=6,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.005,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)

# modelfit is the helper function defined in the referenced guide; it is not defined in this post
modelfit(xgb4, train, predictors)

Python example

import xgboost as xgb
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# load the data
iris = load_iris()

# split the data set
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.33, random_state=42)

# set the parameters
m_class = xgb.XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='multi:softprob',  # iris has three classes, so binary:logistic does not apply
    nthread=4,
    seed=27)

# train
m_class.fit(X_train, y_train)
test_21 = m_class.predict(X_test)
print("Accuracy : %.2f" % metrics.accuracy_score(y_test, test_21))

# predicted probabilities
#test_2 = m_class.predict_proba(X_test)

# AUC can only be computed this way for a binary problem
#print("AUC Score (Train): %f" % metrics.roc_auc_score(y_test, test_2))

# feature importance; booster() was renamed get_booster() in newer xgboost releases
feat_imp = pd.Series(m_class.get_booster().get_fscore()).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importances')
plt.show()

# regression
#m_regress = xgb.XGBRegressor(n_estimators=1000,seed=0)
#m_regress.fit(X_train, y_train)
#test_1 = m_regress.predict(X_test)

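Summary

Native xgb API

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_hastie_10_2
import xgboost as xgb
import time

# record the running time
start_time = time.time()

X, y = make_hastie_10_2(random_state=0)
y = (y > 0).astype(int)  # the labels come as -1/+1; binary:logistic expects 0/1
# test_size is the fraction of the data held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# build the DMatrix objects
xgb_train = xgb.DMatrix(X_train, label=y_train)
xgb_test = xgb.DMatrix(X_test, label=y_test)

# parameters
params = {
    'booster': 'gbtree',
    # set explicitly: with the default regression objective the 0.5 threshold below is not meaningful
    'objective': 'binary:logistic',
    'verbosity': 0,  # replaces the older 'silent' option; 0 prints nothing, 1 prints warnings
    #'nthread': 7,  # number of CPU threads, defaults to the maximum
    'eta': 0.007,  # learning rate
    'min_child_weight': 3,
    # Default is 1. It is the minimum sum of h (the second derivative) required in each leaf.
    # For 0-1 classification with imbalanced classes, if h is around 0.01,
    # min_child_weight = 1 means a leaf must contain at least about 100 samples.
    # This parameter strongly affects the result; the smaller it is, the easier it is to overfit.
    'max_depth': 6,  # tree depth; larger values overfit more easily
    'gamma': 0.1,  # minimum loss reduction to split a leaf further; larger is more conservative, typically 0.1 or 0.2
    'subsample': 0.7,  # row subsampling ratio
    'colsample_bytree': 0.7,  # column subsampling ratio per tree
    'lambda': 2,  # L2 regularization on the weights; larger values make overfitting less likely
    #'alpha': 0,  # L1 regularization
    #'scale_pos_weight': 1,  # a value greater than 0 helps convergence when the classes are imbalanced
    #'objective': 'multi:softmax',  # multiclass problems
    #'num_class': 10,  # number of classes, used together with multi:softmax
    'seed': 1000,  # random seed
    #'eval_metric': 'auc'
}

num_rounds = 100  # number of boosting rounds
watchlist = [(xgb_train, 'train'), (xgb_test, 'val')]

# train the model
# early_stopping_rounds stops training when the validation metric has not improved for that many rounds;
# with num_rounds = 100 and early_stopping_rounds = 100 it never triggers, so lower it to stop early
model = xgb.train(params, xgb_train, num_rounds, watchlist, early_stopping_rounds=100)
#model.save_model('./model/xgb.model')  # save the trained model

# best_ntree_limit is gone from recent xgboost releases; use best_iteration with iteration_range
print("best iteration", model.best_iteration)
y_pred = model.predict(xgb_test, iteration_range=(0, model.best_iteration + 1))
print('error=%f' % (sum(1 for i in range(len(y_pred)) if int(y_pred[i] > 0.5) != y_test[i]) / float(len(y_pred))))

# print the running time
cost_time = time.time() - start_time
print("xgboost success!", '\n', "cost time:", cost_time, "(s)......")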

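xgb through the scikit-learn interface (recommended)

The parameter names that change are:
eta -> learning_rate
lambda -> reg_lambda
alpha -> reg_alpha

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.datasets import make_hastie_10_2
from xgboost.sklearn import XGBClassifier

X, y = make_hastie_10_2(random_state=0)
y = (y > 0).astype(int)  # the labels come as -1/+1; XGBClassifier expects 0/1
# test_size is the fraction of the data held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

clf = XGBClassifier(
    verbosity=1,  # replaces the older 'silent' option; 0 prints nothing, 1 prints warnings
    #nthread=4,  # number of CPU threads, defaults to the maximum
    learning_rate=0.3,  # learning rate
    min_child_weight=1,
    # Default is 1. It is the minimum sum of h (the second derivative) required in each leaf.
    # For 0-1 classification with imbalanced classes, if h is around 0.01,
    # min_child_weight = 1 means a leaf must contain at least about 100 samples.
    # This parameter strongly affects the result; the smaller it is, the easier it is to overfit.
    max_depth=6,  # tree depth; larger values overfit more easily
    gamma=0,  # minimum loss reduction to split a leaf further; larger is more conservative, typically 0.1 or 0.2
    subsample=1,  # row subsampling ratio
    max_delta_step=0,  # maximum delta step allowed for each tree's weight estimate
    colsample_bytree=1,  # column subsampling ratio per tree
    reg_lambda=1,  # L2 regularization on the weights; larger values make overfitting less likely
    #reg_alpha=0,  # L1 regularization
    #scale_pos_weight=1,  # balances positive and negative weights; helps convergence when the classes are imbalanced
    #objective='multi:softmax',  # learning task and objective for multiclass problems
    #num_class=10,  # number of classes, used together with multi:softmax
    n_estimators=100,  # number of trees
    seed=1000,  # random seed
    # set here because passing eval_metric to fit() was deprecated in xgboost 1.6 and removed in later releases
    eval_metric='auc'
)
clf.fit(X_train, y_train)

# pass validation sets; verbose=False suppresses the per-round output
# the original used X_val, y_val, which are not defined in this post; the test split is used instead
clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], verbose=False)

# results on the validation sets
evals_result = clf.evals_result()

y_true, y_pred = y_test, clf.predict(X_test)
print("Accuracy : %.4g" % metrics.accuracy_score(y_true, y_pred))

# regression
#m_regress = xgb.XGBRegressor(n_estimators=1000,seed=0)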

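Grid search

You can optimize one parameter, fix it, and then move on to the next.
Step 1: fix the learning rate and give the tree-based parameters common initial values, adjusted for whether the classes are imbalanced
max_depth, min_child_weight, gamma, subsample, scale_pos_weight
max_depth = 3: starting values between 4 and 6 are also good choices.
min_child_weight: a relatively small value for a highly imbalanced classification problem, e.g. 1
subsample, colsample_bytree = 0.8: the most common initial value
scale_pos_weight = 1: because the classes are highly imbalanced.
Step 2: max_depth and min_child_weight have a large effect on the final result
'max_depth':range(3,10,2),
'min_child_weight':range(1,6,2)
Tune coarsely over a wide range first, then fine-tune over a narrow range.
Step 3: tune gamma
'gamma':[i/10.0 for i in range(0,5)]
Step 4: tune subsample and colsample_bytree
'subsample':[i/100.0 for i in range(75,90,5)],
'colsample_bytree':[i/100.0 for i in range(75,90,5)]
Step 5: tune the regularization parameters
'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
'reg_lambda'
Step 6: lower the learning rate
learning_rate =0.01,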
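
Tuning in stages instead of searching everything at once matters because of the number of candidates. The sketch below counts the parameter combinations for the grids listed in the steps above, staged versus merged into one grid. It only counts candidates and fits no model; n_candidates is a hypothetical helper.

```python
from itertools import product

# One grid per tuning step, using the ranges listed above.
stages = [
    {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 6, 2)},
    {'gamma': [i / 10.0 for i in range(0, 5)]},
    {'subsample': [i / 100.0 for i in range(75, 90, 5)],
     'colsample_bytree': [i / 100.0 for i in range(75, 90, 5)]},
    {'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]},
]


def n_candidates(grid):
    """Number of parameter combinations a grid search would evaluate."""
    return len(list(product(*grid.values())))


staged = sum(n_candidates(g) for g in stages)

merged = {}
for g in stages:
    merged.update(g)
full = n_candidates(merged)

print(staged, full)  # 31 2700
```

With cv=5 each candidate is fitted five times, so the staged search costs 155 fits while the single large grid costs 13500.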

from sklearn.model_selection import GridSearchCV

tuned_parameters = [{'n_estimators': [100, 200, 500],
                     'max_depth': [3, 5, 7],  # range(3,10,2)
                     'learning_rate': [0.5, 1.0],
                     'subsample': [0.75, 0.8, 0.85, 0.9]
                     }]
# this assignment replaces the grid above, so only n_estimators is searched
tuned_parameters = [{'n_estimators': [100, 200, 500, 1000]
                     }]

# verbosity replaces the older silent option; the iid argument was removed in scikit-learn 0.24
clf = GridSearchCV(XGBClassifier(verbosity=1, nthread=4, learning_rate=0.5, min_child_weight=1, max_depth=3, gamma=0, subsample=1, colsample_bytree=1, reg_lambda=1, seed=1000),
                   param_grid=tuned_parameters, scoring='roc_auc', n_jobs=4, cv=5)
clf.fit(X_train, y_train)

# cv_results_ replaces the removed grid_scores_ attribute
##clf.cv_results_, clf.best_params_, clf.best_score_
print(clf.best_params_)

y_true, y_pred = y_test, clf.predict(X_test)
print("Accuracy : %.4g" % metrics.accuracy_score(y_true, y_pred))

y_proba = clf.predict_proba(X_test)[:, 1]
# computed on the test split
print("AUC Score (Test): %f" % metrics.roc_auc_score(y_true, y_proba))

from sklearn.model_selection import GridSearchCV

parameters = [{'learning_rate': [0.01, 0.1, 0.3], 'n_estimators': [1000, 1200, 1500, 2000, 2500]}]

clf = GridSearchCV(XGBClassifier(
             max_depth=3,
             min_child_weight=1,
             gamma=0.5,
             subsample=0.6,
             colsample_bytree=0.6,
             objective='binary:logistic',  # logistic loss for binary classification
             scale_pos_weight=1,
             reg_alpha=0,
             reg_lambda=1,
             seed=27
            ),
            param_grid=parameters, scoring='roc_auc')
clf.fit(X_train, y_train)
print(clf.best_params_)

y_pre = clf.predict(X_test)
y_pro = clf.predict_proba(X_test)[:, 1]
print("AUC Score : %f" % metrics.roc_auc_score(y_test, y_pro))
print("Accuracy : %.4g" % metrics.accuracy_score(y_test, y_pre))

Plotting feature importance

import pandas as pd
import matplotlib.pyplot as plt

# clf is the fitted GridSearchCV from above, so the booster is taken from its best estimator;
# booster() was renamed get_booster() in newer xgboost releases
feat_imp = pd.Series(clf.best_estimator_.get_booster().get_fscore()).sort_values(ascending=False)
#plt.bar(feat_imp.index, feat_imp)
feat_imp.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')
plt.show()
