XGBoost parameters

  • Choose a relatively high learning rate. In general, a value of 0.1 works, but depending on the problem the ideal learning rate can range from about 0.05 to 0.3. Then find the ideal number of trees for that learning rate. XGBoost has a very useful function, "cv", which runs cross-validation at every boosting iteration and returns the ideal number of trees.

  • For the chosen learning rate and number of trees, tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree). We can try different values when determining each tree; see the sketch after this list for an example.

  • Tune XGBoost's regularization parameters (lambda, alpha). These reduce model complexity and can thereby improve generalization.

  • Finally, lower the learning rate and settle on the ideal parameters.
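As a concrete illustration of steps 1 and 2, here is a minimal sketch (the parameter values are illustrative only, and it assumes a binary-classification DMatrix named dtrain like the one built in section 1 below):

# Step 1: fix a fairly high learning rate and let xgb.cv suggest the tree count
param = {'eta': 0.1, 'max_depth': 5, 'objective': 'binary:logistic'}
cv_results = xgb.cv(param, dtrain, num_boost_round=1000, nfold=5,
                    metrics={'error'}, seed=0,
                    early_stopping_rounds=50)  # stop once CV error stops improving
n_trees = len(cv_results)  # one row per boosting round that was kept
# Step 2: with eta and the tree count fixed, tune the tree-specific parameters
# (max_depth, min_child_weight, gamma, subsample, colsample_bytree), e.g. with
# GridSearchCV as shown in the grid-search section further below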

1. Reading libsvm-format data and modeling with explicit parameters

How to use XGBoost

  • ① XGBoost's native data format + XGBoost's native training API

    • Read the data into xgb.DMatrix format (from libsvm files, or from dataframe.values given X and y)
    • Prepare a watchlist (the datasets to monitor and evaluate on)
    • bst = xgb.train(param, dtrain)
    • bst.predict(dtest)
  • ② pandas DataFrame format + XGBoost's sklearn interface
    • estimator = xgb.XGBClassifier()/xgb.XGBRegressor()
    • estimator.fit(df_train.values, df_target.values)
#!/usr/bin/python
import numpy as np
#import scipy.sparse
import pickle
import xgboost as xgb

# Basic example: read data from a libsvm file and do binary classification.
# The data is in libsvm format, e.g.:
#1 3:1 10:1 11:1 21:1 30:1 34:1 36:1 40:1 41:1 53:1 58:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 105:1 117:1 124:1
#0 3:1 10:1 20:1 21:1 23:1 34:1 36:1 39:1 41:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 116:1 120:1
#0 1:1 10:1 19:1 21:1 24:1 34:1 36:1 39:1 42:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 116:1 122:1

# Convert to DMatrix format
dtrain = xgb.DMatrix('./data/agaricus.txt.train')
dtest = xgb.DMatrix('./data/agaricus.txt.test')

# Set the hyperparameters
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'}

# Set up a watchlist to monitor training progress
watchlist = [(dtest,'eval'), (dtrain,'train')]
num_round = 2
bst = xgb.train(param, dtrain, num_round, watchlist)

# Predict with the model
preds = bst.predict(dtest)

# Compute the error rate
labels = dtest.get_label()
print('Error rate: %f' % (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds))))

# Save the model
bst.save_model('./model/0001.model')
[15:49:14] 6513x127 matrix with 143286 entries loaded from ./data/agaricus.txt.train
[15:49:14] 1611x127 matrix with 35442 entries loaded from ./data/agaricus.txt.test
[0] eval-error:0.042831 train-error:0.046522
[1] eval-error:0.021726 train-error:0.022263
Error rate: 0.021726

2. Modeling with pandas DataFrame data

# Pima Indians Diabetes dataset. Fields include: number of pregnancies; plasma glucose concentration from an oral glucose tolerance test; diastolic blood pressure (mm Hg); triceps skin fold thickness (mm);
# 2-hour serum insulin (μU/ml); body mass index (kg/m^2); diabetes pedigree function; age (years)
import pandas as pd
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')
data.head()

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
#!/usr/bin/python
import numpy as np
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Basic example: read data from a csv file and do binary classification.
# Load the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')

# Split the data
train, test = train_test_split(data)

# Convert to DMatrix format
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
xgtrain = xgb.DMatrix(train[feature_columns].values, train[target_column].values)
xgtest = xgb.DMatrix(test[feature_columns].values, test[target_column].values)

# Set the parameters
param = {'max_depth':5, 'eta':0.1, 'silent':1, 'subsample':0.7, 'colsample_bytree':0.7, 'objective':'binary:logistic'}

# Set up a watchlist to monitor training progress
watchlist = [(xgtest,'eval'), (xgtrain,'train')]
num_round = 10
bst = xgb.train(param, xgtrain, num_round, watchlist)

# Predict with the model
preds = bst.predict(xgtest)

# Compute the error rate
labels = xgtest.get_label()
print('Error rate: %f' % (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds))))

# Save the model
bst.save_model('./model/0002.model')
[0]	eval-error:0.322917	train-error:0.21875
[1] eval-error:0.244792 train-error:0.168403
[2] eval-error:0.255208 train-error:0.182292
[3] eval-error:0.270833 train-error:0.170139
[4] eval-error:0.244792 train-error:0.144097
[5] eval-error:0.25 train-error:0.145833
[6] eval-error:0.229167 train-error:0.144097
[7] eval-error:0.25 train-error:0.145833
[8] eval-error:0.239583 train-error:0.147569
[9] eval-error:0.234375 train-error:0.140625
Error rate: 0.234375

3. Using XGBoost's sklearn wrapper

#!/usr/bin/python
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.externals import joblib  # on newer sklearn, use: import joblib

# Basic example: read data from a csv file and do binary classification.
# Load the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')

# Split the data
train, test = train_test_split(data)

# Separate the features X and the target y
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
train_X = train[feature_columns].values
train_y = train[target_column].values
test_X = test[feature_columns].values
test_y = test[target_column].values

# Initialize the model
xgb_classifier = xgb.XGBClassifier(n_estimators=20,
                                   max_depth=4,
                                   learning_rate=0.1,
                                   subsample=0.7,
                                   colsample_bytree=0.7)

# Fit the model
xgb_classifier.fit(train_X, train_y)

# Predict with the model
preds = xgb_classifier.predict(test_X)

# Compute the error rate
print('Error rate: %f' % ((preds != test_y).sum() / float(test_y.shape[0])))

# Save the model
joblib.dump(xgb_classifier, './model/0003.model')
Error rate: 0.276042

['./model/0003.model']

4. Cross-validation

xgb.cv(param, dtrain, num_round, nfold=5, metrics={'error'}, seed=0)

train-error-mean train-error-std test-error-mean test-error-std
0 0.006832 0.001012 0.006756 0.001407
1 0.002994 0.002806 0.002303 0.002524
2 0.001382 0.000352 0.001382 0.001228
3 0.001190 0.000658 0.001382 0.001228
4 0.001382 0.000282 0.001075 0.000921
5 0.000921 0.000506 0.001228 0.001041
6 0.000921 0.000506 0.001228 0.001041
7 0.000921 0.000506 0.001228 0.001041
8 0.000921 0.000506 0.001228 0.001041
9 0.000921 0.000506 0.001228 0.001041
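Because xgb.cv returns its evaluation history as a pandas DataFrame (one row per boosting round), the "ideal number of trees" from the tuning recipe at the top can be read directly off the result. A minimal sketch, reusing param/dtrain/num_round from above:

# xgb.cv returns a pandas DataFrame of the CV history (one row per round)
cv_results = xgb.cv(param, dtrain, num_round, nfold=5, metrics={'error'}, seed=0)
best_round = cv_results['test-error-mean'].idxmin()  # round with the lowest CV error
print('Best round: %d' % best_round)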

5. Cross-validation with preprocessing

# Compute the ratio of negative to positive samples and use it to weight the positive class
def fpreproc(dtrain, dtest, param):
    label = dtrain.get_label()
    ratio = float(np.sum(label == 0)) / np.sum(label == 1)
    param['scale_pos_weight'] = ratio
    return (dtrain, dtest, param)

# Run the preprocessing to set the sample weights, then cross-validate
xgb.cv(param, dtrain, num_round, nfold=5, metrics={'auc'}, seed=0, fpreproc=fpreproc)

train-auc-mean train-auc-std test-auc-mean test-auc-std
0 0.999772 0.000126 0.999731 0.000191
1 0.999942 0.000044 0.999909 0.000085
2 0.999964 0.000035 0.999926 0.000084
3 0.999979 0.000036 0.999950 0.000089
4 0.999976 0.000043 0.999946 0.000098
5 0.999994 0.000010 0.999988 0.000020
6 0.999993 0.000012 0.999988 0.000020
7 0.999993 0.000012 0.999988 0.000020
8 0.999993 0.000012 0.999988 0.000020
9 0.999993 0.000012 0.999988 0.000020

6. Custom loss functions and evaluation metrics
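Background for the code below: for the binary logistic loss with raw score s, label y, and p = sigmoid(s) = 1 / (1 + e^(-s)), the loss is L = -[y·log(p) + (1-y)·log(1-p)]. Differentiating with respect to s gives the gradient ∂L/∂s = p - y and the second derivative ∂²L/∂s² = p·(1-p), which is exactly what logregobj returns.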

print('running cross validation, with customized loss function')

# Custom loss function: must supply the first and second derivatives of the loss
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess

# Custom evaluation metric: measures the gap between predictions and ground truth
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'error', float(sum(labels != (preds > 0.0))) / len(labels)

watchlist = [(dtest,'eval'), (dtrain,'train')]
param = {'max_depth':3, 'eta':0.1, 'silent':1}
num_round = 5

# Train with the custom loss function
bst = xgb.train(param, dtrain, num_round, watchlist, logregobj, evalerror)

# Cross-validation
xgb.cv(param, dtrain, num_round, nfold=5, seed=0, obj=logregobj, feval=evalerror)
running cross validation, with customized loss function
[0] eval-rmse:0.306902 train-rmse:0.306163 eval-error:0.518312 train-error:0.517887
[1] eval-rmse:0.17919 train-rmse:0.177276 eval-error:0.518312 train-error:0.517887
[2] eval-rmse:0.172566 train-rmse:0.171727 eval-error:0.016139 train-error:0.014433
[3] eval-rmse:0.269611 train-rmse:0.271113 eval-error:0.016139 train-error:0.014433
[4] eval-rmse:0.396904 train-rmse:0.398245 eval-error:0.016139 train-error:0.014433

train-error-mean train-error-std train-rmse-mean train-rmse-std test-error-mean test-error-std test-rmse-mean test-rmse-std
0 0.517887 0.001085 0.308880 0.005170 0.517886 0.004343 0.309038 0.005207
1 0.517887 0.001085 0.176504 0.002046 0.517886 0.004343 0.177802 0.003767
2 0.014433 0.000223 0.172680 0.003719 0.014433 0.000892 0.174890 0.009391
3 0.014433 0.000223 0.275761 0.001776 0.014433 0.000892 0.276689 0.005918
4 0.014433 0.000223 0.399889 0.003369 0.014433 0.000892 0.400118 0.006243

7. Predicting with only the first n trees

#!/usr/bin/python
import numpy as np
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Basic example: read data from a csv file and do binary classification.
# Load the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')

# Split the data
train, test = train_test_split(data)

# Convert to DMatrix format
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
xgtrain = xgb.DMatrix(train[feature_columns].values, train[target_column].values)
xgtest = xgb.DMatrix(test[feature_columns].values, test[target_column].values)

# Set the parameters
param = {'max_depth':5, 'eta':0.1, 'silent':1, 'subsample':0.7, 'colsample_bytree':0.7, 'objective':'binary:logistic'}

# Set up a watchlist to monitor training progress
watchlist = [(xgtest,'eval'), (xgtrain,'train')]
num_round = 10
bst = xgb.train(param, xgtrain, num_round, watchlist)

# Predict using only the first tree
ypred1 = bst.predict(xgtest, ntree_limit=1)

# Predict using the first 9 trees
ypred2 = bst.predict(xgtest, ntree_limit=9)
label = xgtest.get_label()
print('Error rate using the first tree: %f' % (np.sum((ypred1 > 0.5) != label) / float(len(label))))
print('Error rate using the first 9 trees: %f' % (np.sum((ypred2 > 0.5) != label) / float(len(label))))
[0]	eval-error:0.28125	train-error:0.203125
[1] eval-error:0.182292 train-error:0.1875
[2] eval-error:0.21875 train-error:0.184028
[3] eval-error:0.213542 train-error:0.175347
[4] eval-error:0.223958 train-error:0.164931
[5] eval-error:0.223958 train-error:0.164931
[6] eval-error:0.208333 train-error:0.164931
[7] eval-error:0.192708 train-error:0.15625
[8] eval-error:0.21875 train-error:0.15625
[9] eval-error:0.208333 train-error:0.147569
Error rate using the first tree: 0.281250
Error rate using the first 9 trees: 0.218750

Using sklearn together with XGBoost

1. Modeling with XGBoost, evaluating with sklearn

import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.datasets import load_iris, load_digits, load_boston

rng = np.random.RandomState(31337)

# Binary classification: confusion matrix
print("Binary classification of digits 0 and 1")
digits = load_digits(n_class=2)
y = digits['target']
X = digits['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("Cross-validation on 2 folds")
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBClassifier().fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print("Confusion matrix:")
    print(confusion_matrix(actuals, predictions))

# Multi-class classification: confusion matrix
print("\nIris: multi-class classification")
iris = load_iris()
y = iris['target']
X = iris['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("Cross-validation on 2 folds")
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBClassifier().fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print("Confusion matrix:")
    print(confusion_matrix(actuals, predictions))

# Regression: MSE
print("\nBoston housing price regression")
boston = load_boston()
y = boston['target']
X = boston['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("Cross-validation on 2 folds")
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBRegressor().fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print("MSE:", mean_squared_error(actuals, predictions))
Binary classification of digits 0 and 1
Cross-validation on 2 folds
Confusion matrix:
[[87 0]
[ 1 92]]
Confusion matrix:
[[91 0]
[ 3 86]]

Iris: multi-class classification
Cross-validation on 2 folds
Confusion matrix:
[[19 0 0]
[ 0 31 3]
[ 0 1 21]]
Confusion matrix:
[[31 0 0]
[ 0 16 0]
[ 0 3 25]]

Boston housing price regression
Cross-validation on 2 folds
[15:53:36] WARNING: d:\build\xgboost\xgboost-0.90.git\src\objective\regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
MSE: 9.860776812557337
[15:53:36] WARNING: d:\build\xgboost\xgboost-0.90.git\src\objective\regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
MSE: 15.942418468446029

2. Grid search for optimal hyperparameters

# Tuning method for training style 2: sklearn-interface regressor + GridSearchCV
print("Hyperparameter optimization:")
y = boston['target']
X = boston['data']
xgb_model = xgb.XGBRegressor()
param_dict = {'max_depth': [2, 4, 6],
              'n_estimators': [50, 100, 200]}
clf = GridSearchCV(xgb_model, param_dict, verbose=1)
clf.fit(X, y)
print(clf.best_score_)
print(clf.best_params_)
Hyperparameter optimization:
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[15:53:37] WARNING: d:\build\xgboost\xgboost-0.90.git\src\objective\regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
0.6001029721598573
{'max_depth': 4, 'n_estimators': 100}
[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 0.7s finished

3. Early stopping

# Tuning method for training styles 1 and 2: early stopping
# Train on the training set, adding trees one at a time; watch performance on a
# validation set, and stop adding trees once the validation score stops improving
X = digits['data']
y = digits['target']
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
clf = xgb.XGBClassifier()
clf.fit(X_train,
        y_train,
        early_stopping_rounds=10,
        eval_metric="auc",
        eval_set=[(X_val, y_val)])
[0]	validation_0-auc:0.999497
Will train until validation_0-auc hasn't improved in 10 rounds.
[1] validation_0-auc:0.999497
[2] validation_0-auc:0.999497
[3] validation_0-auc:0.999749
[4] validation_0-auc:0.999749
[5] validation_0-auc:0.999749
[6] validation_0-auc:0.999749
[7] validation_0-auc:0.999749
[8] validation_0-auc:0.999749
[9] validation_0-auc:0.999749
[10] validation_0-auc:1
[11] validation_0-auc:1
[12] validation_0-auc:1
[13] validation_0-auc:1
[14] validation_0-auc:1
[15] validation_0-auc:1
[16] validation_0-auc:1
[17] validation_0-auc:1
[18] validation_0-auc:1
[19] validation_0-auc:1
[20] validation_0-auc:1
Stopping. Best iteration:
[10] validation_0-auc:1

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
              max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
              n_estimators=100, n_jobs=1, nthread=None,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
              subsample=1, verbosity=1)
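After early stopping, it is natural to also predict with only the trees up to the best iteration. A minimal sketch (assuming the xgboost-0.90-era attributes best_ntree_limit / best_iteration, which the sklearn wrapper sets after fitting with early stopping):

# Predict with only the trees up to the best iteration found by early stopping
# (best_ntree_limit / best_iteration are set by the 0.90-era sklearn wrapper)
y_pred = clf.predict(X_val, ntree_limit=clf.best_ntree_limit)
print('Best iteration: %d' % clf.best_iteration)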

4. Feature importance

iris = load_iris()
y = iris['target']
X = iris['data']
xgb_model = xgb.XGBClassifier().fit(X, y)

print('Feature ranking:')
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Get the feature importances
feature_importances = xgb_model.feature_importances_
indices = np.argsort(feature_importances)[::-1]
for index in indices:
    print("Feature %s has importance %f" % (feature_names[index], feature_importances[index]))

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 8))
plt.title("feature importances")
plt.bar(range(len(feature_importances)), feature_importances[indices], color='b')
plt.xticks(range(len(feature_importances)), np.array(feature_names)[indices], color='b')
Feature ranking:
Feature petal_length has importance 0.595834
Feature petal_width has importance 0.358166
Feature sepal_width has importance 0.033481
Feature sepal_length has importance 0.012520

([<matplotlib.axis.XTick at 0x1ed5a5bc7b8>,
<matplotlib.axis.XTick at 0x1ed5a3e6278>,
<matplotlib.axis.XTick at 0x1ed5a65c780>,
<matplotlib.axis.XTick at 0x1ed5a669748>],
<a list of 4 Text xticklabel objects>)
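As an alternative to building the bar chart by hand, xgboost also ships a plotting helper that works directly on a fitted model; by default it ranks features by "weight" (the number of times each feature is used in a split):

# Built-in importance plot (requires matplotlib)
xgb.plot_importance(xgb_model)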

5. Speeding up training with parallelism

import os

if __name__ == "__main__":
    try:
        from multiprocessing import set_start_method
    except ImportError:
        raise ImportError("Unable to import multiprocessing.set_start_method."
                          " This example only runs on Python 3.4")
    set_start_method("forkserver")

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import load_boston
    import xgboost as xgb

    rng = np.random.RandomState(31337)

    print("Parallel Parameter optimization")
    boston = load_boston()

    os.environ["OMP_NUM_THREADS"] = "2"  # or to whatever you want
    y = boston['target']
    X = boston['data']
    xgb_model = xgb.XGBRegressor()
    clf = GridSearchCV(xgb_model,
                       {'max_depth': [2, 4, 6], 'n_estimators': [50, 100, 200]},
                       verbose=1,
                       n_jobs=2)
    clf.fit(X, y)
    print(clf.best_score_)
    print(clf.best_params_)
