XGBoost parameters

  • Choose a relatively high learning rate. In general, a value of 0.1 works, but depending on the problem the ideal learning rate can range from about 0.05 to 0.3. Then find the ideal number of trees for that learning rate. XGBoost has a very useful function, "cv", which runs cross-validation at every boosting iteration and returns the ideal number of trees.

  • For the chosen learning rate and number of trees, tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree). We can try different values when determining each tree; see the sketch after this list for an example.

  • Tune XGBoost's regularization parameters (lambda, alpha). These reduce model complexity and can thereby improve generalization.

  • Finally, lower the learning rate and settle on the ideal parameters.
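As a concrete illustration of steps 1 and 2, here is a minimal sketch (the parameter values are illustrative only, and it assumes a binary-classification DMatrix named dtrain like the one built in section 1 below):

# Step 1: fix a fairly high learning rate and let xgb.cv suggest the tree count
param = {'eta': 0.1, 'max_depth': 5, 'objective': 'binary:logistic'}
cv_results = xgb.cv(param, dtrain, num_boost_round=1000, nfold=5,
                    metrics={'error'}, seed=0,
                    early_stopping_rounds=50)  # stop once CV error stops improving
n_trees = len(cv_results)  # one row per boosting round that was kept
# Step 2: with eta and the tree count fixed, tune the tree-specific parameters
# (max_depth, min_child_weight, gamma, subsample, colsample_bytree), e.g. with
# GridSearchCV as shown in the grid-search section further below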

1. Reading libsvm-format data and modeling with explicit parameters

How to use XGBoost

  • ① XGBoost's native data format + XGBoost's native training API

    • Read the data into xgb.DMatrix format (from libsvm files, or from dataframe.values given X and y)
    • Prepare a watchlist (the datasets to monitor and evaluate on)
    • bst = xgb.train(param, dtrain)
    • bst.predict(dtest)
  • ② pandas DataFrame format + XGBoost's sklearn interface
    • estimator = xgb.XGBClassifier()/xgb.XGBRegressor()
    • estimator.fit(df_train.values, df_target.values)
#!/usr/bin/python
import numpy as np
#import scipy.sparse
import pickle
import xgboost as xgb

# Basic example: read data from a libsvm file and do binary classification.
# The data is in libsvm format, e.g.:
#1 3:1 10:1 11:1 21:1 30:1 34:1 36:1 40:1 41:1 53:1 58:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 105:1 117:1 124:1
#0 3:1 10:1 20:1 21:1 23:1 34:1 36:1 39:1 41:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 116:1 120:1
#0 1:1 10:1 19:1 21:1 24:1 34:1 36:1 39:1 42:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 116:1 122:1

# Convert to DMatrix format
dtrain = xgb.DMatrix('./data/agaricus.txt.train')
dtest = xgb.DMatrix('./data/agaricus.txt.test')

# Set the hyperparameters
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'}

# Set up a watchlist to monitor training progress
watchlist = [(dtest,'eval'), (dtrain,'train')]
num_round = 2
bst = xgb.train(param, dtrain, num_round, watchlist)

# Predict with the model
preds = bst.predict(dtest)

# Compute the error rate
labels = dtest.get_label()
print('Error rate: %f' % (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds))))

# Save the model
bst.save_model('./model/0001.model')
[15:49:14] 6513x127 matrix with 143286 entries loaded from ./data/agaricus.txt.train
[15:49:14] 1611x127 matrix with 35442 entries loaded from ./data/agaricus.txt.test
[0] eval-error:0.042831 train-error:0.046522
[1] eval-error:0.021726 train-error:0.022263
Error rate: 0.021726

2. Modeling with pandas DataFrame data

# Pima Indians Diabetes dataset. Fields include: number of pregnancies; plasma glucose concentration from an oral glucose tolerance test; diastolic blood pressure (mm Hg); triceps skin fold thickness (mm);
# 2-hour serum insulin (μU/ml); body mass index (kg/m^2); diabetes pedigree function; age (years)
import pandas as pd
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')
data.head()

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
#!/usr/bin/python
import numpy as np
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Basic example: read data from a csv file and do binary classification.
# Load the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')

# Split the data
train, test = train_test_split(data)

# Convert to DMatrix format
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
xgtrain = xgb.DMatrix(train[feature_columns].values, train[target_column].values)
xgtest = xgb.DMatrix(test[feature_columns].values, test[target_column].values)

# Set the parameters
param = {'max_depth':5, 'eta':0.1, 'silent':1, 'subsample':0.7, 'colsample_bytree':0.7, 'objective':'binary:logistic'}

# Set up a watchlist to monitor training progress
watchlist = [(xgtest,'eval'), (xgtrain,'train')]
num_round = 10
bst = xgb.train(param, xgtrain, num_round, watchlist)

# Predict with the model
preds = bst.predict(xgtest)

# Compute the error rate
labels = xgtest.get_label()
print('Error rate: %f' % (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds))))

# Save the model
bst.save_model('./model/0002.model')
[0]	eval-error:0.322917	train-error:0.21875
[1] eval-error:0.244792 train-error:0.168403
[2] eval-error:0.255208 train-error:0.182292
[3] eval-error:0.270833 train-error:0.170139
[4] eval-error:0.244792 train-error:0.144097
[5] eval-error:0.25 train-error:0.145833
[6] eval-error:0.229167 train-error:0.144097
[7] eval-error:0.25 train-error:0.145833
[8] eval-error:0.239583 train-error:0.147569
[9] eval-error:0.234375 train-error:0.140625
Error rate: 0.234375

3. Using XGBoost's sklearn wrapper

#!/usr/bin/python
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.externals import joblib  # on newer sklearn, use: import joblib

# Basic example: read data from a csv file and do binary classification.
# Load the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')

# Split the data
train, test = train_test_split(data)

# Separate the features X and the target y
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
train_X = train[feature_columns].values
train_y = train[target_column].values
test_X = test[feature_columns].values
test_y = test[target_column].values

# Initialize the model
xgb_classifier = xgb.XGBClassifier(n_estimators=20,
                                   max_depth=4,
                                   learning_rate=0.1,
                                   subsample=0.7,
                                   colsample_bytree=0.7)

# Fit the model
xgb_classifier.fit(train_X, train_y)

# Predict with the model
preds = xgb_classifier.predict(test_X)

# Compute the error rate
print('Error rate: %f' % ((preds != test_y).sum() / float(test_y.shape[0])))

# Save the model
joblib.dump(xgb_classifier, './model/0003.model')
Error rate: 0.276042

['./model/0003.model']

4. Cross-validation

xgb.cv(param, dtrain, num_round, nfold=5, metrics={'error'}, seed=0)

train-error-mean train-error-std test-error-mean test-error-std
0 0.006832 0.001012 0.006756 0.001407
1 0.002994 0.002806 0.002303 0.002524
2 0.001382 0.000352 0.001382 0.001228
3 0.001190 0.000658 0.001382 0.001228
4 0.001382 0.000282 0.001075 0.000921
5 0.000921 0.000506 0.001228 0.001041
6 0.000921 0.000506 0.001228 0.001041
7 0.000921 0.000506 0.001228 0.001041
8 0.000921 0.000506 0.001228 0.001041
9 0.000921 0.000506 0.001228 0.001041
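Because xgb.cv returns its evaluation history as a pandas DataFrame (one row per boosting round), the "ideal number of trees" from the tuning recipe at the top can be read directly off the result. A minimal sketch, reusing param/dtrain/num_round from above:

# xgb.cv returns a pandas DataFrame of the CV history (one row per round)
cv_results = xgb.cv(param, dtrain, num_round, nfold=5, metrics={'error'}, seed=0)
best_round = cv_results['test-error-mean'].idxmin()  # round with the lowest CV error
print('Best round: %d' % best_round)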

5. Cross-validation with preprocessing

# Compute the ratio of negative to positive samples and use it to weight the positive class
def fpreproc(dtrain, dtest, param):
    label = dtrain.get_label()
    ratio = float(np.sum(label == 0)) / np.sum(label == 1)
    param['scale_pos_weight'] = ratio
    return (dtrain, dtest, param)

# Run the preprocessing to set the sample weights, then cross-validate
xgb.cv(param, dtrain, num_round, nfold=5, metrics={'auc'}, seed=0, fpreproc=fpreproc)

train-auc-mean train-auc-std test-auc-mean test-auc-std
0 0.999772 0.000126 0.999731 0.000191
1 0.999942 0.000044 0.999909 0.000085
2 0.999964 0.000035 0.999926 0.000084
3 0.999979 0.000036 0.999950 0.000089
4 0.999976 0.000043 0.999946 0.000098
5 0.999994 0.000010 0.999988 0.000020
6 0.999993 0.000012 0.999988 0.000020
7 0.999993 0.000012 0.999988 0.000020
8 0.999993 0.000012 0.999988 0.000020
9 0.999993 0.000012 0.999988 0.000020

6. Custom loss functions and evaluation metrics
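Background for the code below: for the binary logistic loss with raw score s, label y, and p = sigmoid(s) = 1 / (1 + e^(-s)), the loss is L = -[y·log(p) + (1-y)·log(1-p)]. Differentiating with respect to s gives the gradient ∂L/∂s = p - y and the second derivative ∂²L/∂s² = p·(1-p), which is exactly what logregobj returns.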

print('running cross validation, with customized loss function')

# Custom loss function: must supply the first and second derivatives of the loss
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess

# Custom evaluation metric: measures the gap between predictions and ground truth
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'error', float(sum(labels != (preds > 0.0))) / len(labels)

watchlist = [(dtest,'eval'), (dtrain,'train')]
param = {'max_depth':3, 'eta':0.1, 'silent':1}
num_round = 5

# Train with the custom loss function
bst = xgb.train(param, dtrain, num_round, watchlist, logregobj, evalerror)

# Cross-validation
xgb.cv(param, dtrain, num_round, nfold=5, seed=0, obj=logregobj, feval=evalerror)
running cross validation, with customized loss function
[0] eval-rmse:0.306902 train-rmse:0.306163 eval-error:0.518312 train-error:0.517887
[1] eval-rmse:0.17919 train-rmse:0.177276 eval-error:0.518312 train-error:0.517887
[2] eval-rmse:0.172566 train-rmse:0.171727 eval-error:0.016139 train-error:0.014433
[3] eval-rmse:0.269611 train-rmse:0.271113 eval-error:0.016139 train-error:0.014433
[4] eval-rmse:0.396904 train-rmse:0.398245 eval-error:0.016139 train-error:0.014433

train-error-mean train-error-std train-rmse-mean train-rmse-std test-error-mean test-error-std test-rmse-mean test-rmse-std
0 0.517887 0.001085 0.308880 0.005170 0.517886 0.004343 0.309038 0.005207
1 0.517887 0.001085 0.176504 0.002046 0.517886 0.004343 0.177802 0.003767
2 0.014433 0.000223 0.172680 0.003719 0.014433 0.000892 0.174890 0.009391
3 0.014433 0.000223 0.275761 0.001776 0.014433 0.000892 0.276689 0.005918
4 0.014433 0.000223 0.399889 0.003369 0.014433 0.000892 0.400118 0.006243

7. Predicting with only the first n trees

#!/usr/bin/python
import numpy as np
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Basic example: read data from a csv file and do binary classification.
# Load the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')

# Split the data
train, test = train_test_split(data)

# Convert to DMatrix format
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
xgtrain = xgb.DMatrix(train[feature_columns].values, train[target_column].values)
xgtest = xgb.DMatrix(test[feature_columns].values, test[target_column].values)

# Set the parameters
param = {'max_depth':5, 'eta':0.1, 'silent':1, 'subsample':0.7, 'colsample_bytree':0.7, 'objective':'binary:logistic'}

# Set up a watchlist to monitor training progress
watchlist = [(xgtest,'eval'), (xgtrain,'train')]
num_round = 10
bst = xgb.train(param, xgtrain, num_round, watchlist)

# Predict using only the first tree
ypred1 = bst.predict(xgtest, ntree_limit=1)

# Predict using the first 9 trees
ypred2 = bst.predict(xgtest, ntree_limit=9)
label = xgtest.get_label()
print('Error rate using the first tree: %f' % (np.sum((ypred1 > 0.5) != label) / float(len(label))))
print('Error rate using the first 9 trees: %f' % (np.sum((ypred2 > 0.5) != label) / float(len(label))))
[0]	eval-error:0.28125	train-error:0.203125
[1] eval-error:0.182292 train-error:0.1875
[2] eval-error:0.21875 train-error:0.184028
[3] eval-error:0.213542 train-error:0.175347
[4] eval-error:0.223958 train-error:0.164931
[5] eval-error:0.223958 train-error:0.164931
[6] eval-error:0.208333 train-error:0.164931
[7] eval-error:0.192708 train-error:0.15625
[8] eval-error:0.21875 train-error:0.15625
[9] eval-error:0.208333 train-error:0.147569
Error rate using the first tree: 0.281250
Error rate using the first 9 trees: 0.218750

Using sklearn together with XGBoost

1. Modeling with XGBoost, evaluating with sklearn

import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.datasets import load_iris, load_digits, load_boston

rng = np.random.RandomState(31337)

# Binary classification: confusion matrix
print("Binary classification of digits 0 and 1")
digits = load_digits(n_class=2)
y = digits['target']
X = digits['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("Cross-validation on 2 folds")
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBClassifier().fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print("Confusion matrix:")
    print(confusion_matrix(actuals, predictions))

# Multi-class classification: confusion matrix
print("\nIris: multi-class classification")
iris = load_iris()
y = iris['target']
X = iris['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("Cross-validation on 2 folds")
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBClassifier().fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print("Confusion matrix:")
    print(confusion_matrix(actuals, predictions))

# Regression: MSE
print("\nBoston housing price regression")
boston = load_boston()
y = boston['target']
X = boston['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("Cross-validation on 2 folds")
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBRegressor().fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    print("MSE:", mean_squared_error(actuals, predictions))
Binary classification of digits 0 and 1
Cross-validation on 2 folds
Confusion matrix:
[[87 0]
[ 1 92]]
Confusion matrix:
[[91 0]
[ 3 86]]

Iris: multi-class classification
Cross-validation on 2 folds
Confusion matrix:
[[19 0 0]
[ 0 31 3]
[ 0 1 21]]
Confusion matrix:
[[31 0 0]
[ 0 16 0]
[ 0 3 25]]

Boston housing price regression
Cross-validation on 2 folds
[15:53:36] WARNING: d:\build\xgboost\xgboost-0.90.git\src\objective\regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
MSE: 9.860776812557337
[15:53:36] WARNING: d:\build\xgboost\xgboost-0.90.git\src\objective\regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
MSE: 15.942418468446029

2. Grid search for optimal hyperparameters

# Tuning method for training style 2: sklearn-interface regressor + GridSearchCV
print("Hyperparameter optimization:")
y = boston['target']
X = boston['data']
xgb_model = xgb.XGBRegressor()
param_dict = {'max_depth': [2, 4, 6],
              'n_estimators': [50, 100, 200]}
clf = GridSearchCV(xgb_model, param_dict, verbose=1)
clf.fit(X, y)
print(clf.best_score_)
print(clf.best_params_)
Hyperparameter optimization:
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[15:53:37] WARNING: d:\build\xgboost\xgboost-0.90.git\src\objective\regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
0.6001029721598573
{'max_depth': 4, 'n_estimators': 100}
[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 0.7s finished

3. Early stopping

# Tuning method for training styles 1 and 2: early stopping
# Train on the training set, adding trees one at a time; watch performance on a
# validation set, and stop adding trees once the validation score stops improving
X = digits['data']
y = digits['target']
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
clf = xgb.XGBClassifier()
clf.fit(X_train,
        y_train,
        early_stopping_rounds=10,
        eval_metric="auc",
        eval_set=[(X_val, y_val)])
[0]	validation_0-auc:0.999497
Will train until validation_0-auc hasn't improved in 10 rounds.
[1] validation_0-auc:0.999497
[2] validation_0-auc:0.999497
[3] validation_0-auc:0.999749
[4] validation_0-auc:0.999749
[5] validation_0-auc:0.999749
[6] validation_0-auc:0.999749
[7] validation_0-auc:0.999749
[8] validation_0-auc:0.999749
[9] validation_0-auc:0.999749
[10] validation_0-auc:1
[11] validation_0-auc:1
[12] validation_0-auc:1
[13] validation_0-auc:1
[14] validation_0-auc:1
[15] validation_0-auc:1
[16] validation_0-auc:1
[17] validation_0-auc:1
[18] validation_0-auc:1
[19] validation_0-auc:1
[20] validation_0-auc:1
Stopping. Best iteration:
[10] validation_0-auc:1

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
              max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
              n_estimators=100, n_jobs=1, nthread=None,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
              subsample=1, verbosity=1)
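After early stopping, it is natural to also predict with only the trees up to the best iteration. A minimal sketch (assuming the xgboost-0.90-era attributes best_ntree_limit / best_iteration, which the sklearn wrapper sets after fitting with early stopping):

# Predict with only the trees up to the best iteration found by early stopping
# (best_ntree_limit / best_iteration are set by the 0.90-era sklearn wrapper)
y_pred = clf.predict(X_val, ntree_limit=clf.best_ntree_limit)
print('Best iteration: %d' % clf.best_iteration)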

4. Feature importance

iris = load_iris()
y = iris['target']
X = iris['data']
xgb_model = xgb.XGBClassifier().fit(X, y)

print('Feature ranking:')
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Get the feature importances
feature_importances = xgb_model.feature_importances_
indices = np.argsort(feature_importances)[::-1]
for index in indices:
    print("Feature %s has importance %f" % (feature_names[index], feature_importances[index]))

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 8))
plt.title("feature importances")
plt.bar(range(len(feature_importances)), feature_importances[indices], color='b')
plt.xticks(range(len(feature_importances)), np.array(feature_names)[indices], color='b')
Feature ranking:
Feature petal_length has importance 0.595834
Feature petal_width has importance 0.358166
Feature sepal_width has importance 0.033481
Feature sepal_length has importance 0.012520

([<matplotlib.axis.XTick at 0x1ed5a5bc7b8>,
<matplotlib.axis.XTick at 0x1ed5a3e6278>,
<matplotlib.axis.XTick at 0x1ed5a65c780>,
<matplotlib.axis.XTick at 0x1ed5a669748>],
<a list of 4 Text xticklabel objects>)
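As an alternative to building the bar chart by hand, xgboost also ships a plotting helper that works directly on a fitted model; by default it ranks features by "weight" (the number of times each feature is used in a split):

# Built-in importance plot (requires matplotlib)
xgb.plot_importance(xgb_model)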

5. Speeding up training with parallelism

import os

if __name__ == "__main__":
    try:
        from multiprocessing import set_start_method
    except ImportError:
        raise ImportError("Unable to import multiprocessing.set_start_method."
                          " This example only runs on Python 3.4")
    set_start_method("forkserver")

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import load_boston
    import xgboost as xgb

    rng = np.random.RandomState(31337)

    print("Parallel Parameter optimization")
    boston = load_boston()

    os.environ["OMP_NUM_THREADS"] = "2"  # or to whatever you want
    y = boston['target']
    X = boston['data']
    xgb_model = xgb.XGBRegressor()
    clf = GridSearchCV(xgb_model,
                       {'max_depth': [2, 4, 6], 'n_estimators': [50, 100, 200]},
                       verbose=1,
                       n_jobs=2)
    clf.fit(X, y)
    print(clf.best_score_)
    print(clf.best_params_)
