XGBoost Tutorial (Advanced), Part 3

1. Importing all the libraries

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import accuracy_score
2. Reading the file

This again uses the mushroom dataset, taking the 22 features directly from the Kaggle version: https://www.kaggle.com/uciml/mushroom-classification

Dataset download: http://download.csdn.net/download/u011630575/10266626

# path to where the data lies
dpath = './data/'
data = pd.read_csv(dpath+"mushrooms.csv")
data.head(6)
3. Let us check whether there are any null values
data.isnull().sum()  # check whether the data contains any null values
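One caveat: isnull() only counts real NaN values. To my knowledge, the UCI mushroom data records missing stalk-root entries as the string '?', so they would not show up in the count above. The sketch below uses a tiny made-up DataFrame (the values are illustrative, not taken from mushrooms.csv) to show the difference between the two checks.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "stalk-root": ["e", "?", "c", "?"],
})

null_counts = toy.isnull().sum()         # counts real NaN values only
placeholder_counts = (toy == "?").sum()  # counts the '?' placeholder per column

print(null_counts)
print(placeholder_counts)
```

Since the label encoding later in this tutorial treats '?' as just another category, nothing breaks, but it is worth knowing that the column is not fully observed.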
4. Check that there are two classes: each mushroom is either poisonous or edible
data['class'].unique()  # check that the only classes are poisonous and edible
print(data.dtypes)
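5. Check that there are 22 features (plus the label in the first column) and 8124 instances
data.shape  # expected: (8124, 23), i.e. 8124 instances with the label in the first column followed by 22 features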
6. The dataset stores its values as strings, so we need to convert every unique value to an integer. We do this by label-encoding the data.

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()  # encodes each distinct value as an integer between 0 and n_classes - 1
for col in data.columns:
    data[col] = labelencoder.fit_transform(data[col])

data.head()
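To make the encoding concrete, here is a minimal sketch on a made-up two-column frame (not the real data). It also keeps one encoder per column in a dictionary, which is my own addition: the loop above reuses a single encoder, so only the last column's mapping is still available afterwards.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample values, for illustration only.
toy = pd.DataFrame({"class": ["p", "e", "e"], "cap-shape": ["x", "b", "x"]})

encoders = {}  # hypothetical helper dict: column name -> fitted encoder
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# classes_ holds the sorted distinct values; a value's code is its index there.
class_values = encoders["class"].classes_.tolist()
class_codes = toy["class"].tolist()
decoded = encoders["class"].inverse_transform(toy["class"]).tolist()

print(class_values)  # ['e', 'p']
print(class_codes)   # [1, 0, 0]
print(decoded)       # ['p', 'e', 'e']
```

With this mapping, 'e' (edible) becomes 0 and 'p' (poisonous) becomes 1, so the "positive class" in the later predict_proba calls is the poisonous one.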
Separating features and label

X = data.iloc[:, 1:23]  # columns 1 to 22: the features
y = data.iloc[:, 0]  # column 0: the label
X.head()
y.head()
Splitting the data into training and testing dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=4)
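As a quick sanity check on what this split produces, the sketch below runs the same call on placeholder arrays with 8124 rows (the arrays are stand-ins, not the mushroom data). It shows the resulting sizes and that a fixed random_state makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 8124
X_demo = np.arange(n).reshape(-1, 1)  # placeholder features
y_demo = np.arange(n) % 2             # placeholder binary labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4)
print(len(X_tr), len(X_te))  # 6499 1625

# The same seed gives the same split on a second call.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4)
same_split = np.array_equal(X_te, X_te2)
print(same_split)  # True
```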
7. Default Logistic Regression

from sklearn.linear_model import LogisticRegression
model_LR = LogisticRegression()
model_LR.fit(X_train, y_train)
y_prob = model_LR.predict_proba(X_test)[:, 1]  # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0)  # This will threshold the probabilities to give class predictions.
model_LR.score(X_test, y_test)  # accuracy on the held-out test set (score expects the true labels)
Note: np.where(condition, x, y) works like an element-wise ternary operator: where the condition holds the result is x, otherwise it is y.
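A minimal sketch of that thresholding step on made-up probabilities:

```python
import numpy as np

probs = np.array([0.10, 0.50, 0.51, 0.93])  # made-up probabilities

labels = np.where(probs > 0.5, 1, 0)
print(labels.tolist())  # [0, 0, 1, 1]

# An equivalent formulation: compare, then cast the booleans to integers.
labels_alt = (probs > 0.5).astype(int)
print(labels_alt.tolist())  # [0, 0, 1, 1]
```

Note that a probability of exactly 0.5 falls into class 0 here, because the comparison is strict.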

ROC AUC

auc_roc = metrics.roc_auc_score(y_test, y_prob)  # AUC is a ranking metric, so it is computed from the probabilities
print(auc_roc)
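roc_auc_score measures how well the scores rank positives above negatives, so the result can depend on whether it is given probabilities or thresholded 0/1 labels. The made-up example below is a deliberately extreme case: the scores rank every positive above every negative, yet all of them are below 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4])  # made-up scores, perfectly ranked

auc_from_scores = roc_auc_score(y_true, scores)

hard_labels = np.where(scores > 0.5, 1, 0)  # every prediction becomes 0
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_scores)  # 1.0
print(auc_from_labels)  # 0.5
```

Thresholding discards the ranking information, which is why the second value collapses to chance level.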
8. Logistic Regression (tuned model)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics

# liblinear supports both the 'l1' and 'l2' penalties searched below;
# the default lbfgs solver does not support 'l1'
LR_model = LogisticRegression(solver='liblinear')

tuned_parameters = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                    'penalty': ['l1', 'l2']
                    }
9. CV (cross-validated grid search)

from sklearn.model_selection import GridSearchCV

LR = GridSearchCV(LR_model, tuned_parameters, cv=10)
LR.fit(X_train, y_train)
print(LR.best_params_)
y_prob = LR.predict_proba(X_test)[:, 1]  # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0)  # This will threshold the probabilities to give class predictions.
LR.score(X_test, y_test)  # accuracy of the best estimator on the held-out test set
auc_roc = metrics.roc_auc_score(y_test, y_prob)  # AUC is a ranking metric, so it is computed from the probabilities
print(auc_roc)
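For readers who want to try the search without the mushroom file, here is a small self-contained sketch on synthetic data. The dataset, the reduced grid and the names X_demo, y_demo and search are my own assumptions for illustration. It uses the liblinear solver because that solver accepts both the 'l1' and 'l2' penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the mushroom dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {"C": [0.01, 1, 100], "penalty": ["l1", "l2"]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
search.fit(X_demo, y_demo)

# One candidate per combination: 3 values of C times 2 penalties.
n_candidates = len(search.cv_results_["params"])
print(n_candidates)
print(search.best_params_, round(search.best_score_, 3))

# With the default refit=True, the best setting is retrained on all the data,
# so the search object can be used directly for prediction.
demo_pred = search.predict(X_demo[:5])
```

The cost of the search grows with the number of candidates times the number of folds, so the full grid above (7 values of C, 2 penalties, 10 folds) means 140 fits plus the final refit.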
10. Default Decision Tree model

from sklearn.tree import DecisionTreeClassifier

model_tree = DecisionTreeClassifier()
model_tree.fit(X_train, y_train)
y_prob = model_tree.predict_proba(X_test)[:, 1]  # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0)  # This will threshold the probabilities to give class predictions.
model_tree.score(X_test, y_test)  # accuracy on the held-out test set
auc_roc = metrics.roc_auc_score(y_test, y_prob)  # AUC is a ranking metric, so it is computed from the probabilities
auc_roc
11. Let us tune the hyperparameters of the Decision Tree model

from sklearn.tree import DecisionTreeClassifier

model_DD = DecisionTreeClassifier()

# "auto" is omitted: it was an alias for "sqrt" (max_features=sqrt(n_features))
# and recent scikit-learn releases no longer accept it
tuned_parameters = {'max_features': ["sqrt", "log2"],
                    'min_samples_leaf': range(1, 100, 1), 'max_depth': range(1, 50, 1)
                    }
# tuned_parameters = {'max_features': ["sqrt", "log2"]}

from sklearn.model_selection import GridSearchCV
DD = GridSearchCV(model_DD, tuned_parameters, cv=10)
DD.fit(X_train, y_train)
print(DD.cv_results_['mean_test_score'])  # per-candidate CV scores (cv_results_ replaces the old grid_scores_ attribute)
print(DD.best_score_)
print(DD.best_params_)
y_prob = DD.predict_proba(X_test)[:, 1]  # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0)  # This will threshold the probabilities to give class predictions.
DD.score(X_test, y_test)  # accuracy of the best estimator on the held-out test set
report = metrics.classification_report(y_test, y_pred)
print(report)
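The raw cv_results_ dictionary is easier to read as a table. The sketch below, on synthetic data with a much smaller grid (both are my own assumptions, chosen so it runs in seconds), loads it into a DataFrame and lists the best-ranked candidates.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the mushroom dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X_demo, y_demo)

# One row per candidate; rank_test_score == 1 marks the best mean CV score.
results = pd.DataFrame(search.cv_results_)
cols = ["param_max_depth", "param_min_samples_leaf",
        "mean_test_score", "std_test_score", "rank_test_score"]
top = results.sort_values("rank_test_score")[cols].head(3)
print(top)
```

Looking at std_test_score next to mean_test_score helps judge whether the top few candidates are really distinguishable from one another.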
12. Default Random Forest
from sklearn.ensemble import RandomForestClassifier

model_RR = RandomForestClassifier()
model_RR.fit(X_train, y_train)
y_prob = model_RR.predict_proba(X_test)[:, 1]  # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0)  # This will threshold the probabilities to give class predictions.
model_RR.score(X_test, y_test)  # accuracy on the held-out test set
auc_roc = metrics.roc_auc_score(y_test, y_prob)  # AUC is a ranking metric, so it is computed from the probabilities
auc_roc
13. Let us tune the parameters of Random Forest, just for the sake of learning
1) max_features

2) n_estimators (the number of estimators, i.e. trees)

3) min_samples_leaf

from sklearn.ensemble import RandomForestClassifier

model_RR = RandomForestClassifier()

# "auto" is omitted: it was an alias for "sqrt" and recent scikit-learn releases no longer accept it
tuned_parameters = {'min_samples_leaf': range(10, 100, 10), 'n_estimators': range(10, 100, 10),
                    'max_features': ['sqrt', 'log2']
                    }

from sklearn.model_selection import GridSearchCV
RR = GridSearchCV(model_RR, tuned_parameters, cv=10)

RR.fit(X_train, y_train)

print(RR.cv_results_['mean_test_score'])  # per-candidate CV scores (cv_results_ replaces the old grid_scores_ attribute)

print(RR.best_score_)

print(RR.best_params_)

y_prob = RR.predict_proba(X_test)[:, 1]  # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0)  # This will threshold the probabilities to give class predictions.
RR.score(X_test, y_test)  # accuracy of the best estimator on the held-out test set

auc_roc = metrics.roc_auc_score(y_test, y_prob)  # AUC is a ranking metric, so it is computed from the probabilities
auc_roc
14. Default XGBoost

from xgboost import XGBClassifier
model_XGB = XGBClassifier()
model_XGB.fit(X_train, y_train)
y_prob = model_XGB.predict_proba(X_test)[:, 1]  # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0)  # This will threshold the probabilities to give class predictions.
model_XGB.score(X_test, y_test)  # accuracy on the held-out test set
auc_roc = metrics.roc_auc_score(y_test, y_prob)  # AUC is a ranking metric, so it is computed from the probabilities
auc_roc
15. Feature importance
XGBoost computes the feature importances automatically during training and stores them in feature_importances_.

print(model_XGB.feature_importances_)
from matplotlib import pyplot
pyplot.bar(range(len(model_XGB.feature_importances_)), model_XGB.feature_importances_)
pyplot.show()
# plot feature importance using built-in function
from xgboost import plot_importance
plot_importance(model_XGB)
pyplot.show()
The feature importances can then be used for feature selection.
from numpy import sort
from sklearn.feature_selection import SelectFromModel

# Fit model using each importance as a threshold
thresholds = sort(model_XGB.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model_XGB, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1],
                                                   accuracy*100.0))
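To see what the threshold does in isolation, the sketch below repeats the selection step on synthetic data. As an assumption made purely so that the example depends only on scikit-learn, it swaps in a RandomForestClassifier for the XGBoost model; SelectFromModel only needs a fitted estimator that exposes feature_importances_.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data; the forest is a stand-in for the XGBoost model.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
importances = forest.feature_importances_

kept = {}  # threshold -> number of features that survive it
for thresh in np.sort(importances):
    selector = SelectFromModel(forest, threshold=thresh, prefit=True)
    kept[thresh] = selector.transform(X_demo).shape[1]
    print("thresh=%.3f keeps %d features" % (thresh, kept[thresh]))
```

A feature is kept when its importance is greater than or equal to the threshold, so the smallest threshold keeps every feature and each larger one drops the least important features first.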
