With our powers combined! xgboost and pipelearner
@drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.
Why a post on xgboost and pipelearner?
xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-fold validation, hyperparameter grid searching, and more. So bringing them together will make for an awesome combination!
The only problem - out of the box, xgboost doesn’t play nice with pipelearner. Let’s work out how to deal with this.
Setup
To follow this post you’ll need the following packages:
# Install (if necessary)
install.packages(c("xgboost", "tidyverse", "devtools"))
devtools::install_github("drsimonj/pipelearner")
# Attach
library(tidyverse)
library(xgboost)
library(pipelearner)
library(lazyeval)
Our example will be to try and predict whether tumours are cancerous or not using the Breast Cancer Wisconsin (Diagnostic) Data Set. Set up as follows:
data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
d <- read_csv(
data_url,
col_names = c('id', 'thinkness', 'size_uniformity',
'shape_uniformity', 'adhesion', 'epith_size',
'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer')) %>%
select(-id) %>% # Remove id; not useful here
filter(nuclei != '?') %>% # Remove records with missing data
mutate(cancer = cancer == 4) %>% # one-hot encode 'cancer' as 1=malignant;0=benign
mutate_all(as.numeric) # All to numeric; needed for XGBoost
d
#> # A tibble: 683 × 10
#> thinkness size_uniformity shape_uniformity adhesion epith_size nuclei
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 1 1 1 2 1
#> 2 5 4 4 5 7 10
#> 3 3 1 1 1 2 2
#> 4 6 8 8 1 3 4
#> 5 4 1 1 3 2 1
#> 6 8 10 10 8 7 10
#> 7 1 1 1 1 2 10
#> 8 2 1 2 1 2 1
#> 9 2 1 1 1 2 1
#> 10 4 2 1 1 2 1
#> # ... with 673 more rows, and 4 more variables: chromatin <dbl>,
#> # nucleoli <dbl>, mitoses <dbl>, cancer <dbl>
pipelearner
pipelearner makes it easy to do lots of routine machine learning tasks, many of which you can check out in this post. For this example, we’ll use pipelearner to perform a grid search of some xgboost hyperparameters.
Grid searching is easy with pipelearner. For detailed instructions, check out my previous post: tidy grid search with pipelearner. As a quick reminder, we declare a data frame, machine learning function, formula, and hyperparameters as vectors. Here’s an example that would grid search multiple values of minsplit
and maxdepth
for an rpart decision tree:
pipelearner(d, rpart::rpart, cancer ~ .,
minsplit = c(2, 4, 6, 8, 10),
maxdepth = c(2, 3, 4, 5))
The challenge for xgboost:
pipelearner expects a model function that has two arguments:
data
andformula
xgboost
Here’s an xgboost model:
# Prep data (X) and labels (y)
X <- select(d, -cancer) %>% as.matrix()
y <- d$cancer
# Fit the model
fit <- xgboost(X, y, nrounds = 5, objective = "reg:logistic")
#> [1] train-rmse:0.372184
#> [2] train-rmse:0.288560
#> [3] train-rmse:0.230171
#> [4] train-rmse:0.188965
#> [5] train-rmse:0.158858
# Examine accuracy
predicted <- as.numeric(predict(fit, X) >= .5)
mean(predicted == y)
#> [1] 0.9838946
Look like we have a model with 98.39% accuracy on the training data!
Regardless, notice that first two arguments to xgboost() are a numeric data matrix and a numeric label vector. This is not what pipelearner wants!
Wrapper function to parse data
and formula
To make xgboost compatible with pipelearner we need to write a wrapper function that accepts data
and formula
, and uses these to pass a feature matrix and label vector to xgboost
:
pl_xgboost <- function(data, formula, ...) {
data <- as.data.frame(data)
X_names <- as.character(f_rhs(formula))
y_name <- as.character(f_lhs(formula))
if (X_names == '.') {
X_names <- names(data)[names(data) != y_name]
}
X <- data.matrix(data[, X_names])
y <- data[[y_name]]
xgboost(data = X, label = y, ...)
}
Let’s try it out:
pl_fit <- pl_xgboost(d, cancer ~ ., nrounds = 5, objective = "reg:logistic")
#> [1] train-rmse:0.372184
#> [2] train-rmse:0.288560
#> [3] train-rmse:0.230171
#> [4] train-rmse:0.188965
#> [5] train-rmse:0.158858
# Examine accuracy
pl_predicted <- as.numeric(predict(pl_fit, as.matrix(select(d, -cancer))) >= .5)
mean(pl_predicted == y)
#> [1] 0.9838946
Perfect!
Bringing it all together
We can now use pipelearner
and pl_xgboost()
for easy grid searching:
pl <- pipelearner(d, pl_xgboost, cancer ~ .,
nrounds = c(5, 10, 25),
eta = c(.1, .3),
max_depth = c(4, 6))
fits <- pl %>% learn()
#> [1] train-rmse:0.453832
#> [2] train-rmse:0.412548
#> ...
fits
#> # A tibble: 12 × 9
#> models.id cv_pairs.id train_p fit target model
#> <chr> <chr> <dbl> <list> <chr> <chr>
#> 1 1 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 2 10 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 3 11 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 4 12 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 5 2 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 6 3 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 7 4 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 8 5 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 9 6 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 10 7 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 11 8 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 12 9 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> # ... with 3 more variables: params <list>, train <list>, test <list>
Looks like all the models learned OK. Let’s write a custom function to extract model accuracy and examine the results:
accuracy <- function(fit, data, target_var) {
# Convert resample object to data frame
data <- as.data.frame(data)
# Get feature matrix and labels
X <- data %>%
select(-matches(target_var)) %>%
as.matrix()
y <- data[[target_var]]
# Obtain predicted class
y_hat <- as.numeric(predict(fit, X) > .5)
# Return accuracy
mean(y_hat == y)
}
results <- fits %>%
mutate(
# hyperparameters
nrounds = map_dbl(params, "nrounds"),
eta = map_dbl(params, "eta"),
max_depth = map_dbl(params, "max_depth"),
# Accuracy
accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
accuracy_test = pmap_dbl(list(fit, test, target), accuracy)
) %>%
# Select columns and order rows
select(nrounds, eta, max_depth, contains("accuracy")) %>%
arrange(desc(accuracy_test), desc(accuracy_train))
results
#> # A tibble: 12 × 5
#> nrounds eta max_depth accuracy_train accuracy_test
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 25 0.3 6 1.0000000 0.9489051
#> 2 25 0.3 4 1.0000000 0.9489051
#> 3 10 0.3 6 0.9981685 0.9489051
#> 4 5 0.3 6 0.9945055 0.9489051
#> 5 10 0.1 6 0.9945055 0.9489051
#> 6 25 0.1 6 0.9945055 0.9489051
#> 7 5 0.1 6 0.9926740 0.9489051
#> 8 25 0.1 4 0.9890110 0.9489051
#> 9 10 0.3 4 0.9871795 0.9489051
#> 10 5 0.3 4 0.9853480 0.9489051
#> 11 10 0.1 4 0.9853480 0.9416058
#> 12 5 0.1 4 0.9835165 0.9416058
Our top model, which got 94.89% on a test set, had nrounds
= 25, eta
= 0.3, and max_depth
= 6.
Either way, the trick was the wrapper function pl_xgboost()
that let us bridge xgboost and pipelearner. Note that this same principle can be used for any other machine learning functions that don’t play nice with pipelearner.
Bonus: bootstrapped cross validation
For those of you who are comfortable, below is a bonus example of using 100 boostrapped cross validation samples to examine consistency in the accuracy. It doesn’t get much easier than using pipelearner!
results <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = 25) %>%
learn_cvpairs(n = 100) %>%
learn() %>%
mutate(
test_accuracy = pmap_dbl(list(fit, test, target), accuracy)
)
#> [1] train-rmse:0.357471
#> [2] train-rmse:0.256735
#> ...
results %>%
ggplot(aes(test_accuracy)) +
geom_histogram(bins = 30) +
scale_x_continuous(labels = scales::percent) +
theme_minimal() +
labs(x = "Accuracy", y = "Number of samples",
title = "Test accuracy distribution for\n100 bootstrapped samples")
Sign off
Thanks for reading and I hope this was useful for you.
For updates of recent blog posts, follow @drsimonj on Twitter, or email me atdrsimonjackson@gmail.com to get in touch.
If you’d like the code that produced this blog, check out the blogR GitHub repository.
转自:https://drsimonj.svbtle.com/with-our-powers-combined-xgboost-and-pipelearner
With our powers combined! xgboost and pipelearner的更多相关文章
- xgboost入门与实战(原理篇)
sklearn实战-乳腺癌细胞数据挖掘 https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campai ...
- xgboost原理及并行实现
XGBoost训练: It is not easy to train all the trees at once. Instead, we use an additive strategy: fix ...
- 搭建 windows(7)下Xgboost(0.4)环境 (python,java)以及使用介绍及参数调优
摘要: 1.所需工具 2.详细过程 3.验证 4.使用指南 5.参数调优 内容: 1.所需工具 我用到了git(内含git bash),Visual Studio 2012(10及以上就可以),xgb ...
- 在Windows10 64位 Anaconda4 Python3.5下安装XGBoost
系统环境: Windows10 64bit Anaconda4 Python3.5.1 软件安装: Git for Windows MINGW 在安装的时候要改一个选择(Architecture选择x ...
- CF Intel Code Challenge Final Round (Div. 1 + Div. 2, Combined)
1. Intel Code Challenge Final Round (Div. 1 + Div. 2, Combined) B. Batch Sort 暴力枚举,水 1.题意:n*m的数组, ...
- 【原创】xgboost 特征评分的计算原理
xgboost是基于GBDT原理进行改进的算法,效率高,并且可以进行并行化运算: 而且可以在训练的过程中给出各个特征的评分,从而表明每个特征对模型训练的重要性, 调用的源码就不准备详述,本文主要侧重的 ...
- Ubuntu: ImportError: No module named xgboost
ImportError: No module named xgboost 解决办法: git clone --recursive https://github.com/dmlc/xgboost cd ...
- windows下安装xgboost
Note that as of the most recent release the Microsoft Visual Studio instructions no longer seem to a ...
- xgboost原理及应用
1.背景 关于xgboost的原理网络上的资源很少,大多数还停留在应用层面,本文通过学习陈天奇博士的PPT 地址和xgboost导读和实战 地址,希望对xgboost原理进行深入理解. 2.xgboo ...
随机推荐
- spring_boot攻略1.1-hello SpringBoot
交流账号:2318645572 说明: 开发工具:eclipse 开发系统:windows 7 开发规范:maven项目 注意:按照我说的方式做下去 1.导包:pom.xml <project ...
- jsp之session对象
jsp之session对象:一:概念session对象可以在应用程序的web页面之间跳转时保存用户的信息,使整个用户会话一直存在,直到关闭浏览器或是销毁session.session的生命周期:20~ ...
- StringBuilder的实现
先看看MS给出的官方解释吧 (http://msdn.microsoft.com/zh-cn/library/system.text.stringbuilder(VS.80).aspx) String ...
- SpringMVC4+MyBatis+SQL Server2014 基于SqlSession实现读写分离(也可以实现主从分离)
前言 上篇文章我觉的使用拦截器虽然方便快捷,但是在使用读串还是写串上你无法控制,我更希望我们像jdbc那样可以手动控制我使用读写串,那么这篇则在sqlsession的基础上实现读写分离, 这种方式则需 ...
- LeetCode 376. Wiggle Subsequence 摆动子序列
原题 A sequence of numbers is called a wiggle sequence if the differences between successive numbers s ...
- WPF触屏Touch事件在嵌套控件中的响应问题
前几天遇到个touch事件的坑,记录下来以增强理解. 具体是 想把一个listview嵌套到另一个listview,这时候如果list view(子listview)的内容过多超过容器高度,它是不会出 ...
- Struts2框架的基本使用
前面已经介绍过了MVC思想,Struts2是一个优秀的MVC框架,大大降低了各个层之间的耦合度,具有很好的扩展性.从本篇开始我们学习Struts2的基本用法,本篇主要包括以下内容: Stru ...
- [ext4]13 空间管理 - Prealloc分配机制
作者:Younger Liu, 本作品采用知识共享署名-非商业性使用-相同方式共享 3.0 未本地化版本许可协议进行许可. 在ext4系统中,对于小文件和大文件的空间申请请求,都有不同的分配策略 ...
- Laravel5中Cookie的使用
今天在Laravel框架中使用Cookie的时候,碰到了点问题,自己被迷糊折腾了半多小时.期间研究了Cookie的实现类,也在网站找了许多的资料,包括问答.发现并没有解决问题.网上的答案都是互相抄袭, ...
- Unity 多屏(分屏)显示,Muti_Display
Unity 多屏(分屏)显示,Muti_Display 最近项目有个需求,主要用于在展厅的展示游戏. 比如,在一个很大的展厅,很大的显示屏挂在墙上,我们不可能通过操作墙上那块显示器上的按钮来控制游戏 ...