With our powers combined! xgboost and pipelearner
@drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.
Why a post on xgboost and pipelearner?
xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-fold validation, hyperparameter grid searching, and more. So bringing them together will make for an awesome combination!
The only problem - out of the box, xgboost doesn’t play nice with pipelearner. Let’s work out how to deal with this.
Setup
To follow this post you’ll need the following packages:
# Install (if necessary)
install.packages(c("xgboost", "tidyverse", "devtools"))
devtools::install_github("drsimonj/pipelearner")
# Attach
library(tidyverse)
library(xgboost)
library(pipelearner)
library(lazyeval)
Our example will be to try and predict whether tumours are cancerous or not using the Breast Cancer Wisconsin (Diagnostic) Data Set. Set up as follows:
data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
d <- read_csv(
data_url,
col_names = c('id', 'thinkness', 'size_uniformity',
'shape_uniformity', 'adhesion', 'epith_size',
'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer')) %>%
select(-id) %>% # Remove id; not useful here
filter(nuclei != '?') %>% # Remove records with missing data
mutate(cancer = cancer == 4) %>% # one-hot encode 'cancer' as 1=malignant;0=benign
mutate_all(as.numeric) # All to numeric; needed for XGBoost
d
#> # A tibble: 683 × 10
#> thinkness size_uniformity shape_uniformity adhesion epith_size nuclei
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 1 1 1 2 1
#> 2 5 4 4 5 7 10
#> 3 3 1 1 1 2 2
#> 4 6 8 8 1 3 4
#> 5 4 1 1 3 2 1
#> 6 8 10 10 8 7 10
#> 7 1 1 1 1 2 10
#> 8 2 1 2 1 2 1
#> 9 2 1 1 1 2 1
#> 10 4 2 1 1 2 1
#> # ... with 673 more rows, and 4 more variables: chromatin <dbl>,
#> # nucleoli <dbl>, mitoses <dbl>, cancer <dbl>
pipelearner
pipelearner makes it easy to do lots of routine machine learning tasks, many of which you can check out in this post. For this example, we’ll use pipelearner to perform a grid search of some xgboost hyperparameters.
Grid searching is easy with pipelearner. For detailed instructions, check out my previous post: tidy grid search with pipelearner. As a quick reminder, we declare a data frame, machine learning function, formula, and hyperparameters as vectors. Here’s an example that would grid search multiple values of minsplit
and maxdepth
for an rpart decision tree:
pipelearner(d, rpart::rpart, cancer ~ .,
minsplit = c(2, 4, 6, 8, 10),
maxdepth = c(2, 3, 4, 5))
The challenge for xgboost:
pipelearner expects a model function that has two arguments:
data
andformula
xgboost
Here’s an xgboost model:
# Prep data (X) and labels (y)
X <- select(d, -cancer) %>% as.matrix()
y <- d$cancer
# Fit the model
fit <- xgboost(X, y, nrounds = 5, objective = "reg:logistic")
#> [1] train-rmse:0.372184
#> [2] train-rmse:0.288560
#> [3] train-rmse:0.230171
#> [4] train-rmse:0.188965
#> [5] train-rmse:0.158858
# Examine accuracy
predicted <- as.numeric(predict(fit, X) >= .5)
mean(predicted == y)
#> [1] 0.9838946
Look like we have a model with 98.39% accuracy on the training data!
Regardless, notice that first two arguments to xgboost() are a numeric data matrix and a numeric label vector. This is not what pipelearner wants!
Wrapper function to parse data
and formula
To make xgboost compatible with pipelearner we need to write a wrapper function that accepts data
and formula
, and uses these to pass a feature matrix and label vector to xgboost
:
pl_xgboost <- function(data, formula, ...) {
data <- as.data.frame(data)
X_names <- as.character(f_rhs(formula))
y_name <- as.character(f_lhs(formula))
if (X_names == '.') {
X_names <- names(data)[names(data) != y_name]
}
X <- data.matrix(data[, X_names])
y <- data[[y_name]]
xgboost(data = X, label = y, ...)
}
Let’s try it out:
pl_fit <- pl_xgboost(d, cancer ~ ., nrounds = 5, objective = "reg:logistic")
#> [1] train-rmse:0.372184
#> [2] train-rmse:0.288560
#> [3] train-rmse:0.230171
#> [4] train-rmse:0.188965
#> [5] train-rmse:0.158858
# Examine accuracy
pl_predicted <- as.numeric(predict(pl_fit, as.matrix(select(d, -cancer))) >= .5)
mean(pl_predicted == y)
#> [1] 0.9838946
Perfect!
Bringing it all together
We can now use pipelearner
and pl_xgboost()
for easy grid searching:
pl <- pipelearner(d, pl_xgboost, cancer ~ .,
nrounds = c(5, 10, 25),
eta = c(.1, .3),
max_depth = c(4, 6))
fits <- pl %>% learn()
#> [1] train-rmse:0.453832
#> [2] train-rmse:0.412548
#> ...
fits
#> # A tibble: 12 × 9
#> models.id cv_pairs.id train_p fit target model
#> <chr> <chr> <dbl> <list> <chr> <chr>
#> 1 1 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 2 10 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 3 11 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 4 12 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 5 2 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 6 3 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 7 4 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 8 5 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 9 6 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 10 7 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 11 8 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 12 9 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> # ... with 3 more variables: params <list>, train <list>, test <list>
Looks like all the models learned OK. Let’s write a custom function to extract model accuracy and examine the results:
accuracy <- function(fit, data, target_var) {
# Convert resample object to data frame
data <- as.data.frame(data)
# Get feature matrix and labels
X <- data %>%
select(-matches(target_var)) %>%
as.matrix()
y <- data[[target_var]]
# Obtain predicted class
y_hat <- as.numeric(predict(fit, X) > .5)
# Return accuracy
mean(y_hat == y)
}
results <- fits %>%
mutate(
# hyperparameters
nrounds = map_dbl(params, "nrounds"),
eta = map_dbl(params, "eta"),
max_depth = map_dbl(params, "max_depth"),
# Accuracy
accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
accuracy_test = pmap_dbl(list(fit, test, target), accuracy)
) %>%
# Select columns and order rows
select(nrounds, eta, max_depth, contains("accuracy")) %>%
arrange(desc(accuracy_test), desc(accuracy_train))
results
#> # A tibble: 12 × 5
#> nrounds eta max_depth accuracy_train accuracy_test
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 25 0.3 6 1.0000000 0.9489051
#> 2 25 0.3 4 1.0000000 0.9489051
#> 3 10 0.3 6 0.9981685 0.9489051
#> 4 5 0.3 6 0.9945055 0.9489051
#> 5 10 0.1 6 0.9945055 0.9489051
#> 6 25 0.1 6 0.9945055 0.9489051
#> 7 5 0.1 6 0.9926740 0.9489051
#> 8 25 0.1 4 0.9890110 0.9489051
#> 9 10 0.3 4 0.9871795 0.9489051
#> 10 5 0.3 4 0.9853480 0.9489051
#> 11 10 0.1 4 0.9853480 0.9416058
#> 12 5 0.1 4 0.9835165 0.9416058
Our top model, which got 94.89% on a test set, had nrounds
= 25, eta
= 0.3, and max_depth
= 6.
Either way, the trick was the wrapper function pl_xgboost()
that let us bridge xgboost and pipelearner. Note that this same principle can be used for any other machine learning functions that don’t play nice with pipelearner.
Bonus: bootstrapped cross validation
For those of you who are comfortable, below is a bonus example of using 100 boostrapped cross validation samples to examine consistency in the accuracy. It doesn’t get much easier than using pipelearner!
results <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = 25) %>%
learn_cvpairs(n = 100) %>%
learn() %>%
mutate(
test_accuracy = pmap_dbl(list(fit, test, target), accuracy)
)
#> [1] train-rmse:0.357471
#> [2] train-rmse:0.256735
#> ...
results %>%
ggplot(aes(test_accuracy)) +
geom_histogram(bins = 30) +
scale_x_continuous(labels = scales::percent) +
theme_minimal() +
labs(x = "Accuracy", y = "Number of samples",
title = "Test accuracy distribution for\n100 bootstrapped samples")
Sign off
Thanks for reading and I hope this was useful for you.
For updates of recent blog posts, follow @drsimonj on Twitter, or email me atdrsimonjackson@gmail.com to get in touch.
If you’d like the code that produced this blog, check out the blogR GitHub repository.
转自:https://drsimonj.svbtle.com/with-our-powers-combined-xgboost-and-pipelearner
With our powers combined! xgboost and pipelearner的更多相关文章
- xgboost入门与实战(原理篇)
sklearn实战-乳腺癌细胞数据挖掘 https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campai ...
- xgboost原理及并行实现
XGBoost训练: It is not easy to train all the trees at once. Instead, we use an additive strategy: fix ...
- 搭建 windows(7)下Xgboost(0.4)环境 (python,java)以及使用介绍及参数调优
摘要: 1.所需工具 2.详细过程 3.验证 4.使用指南 5.参数调优 内容: 1.所需工具 我用到了git(内含git bash),Visual Studio 2012(10及以上就可以),xgb ...
- 在Windows10 64位 Anaconda4 Python3.5下安装XGBoost
系统环境: Windows10 64bit Anaconda4 Python3.5.1 软件安装: Git for Windows MINGW 在安装的时候要改一个选择(Architecture选择x ...
- CF Intel Code Challenge Final Round (Div. 1 + Div. 2, Combined)
1. Intel Code Challenge Final Round (Div. 1 + Div. 2, Combined) B. Batch Sort 暴力枚举,水 1.题意:n*m的数组, ...
- 【原创】xgboost 特征评分的计算原理
xgboost是基于GBDT原理进行改进的算法,效率高,并且可以进行并行化运算: 而且可以在训练的过程中给出各个特征的评分,从而表明每个特征对模型训练的重要性, 调用的源码就不准备详述,本文主要侧重的 ...
- Ubuntu: ImportError: No module named xgboost
ImportError: No module named xgboost 解决办法: git clone --recursive https://github.com/dmlc/xgboost cd ...
- windows下安装xgboost
Note that as of the most recent release the Microsoft Visual Studio instructions no longer seem to a ...
- xgboost原理及应用
1.背景 关于xgboost的原理网络上的资源很少,大多数还停留在应用层面,本文通过学习陈天奇博士的PPT 地址和xgboost导读和实战 地址,希望对xgboost原理进行深入理解. 2.xgboo ...
随机推荐
- DAM的使用结合串口和中断以及GPIO。
DAM的使用结合串口和中断以及GPIO. 当我学到DMA这章的时候就意味着我已经学完了,GPIO里的LED,按键,还有就是串口发送数据. 那么下面就来总结下前段时间所学的知识(因为接下来有断时间我是没 ...
- 微信小程序入门学习
前(che)言(dan): 近几天,微信小程序的内测引起了众多开发人员的热议,很多人都认为这将会成为一大热门,那么好吧,虽然我是一个小白,但这是个新玩意,花点时间稍稍钻研一下也是无妨的,谁让我没有女朋 ...
- 串口屏与触摸屏人机界面组态软件HMIMaker介绍
串口屏与触摸屏人机界面组态软件HMIMaker介绍 触摸屏人机界面组态软件HMIMaker,是一款基于ARM架构的嵌入式控制系统开发的嵌入式软件,专业应用于触摸屏的二级界面开发,具有单片机协议,mod ...
- 使用nodejs进行WEB开发
这里,准备从零开始用nodejs实现一个微博系统.功能包括路由控制.页面模板.数据库访问.用户注册.登录.用户会话等内容. 将会介绍Express框架.MVC设计模式.ejs模板引擎以及MongoDB ...
- lua 字符串
lua 字符串 语法 单引号 双引号 "[[字符串]]" 示例程序 local name1 = 'liao1' local name2 = "liao2" lo ...
- 【算法】RMQ LCA 讲课杂记
4月4日,应学弟要求去了次学校给小同学们讲了一堂课,其实讲的挺内容挺杂的,但是目的是引出LCA算法. 现在整理一下当天讲课的主要内容: 开始并没有直接引出LCA问题,而是讲了RMQ(Range Min ...
- STM32实战应用(一)——1602蓝牙时钟1液晶的显示测试
前言 从51到STM32F4学习这么久了,总算找到点头绪了,目前学习了GPIO,中断,定时器,看门狗的基本使用,所以想试着看看能不能做个什么东西,就是想复习一下最近学习的知识.正好上学期单片机课程设计 ...
- git提交如何忽略某些文件
在使用git对项目进行版本管理的时候,我们总有一些不需要提交到版本库里的文件和文件夹,这个时候我们就需要让git自动忽略掉一下文件. 使用.gitignore忽略文件 为了让git忽略指定的文件和文件 ...
- storage在IE8下的兼容性写法
storage 本地缓存,这是HTML5的一个非常好用的地方,具体好用在哪,网上可以找到很多,但是我觉得总结的都不是很完整,我建议大家有空的话可以看下JavaScript权威指南这本书,里面对于这个方 ...
- react基于nodejs简单的搭建与开发方法
只需安装babel命令,即可将react的jsx写法转换成浏览器认识的js写法 1.安装nodejs(百度下载安装即可,自带npm) 2.cmd打开命令行,cd进入在自己的文件夹下 执行命令: npm ...