@drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.

Why a post on xgboost and pipelearner?

xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-fold validation, hyperparameter grid searching, and more. So bringing them together will make for an awesome combination!

The only problem - out of the box, xgboost doesn’t play nice with pipelearner. Let’s work out how to deal with this.

Setup

To follow this post you’ll need the following packages:

# Install (if necessary)
install.packages(c("xgboost", "tidyverse", "devtools"))
devtools::install_github("drsimonj/pipelearner")

# Attach
library(tidyverse)
library(xgboost)
library(pipelearner)
library(lazyeval)

Our example will be to try and predict whether tumours are cancerous or not using the Breast Cancer Wisconsin (Diagnostic) Data Set. Set up as follows:

data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'

d <- read_csv(
  data_url,
  col_names = c('id', 'thickness', 'size_uniformity',
                'shape_uniformity', 'adhesion', 'epith_size',
                'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer')) %>%
  select(-id) %>%                  # Remove id; not useful here
  filter(nuclei != '?') %>%        # Remove records with missing data
  mutate(cancer = cancer == 4) %>% # Recode 'cancer': 1 = malignant, 0 = benign
  mutate_all(as.numeric)           # All to numeric; needed for xgboost

d
#> # A tibble: 683 × 10
#> thickness size_uniformity shape_uniformity adhesion epith_size nuclei
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 1 1 1 2 1
#> 2 5 4 4 5 7 10
#> 3 3 1 1 1 2 2
#> 4 6 8 8 1 3 4
#> 5 4 1 1 3 2 1
#> 6 8 10 10 8 7 10
#> 7 1 1 1 1 2 10
#> 8 2 1 2 1 2 1
#> 9 2 1 1 1 2 1
#> 10 4 2 1 1 2 1
#> # ... with 673 more rows, and 4 more variables: chromatin <dbl>,
#> # nucleoli <dbl>, mitoses <dbl>, cancer <dbl>

pipelearner

pipelearner makes it easy to do lots of routine machine learning tasks, many of which you can check out in this post. For this example, we’ll use pipelearner to perform a grid search of some xgboost hyperparameters.

Grid searching is easy with pipelearner. For detailed instructions, check out my previous post: tidy grid search with pipelearner. As a quick reminder, we declare a data frame, machine learning function, formula, and hyperparameters as vectors. Here’s an example that would grid search multiple values of minsplit and maxdepth for an rpart decision tree:

pipelearner(d, rpart::rpart, cancer ~ .,
            minsplit = c(2, 4, 6, 8, 10),
            maxdepth = c(2, 3, 4, 5))

The challenge for xgboost:

pipelearner expects a model function that has two arguments: data and formula
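To make this concrete, here's a minimal sketch of the kind of function pipelearner can use directly. The name pl_compatible is hypothetical and just for illustration; functions like lm() and rpart::rpart() already accept data and formula, so they work out of the box:

# A minimal sketch of the interface pipelearner expects.
# `pl_compatible` is a hypothetical name for illustration only.
pl_compatible <- function(data, formula, ...) {
  lm(formula = formula, data = data, ...)
}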

xgboost

Here’s an xgboost model:

# Prep data (X) and labels (y)
X <- select(d, -cancer) %>% as.matrix()
y <- d$cancer

# Fit the model
fit <- xgboost(X, y, nrounds = 5, objective = "reg:logistic")
#> [1] train-rmse:0.372184
#> [2] train-rmse:0.288560
#> [3] train-rmse:0.230171
#> [4] train-rmse:0.188965
#> [5] train-rmse:0.158858

# Examine accuracy
predicted <- as.numeric(predict(fit, X) >= .5)
mean(predicted == y)
#> [1] 0.9838946

Looks like we have a model with 98.39% accuracy on the training data!

Regardless, notice that the first two arguments to xgboost() are a numeric data matrix and a numeric label vector. This is not what pipelearner wants!
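To see the mismatch in action, here's a sketch of what the naive approach would look like (commented out, since we'd expect it to error):

# Sketch of the failure mode (commented out; expect an error if run):
# xgboost() has no formula interface, so pipelearner's (data, formula)
# call can't be matched to xgboost's (data, label, ...) signature.
# pipelearner(d, xgboost, cancer ~ .) %>% learn()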

Wrapper function to parse data and formula

To make xgboost compatible with pipelearner we need to write a wrapper function that accepts data and formula, and uses these to pass a feature matrix and label vector to xgboost:

pl_xgboost <- function(data, formula, ...) {
  data <- as.data.frame(data)

  # Pull predictor names and outcome name from the formula
  X_names <- as.character(f_rhs(formula))
  y_name  <- as.character(f_lhs(formula))

  # A '.' on the right-hand side means "all columns except the outcome"
  if (X_names == '.') {
    X_names <- names(data)[names(data) != y_name]
  }

  # Convert to the feature matrix and label vector that xgboost() expects
  X <- data.matrix(data[, X_names])
  y <- data[[y_name]]

  xgboost(data = X, label = y, ...)
}

Let’s try it out:

pl_fit <- pl_xgboost(d, cancer ~ ., nrounds = 5, objective = "reg:logistic")
#> [1] train-rmse:0.372184
#> [2] train-rmse:0.288560
#> [3] train-rmse:0.230171
#> [4] train-rmse:0.188965
#> [5] train-rmse:0.158858

# Examine accuracy
pl_predicted <- as.numeric(predict(pl_fit, as.matrix(select(d, -cancer))) >= .5)
mean(pl_predicted == y)
#> [1] 0.9838946

Perfect!
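As an extra sanity check, we could also confirm that the wrapper reproduces the direct fit. A quick sketch (not run above):

# Sanity-check sketch: with identical data and parameters, the wrapped
# fit should yield the same predictions as the direct xgboost() fit.
all.equal(predict(fit, X), predict(pl_fit, X))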

Bringing it all together

We can now use pipelearner and pl_xgboost() for easy grid searching:

pl <- pipelearner(d, pl_xgboost, cancer ~ .,
                  nrounds = c(5, 10, 25),
                  eta = c(.1, .3),
                  max_depth = c(4, 6))

fits <- pl %>% learn()
#> [1] train-rmse:0.453832
#> [2] train-rmse:0.412548
#> ...

fits
#> # A tibble: 12 × 9
#> models.id cv_pairs.id train_p fit target model
#> <chr> <chr> <dbl> <list> <chr> <chr>
#> 1 1 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 2 10 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 3 11 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 4 12 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 5 2 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 6 3 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 7 4 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 8 5 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 9 6 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 10 7 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 11 8 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> 12 9 1 1 <S3: xgb.Booster> cancer pl_xgboost
#> # ... with 3 more variables: params <list>, train <list>, test <list>

Looks like all the models learned OK. Let’s write a custom function to extract model accuracy and examine the results:

accuracy <- function(fit, data, target_var) {
  # Convert resample object to data frame
  data <- as.data.frame(data)

  # Get feature matrix and labels
  X <- data %>%
    select(-matches(target_var)) %>%
    as.matrix()
  y <- data[[target_var]]

  # Obtain predicted class
  y_hat <- as.numeric(predict(fit, X) > .5)

  # Return accuracy
  mean(y_hat == y)
}

results <- fits %>%
  mutate(
    # Hyperparameters
    nrounds   = map_dbl(params, "nrounds"),
    eta       = map_dbl(params, "eta"),
    max_depth = map_dbl(params, "max_depth"),
    # Accuracy
    accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
    accuracy_test  = pmap_dbl(list(fit, test, target), accuracy)
  ) %>%
  # Select columns and order rows
  select(nrounds, eta, max_depth, contains("accuracy")) %>%
  arrange(desc(accuracy_test), desc(accuracy_train))

results
#> # A tibble: 12 × 5
#> nrounds eta max_depth accuracy_train accuracy_test
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 25 0.3 6 1.0000000 0.9489051
#> 2 25 0.3 4 1.0000000 0.9489051
#> 3 10 0.3 6 0.9981685 0.9489051
#> 4 5 0.3 6 0.9945055 0.9489051
#> 5 10 0.1 6 0.9945055 0.9489051
#> 6 25 0.1 6 0.9945055 0.9489051
#> 7 5 0.1 6 0.9926740 0.9489051
#> 8 25 0.1 4 0.9890110 0.9489051
#> 9 10 0.3 4 0.9871795 0.9489051
#> 10 5 0.3 4 0.9853480 0.9489051
#> 11 10 0.1 4 0.9853480 0.9416058
#> 12 5 0.1 4 0.9835165 0.9416058

Our top model, which got 94.89% on a test set, had nrounds = 25, eta = 0.3, and max_depth = 6.

Either way, the trick was the wrapper function pl_xgboost() that let us bridge xgboost and pipelearner. Note that this same principle can be used for any other machine learning functions that don’t play nice with pipelearner.
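For example, here's a sketch of the same pattern applied to glmnet, another library that wants a feature matrix and response vector instead of a formula. Note that pl_glmnet is a hypothetical name, and this assumes the glmnet package is installed:

# Sketch: the pl_xgboost() pattern reused for glmnet::glmnet(), which,
# like xgboost(), expects a feature matrix x and a response vector y.
# pl_glmnet is hypothetical and untested here.
pl_glmnet <- function(data, formula, ...) {
  data <- as.data.frame(data)
  X_names <- as.character(f_rhs(formula))
  y_name  <- as.character(f_lhs(formula))
  if (X_names == '.') {
    X_names <- names(data)[names(data) != y_name]
  }
  X <- data.matrix(data[, X_names])
  y <- data[[y_name]]
  glmnet::glmnet(x = X, y = y, ...)
}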

Bonus: bootstrapped cross validation

For those of you who are comfortable, below is a bonus example of using 100 bootstrapped cross-validation samples to examine consistency in the accuracy. It doesn't get much easier than using pipelearner!

results <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = 25) %>%
  learn_cvpairs(n = 100) %>%
  learn() %>%
  mutate(
    test_accuracy = pmap_dbl(list(fit, test, target), accuracy)
  )
#> [1] train-rmse:0.357471
#> [2] train-rmse:0.256735
#> ...

results %>%
  ggplot(aes(test_accuracy)) +
  geom_histogram(bins = 30) +
  scale_x_continuous(labels = scales::percent) +
  theme_minimal() +
  labs(x = "Accuracy", y = "Number of samples",
       title = "Test accuracy distribution for\n100 bootstrapped samples")

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

Reposted from: https://drsimonj.svbtle.com/with-our-powers-combined-xgboost-and-pipelearner
