@drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.

Why a post on xgboost and pipelearner?

xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-fold validation, hyperparameter grid searching, and more. So bringing them together will make for an awesome combination!

The only problem - out of the box, xgboost doesn’t play nice with pipelearner. Let’s work out how to deal with this.

Setup

To follow this post you’ll need the following packages:

    # Install (if necessary)
    install.packages(c("xgboost", "tidyverse", "devtools"))
    devtools::install_github("drsimonj/pipelearner")

    # Attach
    library(tidyverse)
    library(xgboost)
    library(pipelearner)
    library(lazyeval)

Our example will try to predict whether tumours are malignant or benign using the Breast Cancer Wisconsin data set from the UCI Machine Learning Repository. Set it up as follows:

    data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'

    d <- read_csv(
      data_url,
      col_names = c('id', 'thinkness', 'size_uniformity',
                    'shape_uniformity', 'adhesion', 'epith_size',
                    'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer')) %>%
      select(-id) %>%                    # Remove id; not useful here
      filter(nuclei != '?') %>%          # Remove records with missing data
      mutate(cancer = cancer == 4) %>%   # Recode 'cancer' as 1 = malignant, 0 = benign
      mutate_all(as.numeric)             # All to numeric; needed for xgboost

    d
    #> # A tibble: 683 × 10
    #>    thinkness size_uniformity shape_uniformity adhesion epith_size nuclei
    #>        <dbl>           <dbl>            <dbl>    <dbl>      <dbl>  <dbl>
    #> 1          5               1                1        1          2      1
    #> 2          5               4                4        5          7     10
    #> 3          3               1                1        1          2      2
    #> 4          6               8                8        1          3      4
    #> 5          4               1                1        3          2      1
    #> 6          8              10               10        8          7     10
    #> 7          1               1                1        1          2     10
    #> 8          2               1                2        1          2      1
    #> 9          2               1                1        1          2      1
    #> 10         4               2                1        1          2      1
    #> # ... with 673 more rows, and 4 more variables: chromatin <dbl>,
    #> #   nucleoli <dbl>, mitoses <dbl>, cancer <dbl>
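Before modelling, it can be worth a quick sanity check of the outcome variable, for example counting benign versus malignant cases. A small sketch (not part of the original post):

    # Quick sanity check: how many benign (0) vs malignant (1) cases remain?
    d %>% count(cancer)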

pipelearner

pipelearner makes it easy to do lots of routine machine learning tasks, many of which you can check out in this post. For this example, we’ll use pipelearner to perform a grid search of some xgboost hyperparameters.

Grid searching is easy with pipelearner. For detailed instructions, check out my previous post: tidy grid search with pipelearner. As a quick reminder, we declare a data frame, machine learning function, formula, and hyperparameters as vectors. Here’s an example that would grid search multiple values of minsplit and maxdepth for an rpart decision tree:

    pipelearner(d, rpart::rpart, cancer ~ .,
                minsplit = c(2, 4, 6, 8, 10),
                maxdepth = c(2, 3, 4, 5))
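Note that this call only sets up the pipeline object; nothing is fitted until learn() is called. As a rough sketch of what the full rpart grid search would look like (the rpart_fits name is just illustrative):

    # Fitting all 5 x 4 = 20 minsplit/maxdepth combinations happens at learn()
    rpart_fits <- pipelearner(d, rpart::rpart, cancer ~ .,
                              minsplit = c(2, 4, 6, 8, 10),
                              maxdepth = c(2, 3, 4, 5)) %>%
      learn()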

The challenge for xgboost:

pipelearner expects a model function that takes two arguments: data and formula

xgboost

Here’s an xgboost model:

    # Prep data (X) and labels (y)
    X <- select(d, -cancer) %>% as.matrix()
    y <- d$cancer

    # Fit the model
    fit <- xgboost(X, y, nrounds = 5, objective = "reg:logistic")
    #> [1] train-rmse:0.372184
    #> [2] train-rmse:0.288560
    #> [3] train-rmse:0.230171
    #> [4] train-rmse:0.188965
    #> [5] train-rmse:0.158858

    # Examine accuracy
    predicted <- as.numeric(predict(fit, X) >= .5)
    mean(predicted == y)
    #> [1] 0.9838946

Looks like we have a model with 98.39% accuracy on the training data!

Regardless, notice that the first two arguments to xgboost() are a numeric feature matrix and a numeric label vector. This is not what pipelearner wants!
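To make the contrast concrete, here's a quick illustrative sketch (not part of the original post): a formula-interface model like rpart can be fit directly from a data frame and a formula, which is the shape pipelearner expects, whereas xgboost needs the matrix and label vector built by hand.

    # rpart already exposes the data + formula interface pipelearner expects:
    rpart_fit <- rpart::rpart(cancer ~ ., data = d)

    # xgboost does not: it wants a numeric feature matrix and a label vector,
    # so we have to construct those ourselves before calling it.
    xgb_fit <- xgboost(data = as.matrix(select(d, -cancer)),
                       label = d$cancer,
                       nrounds = 5, objective = "reg:logistic")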

Wrapper function to parse data and formula

To make xgboost compatible with pipelearner we need to write a wrapper function that accepts data and formula, and uses these to pass a feature matrix and label vector to xgboost:

    pl_xgboost <- function(data, formula, ...) {
      data <- as.data.frame(data)

      # Parse the formula: right-hand side = features, left-hand side = label
      X_names <- as.character(f_rhs(formula))
      y_name  <- as.character(f_lhs(formula))

      # A '.' on the right-hand side means "all other columns"
      if (X_names == '.') {
        X_names <- names(data)[names(data) != y_name]
      }

      # Convert to the matrix and label vector that xgboost() expects
      X <- data.matrix(data[, X_names])
      y <- data[[y_name]]

      xgboost(data = X, label = y, ...)
    }

Let’s try it out:

    pl_fit <- pl_xgboost(d, cancer ~ ., nrounds = 5, objective = "reg:logistic")
    #> [1] train-rmse:0.372184
    #> [2] train-rmse:0.288560
    #> [3] train-rmse:0.230171
    #> [4] train-rmse:0.188965
    #> [5] train-rmse:0.158858

    # Examine accuracy
    pl_predicted <- as.numeric(predict(pl_fit, as.matrix(select(d, -cancer))) >= .5)
    mean(pl_predicted == y)
    #> [1] 0.9838946

Perfect!
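One caveat with this wrapper: as.character(f_rhs(formula)) handles cancer ~ . (or a single predictor) nicely, but it won't split a multi-term formula like cancer ~ nuclei + mitoses into separate column names. If you need that, a more general variant could lean on base R's all.vars(); here's a hypothetical sketch (pl_xgboost2 is not part of the original post):

    # Hypothetical, more general variant: all.vars() returns every variable name
    # on each side of the formula, and '.' is still treated as "all other columns".
    pl_xgboost2 <- function(data, formula, ...) {
      data    <- as.data.frame(data)
      y_name  <- all.vars(formula[[2]])   # left-hand side (label)
      X_names <- all.vars(formula[[3]])   # right-hand side (features)

      if (identical(X_names, ".")) {
        X_names <- setdiff(names(data), y_name)
      }

      X <- data.matrix(data[, X_names, drop = FALSE])
      y <- data[[y_name]]
      xgboost(data = X, label = y, ...)
    }

For the cancer ~ . formula used throughout this post, both wrappers behave the same.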

Bringing it all together

We can now use pipelearner and pl_xgboost() for easy grid searching:

    pl <- pipelearner(d, pl_xgboost, cancer ~ .,
                      nrounds = c(5, 10, 25),
                      eta = c(.1, .3),
                      max_depth = c(4, 6))

    fits <- pl %>% learn()
    #> [1] train-rmse:0.453832
    #> [2] train-rmse:0.412548
    #> ...

    fits
    #> # A tibble: 12 × 9
    #>    models.id cv_pairs.id train_p               fit target      model
    #>        <chr>       <chr>   <dbl>            <list>  <chr>      <chr>
    #> 1          1           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 2         10           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 3         11           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 4         12           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 5          2           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 6          3           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 7          4           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 8          5           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 9          6           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 10         7           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 11         8           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 12         9           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> # ... with 3 more variables: params <list>, train <list>, test <list>

Looks like all the models learned OK. Let’s write a custom function to extract model accuracy and examine the results:

    accuracy <- function(fit, data, target_var) {
      # Convert resample object to data frame
      data <- as.data.frame(data)

      # Get feature matrix and labels
      X <- data %>%
        select(-matches(target_var)) %>%
        as.matrix()
      y <- data[[target_var]]

      # Obtain predicted class
      y_hat <- as.numeric(predict(fit, X) > .5)

      # Return accuracy
      mean(y_hat == y)
    }

    results <- fits %>%
      mutate(
        # Hyperparameters
        nrounds   = map_dbl(params, "nrounds"),
        eta       = map_dbl(params, "eta"),
        max_depth = map_dbl(params, "max_depth"),
        # Accuracy
        accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
        accuracy_test  = pmap_dbl(list(fit, test, target), accuracy)
      ) %>%
      # Select columns and order rows
      select(nrounds, eta, max_depth, contains("accuracy")) %>%
      arrange(desc(accuracy_test), desc(accuracy_train))

    results
    #> # A tibble: 12 × 5
    #>    nrounds   eta max_depth accuracy_train accuracy_test
    #>      <dbl> <dbl>     <dbl>          <dbl>         <dbl>
    #> 1       25   0.3         6      1.0000000     0.9489051
    #> 2       25   0.3         4      1.0000000     0.9489051
    #> 3       10   0.3         6      0.9981685     0.9489051
    #> 4        5   0.3         6      0.9945055     0.9489051
    #> 5       10   0.1         6      0.9945055     0.9489051
    #> 6       25   0.1         6      0.9945055     0.9489051
    #> 7        5   0.1         6      0.9926740     0.9489051
    #> 8       25   0.1         4      0.9890110     0.9489051
    #> 9       10   0.3         4      0.9871795     0.9489051
    #> 10       5   0.3         4      0.9853480     0.9489051
    #> 11      10   0.1         4      0.9853480     0.9416058
    #> 12       5   0.1         4      0.9835165     0.9416058

Our top model, which got 94.89% on a test set, had nrounds = 25, eta = 0.3, and max_depth = 6.
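If you also want to keep the fitted booster for that winning combination (results dropped the list columns above), one possible approach, not shown in the original post, is to sort fits by test accuracy before selecting columns (the best_fits and best_fit names below are just illustrative):

    # Hypothetical: keep the xgb.Booster with the highest test-set accuracy
    best_fits <- fits %>%
      mutate(accuracy_test = pmap_dbl(list(fit, test, target), accuracy)) %>%
      arrange(desc(accuracy_test))

    best_fit <- best_fits$fit[[1]]   # an xgb.Booster, ready for predict()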

Either way, the trick was the wrapper function pl_xgboost(), which let us bridge xgboost and pipelearner. The same principle can be used for any other machine learning function that doesn't play nicely with pipelearner.

Bonus: bootstrapped cross validation

For those of you who are comfortable so far, below is a bonus example that uses 100 bootstrapped cross-validation samples to examine how consistent the test accuracy is. It doesn't get much easier than with pipelearner!

    results <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = 25) %>%
      learn_cvpairs(n = 100) %>%
      learn() %>%
      mutate(
        test_accuracy = pmap_dbl(list(fit, test, target), accuracy)
      )
    #> [1] train-rmse:0.357471
    #> [2] train-rmse:0.256735
    #> ...

    results %>%
      ggplot(aes(test_accuracy)) +
      geom_histogram(bins = 30) +
      scale_x_continuous(labels = scales::percent) +
      theme_minimal() +
      labs(x = "Accuracy", y = "Number of samples",
           title = "Test accuracy distribution for\n100 bootstrapped samples")

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

Source: https://drsimonj.svbtle.com/with-our-powers-combined-xgboost-and-pipelearner
