@drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.

Why a post on xgboost and pipelearner?

xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-fold validation, hyperparameter grid searching, and more. So bringing them together will make for an awesome combination!

The only problem - out of the box, xgboost doesn’t play nice with pipelearner. Let’s work out how to deal with this.


To follow this post you’ll need the following packages:

    # Install (if necessary); lazyeval provides f_lhs() and f_rhs(), used below
    install.packages(c("xgboost", "tidyverse", "devtools", "lazyeval"))
    devtools::install_github("drsimonj/pipelearner")

    # Attach
    library(tidyverse)
    library(xgboost)
    library(pipelearner)
    library(lazyeval)

Our example will be to try and predict whether tumours are cancerous or not using the Breast Cancer Wisconsin (Diagnostic) Data Set. Set up as follows:

    data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'

    d <- read_csv(
      data_url,
      col_names = c('id', 'thinkness', 'size_uniformity',
                    'shape_uniformity', 'adhesion', 'epith_size',
                    'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer')) %>%
      select(-id) %>%                   # Remove id; not useful here
      filter(nuclei != '?') %>%         # Remove records with missing data
      mutate(cancer = cancer == 4) %>%  # Binary target: TRUE = malignant (class 4), FALSE = benign
      mutate_all(as.numeric)            # All to numeric (TRUE/FALSE become 1/0); needed for XGBoost

    d
    #> # A tibble: 683 × 10
    #>    thinkness size_uniformity shape_uniformity adhesion epith_size nuclei
    #>        <dbl>           <dbl>            <dbl>    <dbl>      <dbl>  <dbl>
    #> 1          5               1                1        1          2      1
    #> 2          5               4                4        5          7     10
    #> 3          3               1                1        1          2      2
    #> 4          6               8                8        1          3      4
    #> 5          4               1                1        3          2      1
    #> 6          8              10               10        8          7     10
    #> 7          1               1                1        1          2     10
    #> 8          2               1                2        1          2      1
    #> 9          2               1                1        1          2      1
    #> 10         4               2                1        1          2      1
    #> # ... with 673 more rows, and 4 more variables: chromatin <dbl>,
    #> #   nucleoli <dbl>, mitoses <dbl>, cancer <dbl>


pipelearner makes it easy to do lots of routine machine learning tasks, many of which you can check out in this post. For this example, we’ll use pipelearner to perform a grid search of some xgboost hyperparameters.

Grid searching is easy with pipelearner. For detailed instructions, check out my previous post: tidy grid search with pipelearner. As a quick reminder, we declare a data frame, machine learning function, formula, and hyperparameters as vectors. Here’s an example that would grid search multiple values of minsplit and maxdepth for an rpart decision tree:

    pipelearner(d, rpart::rpart, cancer ~ .,
                minsplit = c(2, 4, 6, 8, 10),
                maxdepth = c(2, 3, 4, 5))
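
To get a feel for how many models a grid like this implies, here is a minimal base R sketch that enumerates the combinations with expand.grid(). It only illustrates the idea of a full grid and is not pipelearner's internal code; the name `grid` is just for illustration.

```r
# Enumerate every combination of the two hyperparameter vectors.
# Illustration only: pipelearner builds its own grid internally.
grid <- expand.grid(
  minsplit = c(2, 4, 6, 8, 10),
  maxdepth = c(2, 3, 4, 5)
)

nrow(grid)     # 5 values x 4 values = 20 candidate models
head(grid, 3)  # first few combinations
```

Each row corresponds to one model that would be fitted, so the grid grows multiplicatively with every hyperparameter you add.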

The challenge for xgboost:

pipelearner expects a model function that has two arguments: data and formula.


Here’s an xgboost model:

    # Prep data (X) and labels (y)
    X <- select(d, -cancer) %>% as.matrix()
    y <- d$cancer

    # Fit the model
    fit <- xgboost(X, y, nrounds = 5, objective = "reg:logistic")
    #> [1] train-rmse:0.372184
    #> [2] train-rmse:0.288560
    #> [3] train-rmse:0.230171
    #> [4] train-rmse:0.188965
    #> [5] train-rmse:0.158858

    # Examine accuracy
    predicted <- as.numeric(predict(fit, X) >= .5)
    mean(predicted == y)
    #> [1] 0.9838946

Looks like we have a model with 98.39% accuracy on the training data!

Regardless, notice that the first two arguments to xgboost() are a numeric data matrix and a numeric label vector. This is not what pipelearner wants!

Wrapper function to parse data and formula

To make xgboost compatible with pipelearner we need to write a wrapper function that accepts data and formula, and uses these to pass a feature matrix and label vector to xgboost:

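    pl_xgboost <- function(data, formula, ...) {
      # pipelearner passes resample objects, so coerce to a data frame
      data <- as.data.frame(data)

      # Variable names on each side of the formula
      X_names <- all.vars(f_rhs(formula))
      y_name  <- all.vars(f_lhs(formula))

      # Expand "." to every column except the target
      if (identical(X_names, ".")) {
        X_names <- setdiff(names(data), y_name)
      }

      # Feature matrix and label vector for xgboost
      X <- data.matrix(data[, X_names, drop = FALSE])
      y <- data[[y_name]]

      xgboost(data = X, label = y, ...)
    }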

Let’s try it out:

    pl_fit <- pl_xgboost(d, cancer ~ ., nrounds = 5, objective = "reg:logistic")
    #> [1] train-rmse:0.372184
    #> [2] train-rmse:0.288560
    #> [3] train-rmse:0.230171
    #> [4] train-rmse:0.188965
    #> [5] train-rmse:0.158858

    # Examine accuracy
    pl_predicted <- as.numeric(predict(pl_fit, as.matrix(select(d, -cancer))) >= .5)
    mean(pl_predicted == y)
    #> [1] 0.9838946
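
If it's unclear what the wrapper pulls out of the formula, the following base R sketch walks through the same steps on a tiny data frame. It uses all.vars() on the two sides of the formula instead of the lazyeval helpers, and the data frame `toy` is made up purely for illustration.

```r
# Toy data frame, made up for illustration (not the tumour data)
toy <- data.frame(cancer = c(1, 0, 1), size = c(3, 1, 8), shape = c(2, 1, 7))

f <- cancer ~ .

# A two-sided formula stores its left-hand side at [[2]] and right-hand side at [[3]]
y_name  <- all.vars(f[[2]])  # name of the target variable
X_names <- all.vars(f[[3]])  # "." stands for "all other columns"

# Expand the dot into the remaining column names
if (identical(X_names, ".")) {
  X_names <- setdiff(names(toy), y_name)
}

X <- data.matrix(toy[, X_names, drop = FALSE])  # numeric feature matrix
y <- toy[[y_name]]                              # label vector

X_names
dim(X)
```

The resulting `X` and `y` are exactly the kind of matrix and vector that a function with an x/y interface expects.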


Bringing it all together

We can now use pipelearner and pl_xgboost() for easy grid searching:

    pl <- pipelearner(d, pl_xgboost, cancer ~ .,
                      nrounds = c(5, 10, 25),
                      eta = c(.1, .3),
                      max_depth = c(4, 6))

    fits <- pl %>% learn()
    #> [1] train-rmse:0.453832
    #> [2] train-rmse:0.412548
    #> ...

    fits
    #> # A tibble: 12 × 9
    #>    models.id cv_pairs.id train_p               fit target      model
    #>        <chr>       <chr>   <dbl>            <list>  <chr>      <chr>
    #> 1          1           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 2         10           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 3         11           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 4         12           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 5          2           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 6          3           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 7          4           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 8          5           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 9          6           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 10         7           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 11         8           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 12         9           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> # ... with 3 more variables: params <list>, train <list>, test <list>

Looks like all the models learned OK. Let’s write a custom function to extract model accuracy and examine the results:

    accuracy <- function(fit, data, target_var) {
      # Convert resample object to data frame
      data <- as.data.frame(data)
      # Get feature matrix and labels
      X <- data %>%
        select(-all_of(target_var)) %>%  # Drop the target by exact name
        as.matrix()
      y <- data[[target_var]]
      # Obtain predicted class
      y_hat <- as.numeric(predict(fit, X) > .5)
      # Return accuracy
      mean(y_hat == y)
    }

    results <- fits %>%
      mutate(
        # hyperparameters
        nrounds   = map_dbl(params, "nrounds"),
        eta       = map_dbl(params, "eta"),
        max_depth = map_dbl(params, "max_depth"),
        # Accuracy
        accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
        accuracy_test  = pmap_dbl(list(fit, test, target), accuracy)
      ) %>%
      # Select columns and order rows
      select(nrounds, eta, max_depth, contains("accuracy")) %>%
      arrange(desc(accuracy_test), desc(accuracy_train))

    results
    #> # A tibble: 12 × 5
    #>    nrounds   eta max_depth accuracy_train accuracy_test
    #>      <dbl> <dbl>     <dbl>          <dbl>         <dbl>
    #> 1       25   0.3         6      1.0000000     0.9489051
    #> 2       25   0.3         4      1.0000000     0.9489051
    #> 3       10   0.3         6      0.9981685     0.9489051
    #> 4        5   0.3         6      0.9945055     0.9489051
    #> 5       10   0.1         6      0.9945055     0.9489051
    #> 6       25   0.1         6      0.9945055     0.9489051
    #> 7        5   0.1         6      0.9926740     0.9489051
    #> 8       25   0.1         4      0.9890110     0.9489051
    #> 9       10   0.3         4      0.9871795     0.9489051
    #> 10       5   0.3         4      0.9853480     0.9489051
    #> 11      10   0.1         4      0.9853480     0.9416058
    #> 12       5   0.1         4      0.9835165     0.9416058

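Ten of the twelve models tied at 94.89% accuracy on the test set. Breaking ties by training accuracy, the top-ranked model had nrounds = 25, eta = 0.3, and max_depth = 6, although the same settings with max_depth = 4 scored identically on both sets.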

Either way, the trick was the wrapper function pl_xgboost() that let us bridge xgboost and pipelearner. Note that this same principle can be used for any other machine learning functions that don’t play nice with pipelearner.
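
As a rough sketch of how that principle might generalise, the helper below turns any function with an (x, y) interface into one that accepts (data, formula). The names `as_formula_model` and `pl_lm_fit` are hypothetical and not part of pipelearner. The sketch uses only base R and demonstrates the idea with stats::lm.fit() on the built-in mtcars data. Note that model.matrix() adds an intercept column, which suits lm.fit() but which you would probably want to drop for something like xgboost.

```r
# Hypothetical helper: wrap a function of the form fn(x, y, ...) so that it
# accepts (data, formula, ...) instead.
as_formula_model <- function(fn) {
  function(data, formula, ...) {
    data <- as.data.frame(data)
    mf <- model.frame(formula, data = data)    # resolves "." against the data
    X  <- model.matrix(attr(mf, "terms"), mf)  # numeric design matrix (with intercept)
    y  <- model.response(mf)                   # response vector
    fn(X, y, ...)
  }
}

# Example: stats::lm.fit() takes a design matrix and a response vector
pl_lm_fit <- as_formula_model(stats::lm.fit)
fit <- pl_lm_fit(mtcars, mpg ~ wt + hp)
fit$coefficients
```

Writing a dedicated wrapper, as done for pl_xgboost(), gives you more control over how the feature matrix is built; a generic adapter like this trades that control for reuse.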

Bonus: bootstrapped cross validation

For those of you who are comfortable, below is a bonus example of using 100 bootstrapped cross validation samples to examine consistency in the accuracy. It doesn’t get much easier than using pipelearner!

    results <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = 25) %>%
      learn_cvpairs(n = 100) %>%
      learn() %>%
      mutate(
        test_accuracy = pmap_dbl(list(fit, test, target), accuracy)
      )
    #> [1] train-rmse:0.357471
    #> [2] train-rmse:0.256735
    #> ...

    results %>%
      ggplot(aes(test_accuracy)) +
      geom_histogram(bins = 30) +
      scale_x_continuous(labels = scales::percent) +
      theme_minimal() +
      labs(x = "Accuracy", y = "Number of samples",
           title = "Test accuracy distribution for\n100 bootstrapped samples")
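
Alongside the histogram, a few summary numbers can help describe how consistent the accuracy is. The sketch below shows one way to do that in base R. The vector `acc` holds hypothetical values standing in for results$test_accuracy; they are not output from the models above.

```r
# Hypothetical test accuracies standing in for results$test_accuracy
acc <- c(0.95, 0.96, 0.94, 0.97, 0.95, 0.93, 0.96, 0.95)

mean(acc)                       # typical accuracy across resamples
sd(acc)                         # spread across resamples
quantile(acc, c(0.025, 0.975))  # rough 95% interval of observed accuracies
```

With the real results you would replace `acc` with results$test_accuracy.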

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you'd like the code that produced this blog, check out the blogR GitHub repository.


