With our powers combined! xgboost and pipelearner

@drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.

Why a post on xgboost and pipelearner?

xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-validation, hyperparameter grid searching, and more. So bringing them together will make for an awesome combination!

The only problem - out of the box, xgboost doesn’t play nice with pipelearner. Let’s work out how to deal with this.

Setup

To follow this post you’ll need the following packages:

# Install (if necessary)
install.packages(c("xgboost", "tidyverse", "devtools", "lazyeval"))
devtools::install_github("drsimonj/pipelearner")
# Attach
library(tidyverse)
library(xgboost)
library(pipelearner)
library(lazyeval)

Our example will be to try and predict whether tumours are cancerous or not using the Breast Cancer Wisconsin (Diagnostic) Data Set. Set up as follows:

data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
d <- read_csv(
  data_url,
  col_names = c('id', 'thickness', 'size_uniformity',
                'shape_uniformity', 'adhesion', 'epith_size',
                'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer')) %>%
  select(-id) %>%            # Remove id; not useful here
  filter(nuclei != '?') %>%  # Remove records with missing data
  mutate(cancer = cancer == 4) %>% # Binary-encode 'cancer': 1 = malignant, 0 = benign
  mutate_all(as.numeric)     # All to numeric; needed for XGBoost
d
#> # A tibble: 683 × 10
#>    thickness size_uniformity shape_uniformity adhesion epith_size nuclei
#>        <dbl>           <dbl>            <dbl>    <dbl>      <dbl>  <dbl>
#> 1          5               1                1        1          2      1
#> 2          5               4                4        5          7     10
#> 3          3               1                1        1          2      2
#> 4          6               8                8        1          3      4
#> 5          4               1                1        3          2      1
#> 6          8              10               10        8          7     10
#> 7          1               1                1        1          2     10
#> 8          2               1                2        1          2      1
#> 9          2               1                1        1          2      1
#> 10         4               2                1        1          2      1
#> # ... with 673 more rows, and 4 more variables: chromatin <dbl>,
#> #   nucleoli <dbl>, mitoses <dbl>, cancer <dbl>

pipelearner

pipelearner makes it easy to do lots of routine machine learning tasks, many of which you can check out in this post. For this example, we’ll use pipelearner to perform a grid search of some xgboost hyperparameters.

Grid searching is easy with pipelearner. For detailed instructions, check out my previous post: tidy grid search with pipelearner. As a quick reminder, we declare a data frame, machine learning function, formula, and hyperparameters as vectors. Here’s an example that would grid search multiple values of minsplit and maxdepth for an rpart decision tree:

pipelearner(d, rpart::rpart, cancer ~ .,
            minsplit = c(2, 4, 6, 8, 10),
            maxdepth = c(2, 3, 4, 5))

The challenge for xgboost:

pipelearner expects a model function that has two arguments: data and formula.
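To make that requirement concrete, here is a minimal sketch using base R's lm(), which already accepts both a formula and a data argument. The toy data frame and the helper fit_model() are made up for illustration; pipelearner's real internals may call the model function differently.

```r
# Toy data, invented purely for illustration
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Hypothetical helper (not part of pipelearner): calls any model function
# that understands `data` and `formula`, the interface described above
fit_model <- function(model_fn, data, formula, ...) {
  model_fn(data = data, formula = formula, ...)
}

# lm() accepts both arguments by name, so it fits this interface as-is
toy_fit <- fit_model(lm, toy, y ~ x)
coef(toy_fit)
```

A function like xgboost(), whose leading arguments are a matrix and a label vector, cannot be dropped into fit_model() in the same way.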

xgboost

Here’s an xgboost model:

# Prep data (X) and labels (y)
X <- select(d, -cancer) %>% as.matrix()
y <- d$cancer
# Fit the model
fit <- xgboost(X, y, nrounds = 5, objective = "reg:logistic")
#> [1]  train-rmse:0.372184
#> [2]  train-rmse:0.288560
#> [3]  train-rmse:0.230171
#> [4]  train-rmse:0.188965
#> [5]  train-rmse:0.158858
# Examine accuracy
predicted <- as.numeric(predict(fit, X) >= .5)
mean(predicted == y)
#> [1] 0.9838946

Looks like we have a model with 98.39% accuracy on the training data!

Regardless, notice that the first two arguments to xgboost() are a numeric data matrix and a numeric label vector. This is not what pipelearner wants!

Wrapper function to parse `data` and `formula`

To make xgboost compatible with pipelearner we need to write a wrapper function that accepts data and formula, and uses these to pass a feature matrix and label vector to xgboost:

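pl_xgboost <- function(data, formula, ...) {
  data <- as.data.frame(data)
  # Variable names on each side of the formula; all.vars() drops operators
  # such as `+`, so multi-term formulas like y ~ a + b also work
  X_names <- all.vars(f_rhs(formula))
  y_name  <- all.vars(f_lhs(formula))
  # Expand `.` to every column except the outcome
  if (identical(X_names, '.')) {
    X_names <- names(data)[names(data) != y_name]
  }
  X <- data.matrix(data[, X_names, drop = FALSE])  # numeric feature matrix
  y <- data[[y_name]]                              # numeric label vector
  xgboost(data = X, label = y, ...)
}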
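The name-extraction step can also be tried in isolation. The sketch below is an assumption-laden illustration rather than the wrapper itself: it uses only base R (formula[[2]], formula[[3]] and all.vars()) instead of lazyeval, and both parse_xy_names() and the toy data frame are hypothetical.

```r
# Hypothetical helper: find outcome and feature names from a formula
parse_xy_names <- function(data, formula) {
  y_name  <- all.vars(formula[[2]])  # left-hand side of the formula
  X_names <- all.vars(formula[[3]])  # right-hand side, operators dropped
  # `.` means "every column except the outcome"
  if (identical(X_names, ".")) {
    X_names <- setdiff(names(data), y_name)
  }
  list(X = X_names, y = y_name)
}

# Toy data, invented for illustration
toy <- data.frame(a = 1:3, b = 4:6, target = c(0, 1, 0))

parse_xy_names(toy, target ~ .)  # features: every column but `target`
parse_xy_names(toy, target ~ a)  # features: `a` only
```

Working with variable names rather than the raw right-hand side keeps the `.` check a single TRUE/FALSE test, even when the formula lists several predictors.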

Let’s try it out:

pl_fit <- pl_xgboost(d, cancer ~ ., nrounds = 5, objective = "reg:logistic")
#> [1]  train-rmse:0.372184
#> [2]  train-rmse:0.288560
#> [3]  train-rmse:0.230171
#> [4]  train-rmse:0.188965
#> [5]  train-rmse:0.158858
# Examine accuracy
pl_predicted <- as.numeric(predict(pl_fit, as.matrix(select(d, -cancer))) >= .5)
mean(pl_predicted == y)
#> [1] 0.9838946

Perfect!

Bringing it all together

We can now use pipelearner and pl_xgboost() for easy grid searching:

pl <- pipelearner(d, pl_xgboost, cancer ~ .,
                  nrounds = c(5, 10, 25),
                  eta = c(.1, .3),
                  max_depth = c(4, 6))
fits <- pl %>% learn()
#> [1]  train-rmse:0.453832
#> [2]  train-rmse:0.412548
#> ...
fits
#> # A tibble: 12 × 9
#>    models.id cv_pairs.id train_p               fit target      model
#>        <chr>       <chr>   <dbl>            <list>  <chr>      <chr>
#> 1          1           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 2         10           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 3         11           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 4         12           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 5          2           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 6          3           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 7          4           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 8          5           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 9          6           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 10         7           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 11         8           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> 12         9           1       1 <S3: xgb.Booster> cancer pl_xgboost
#> # ... with 3 more variables: params <list>, train <list>, test <list>
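As a quick sanity check on why there are 12 rows, the hyperparameter vectors can be crossed by hand with base R's expand.grid(). This is only an illustration of the combination count; it is not how pipelearner builds its grid internally, and the name grid is arbitrary.

```r
# Every combination of the three hyperparameter vectors used above
grid <- expand.grid(nrounds   = c(5, 10, 25),
                    eta       = c(.1, .3),
                    max_depth = c(4, 6))

# 3 values x 2 values x 2 values
nrow(grid)
```

The row count matches the 12 models in fits, one per hyperparameter combination.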

Looks like all the models learned OK. Let’s write a custom function to extract model accuracy and examine the results:

accuracy <- function(fit, data, target_var) {
  # Convert resample object to data frame
  data <- as.data.frame(data)
  # Get feature matrix and labels
  X <- data %>%
    select(-matches(target_var)) %>%
    as.matrix()
  y <- data[[target_var]]
  # Obtain predicted class
  y_hat <- as.numeric(predict(fit, X) > .5)
  # Return accuracy
  mean(y_hat == y)
}
results <- fits %>%
  mutate(
    # hyperparameters
    nrounds   = map_dbl(params, "nrounds"),
    eta       = map_dbl(params, "eta"),
    max_depth = map_dbl(params, "max_depth"),
    # Accuracy
    accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
    accuracy_test  = pmap_dbl(list(fit, test,  target), accuracy)
  ) %>%
  # Select columns and order rows
  select(nrounds, eta, max_depth, contains("accuracy")) %>%
  arrange(desc(accuracy_test), desc(accuracy_train))
results
#> # A tibble: 12 × 5
#>    nrounds   eta max_depth accuracy_train accuracy_test
#>      <dbl> <dbl>     <dbl>          <dbl>         <dbl>
#> 1       25   0.3         6      1.0000000     0.9489051
#> 2       25   0.3         4      1.0000000     0.9489051
#> 3       10   0.3         6      0.9981685     0.9489051
#> 4        5   0.3         6      0.9945055     0.9489051
#> 5       10   0.1         6      0.9945055     0.9489051
#> 6       25   0.1         6      0.9945055     0.9489051
#> 7        5   0.1         6      0.9926740     0.9489051
#> 8       25   0.1         4      0.9890110     0.9489051
#> 9       10   0.3         4      0.9871795     0.9489051
#> 10       5   0.3         4      0.9853480     0.9489051
#> 11      10   0.1         4      0.9853480     0.9416058
#> 12       5   0.1         4      0.9835165     0.9416058

Ten of the twelve models tied for the best test accuracy (94.89%). The one sorted to the top, with ties broken by training accuracy, had nrounds = 25, eta = 0.3, and max_depth = 6.

Either way, the trick was the wrapper function pl_xgboost() that let us bridge xgboost and pipelearner. Note that this same principle can be used for any other machine learning functions that don’t play nice with pipelearner.
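As a rough sketch of that principle applied elsewhere, here is the same kind of wrapper around stats::lm.fit(), a base R function that, like xgboost(), takes a numeric matrix and a response vector rather than data and formula. The wrapper name pl_lm_fit() and the toy data are hypothetical, and I haven't run this particular wrapper through pipelearner.

```r
# Hypothetical wrapper: adapts stats::lm.fit(x, y) to (data, formula)
pl_lm_fit <- function(data, formula, ...) {
  data <- as.data.frame(data)
  mf <- model.frame(formula, data)            # resolves `.` against data
  X  <- model.matrix(attr(mf, "terms"), mf)   # design matrix incl. intercept
  y  <- model.response(mf)                    # outcome vector
  stats::lm.fit(x = X, y = y, ...)
}

# Toy data, invented for illustration
toy <- data.frame(x1 = c(1, 2, 3, 4, 5),
                  x2 = c(2, 1, 4, 3, 5),
                  y  = c(1.2, 1.9, 3.4, 3.8, 5.1))

toy_fit <- pl_lm_fit(toy, y ~ .)
names(toy_fit$coefficients)
```

Note that lm.fit() returns a plain list with no predict() method, so any accuracy or error function used afterwards would have to compute predictions from the coefficients itself.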

Bonus: bootstrapped cross validation

For those of you who are comfortable, below is a bonus example of using 100 bootstrapped cross-validation samples to examine consistency in the accuracy. It doesn’t get much easier than using pipelearner!

results <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = 25) %>%
  learn_cvpairs(n = 100) %>%
  learn() %>%
  mutate(
    test_accuracy  = pmap_dbl(list(fit, test,  target), accuracy)
  )
#> [1]  train-rmse:0.357471
#> [2]  train-rmse:0.256735
#> ...
results %>%
  ggplot(aes(test_accuracy)) +
    geom_histogram(bins = 30) +
    scale_x_continuous(labels = scales::percent) +
    theme_minimal() +
    labs(x = "Accuracy", y = "Number of samples",
         title = "Test accuracy distribution for\n100 bootstrapped samples")

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

Reposted from: https://drsimonj.svbtle.com/with-our-powers-combined-xgboost-and-pipelearner