suppressPackageStartupMessages({
library(tidymodels)
library(tidyverse)
library(bonsai)
library(modeldata)
library(mirai)
})
I’ve been testing out mirai ever since I saw Charlie Gao and Will Landau’s joint talk. TL;DR: it’s really good. I didn’t realize how much I needed asynchronous map functions until I had them. In this post, I’ll demo an application of mirai to speed up the tuning process for a bunch of tidymodels workflows.
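To make “asynchronous map” concrete, here’s a minimal sketch of the pattern (the worker count and the slow_square function are just for illustration): mirai_map() returns immediately, the work happens on background processes, and the results are collected whenever you ask for them.

library(mirai)

daemons(2) # start two background workers (an arbitrary choice)

# a toy function that takes one second per call
slow_square <- function(x) {
  Sys.sleep(1)
  x^2
}

# returns immediately; the four calls run in the background
res <- mirai_map(1:4, slow_square)

# collect the results (blocks until all four are done)
res[]

daemons(0) # shut the workers down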
Packages
We’ll load the usual tidy libraries plus bonsai (for the aorsf engine), modeldata (for the meats data), and mirai.
Data
We’ll use the meats data from the modeldata package for this post. Some details taken from ?meats are below.
These data are recorded on a Tecator Infratec Food and Feed Analyzer working in the wavelength range 850 - 1050 nm by the Near Infrared Transmission (NIT) principle. Each sample contains finely chopped pure meat with different moisture, fat and protein contents. For each meat sample the data consists of a 100 channel spectrum of absorbances and the contents of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry.
A special thank you to Tecator (the company behind the instrument) for making these data publicly available, and a thank you to the modeldata maintainers as well.
data_meats <- modeldata::meats

data_meats
Resamples
We’ll be comparing different workflows to develop a prediction model in this post, so making a resampling object is a required step.
set.seed(321)
data_resamples <- vfold_cv(data_meats)

data_resamples
Recipe factory
The recipes package makes it so easy to compose many recipes with slightly different steps. Here, I start with a template to guide which steps to include in my recipe.
recipes_init <- tidyr::expand_grid(
  boxcox = c(TRUE, FALSE),
  pca = c(TRUE, FALSE),
  poly = c(TRUE, FALSE)
)

recipes_init
Next I go through the template row by row and make a recipe for each specification. What I like about this approach is the control over the order and inclusion of specific steps. You can even indicate that certain steps in the recipe should be tuned (a quick sketch follows), but we’ll leave a full demo for another post.
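As a teaser, here’s a minimal sketch of what a tunable step could look like, marking the number of principal components for tuning. This rec_tuned object is illustrative only and isn’t used in the rest of the post.

# illustrative only: num_comp becomes a tuning parameter, so
# tune_grid() would try several values rather than a fixed 10
rec_tuned <- recipe(protein ~ ., data = data_meats) %>%
  step_pca(all_numeric_predictors(), num_comp = tune())

Back to the recipe factory: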
# Generate recipes based on a grid of parameters
recipe_list <- recipes_init %>%
  mutate(
    recipe = pmap(
      .l = list(boxcox, pca, poly),
      .f = function(use_boxcox, use_pca, use_poly) {

        rec_init <- recipe(protein ~ ., data = data_meats)

        if (use_boxcox) {
          rec_init <- rec_init %>%
            step_BoxCox(all_numeric_predictors())
        }

        if (use_pca) {
          rec_init <- rec_init %>%
            step_pca(all_numeric_predictors(), num_comp = 10)
        }

        if (use_poly) {
          rec_init <- rec_init %>%
            step_poly(all_numeric_predictors())
        }

        rec_init

      }
    )
  ) %>%
  mutate(
    boxcox = factor(boxcox, labels = c("box.no", "box.yes")),
    pca = factor(pca, labels = c("pca.no", "pca.yes")),
    poly = factor(poly, labels = c("poly.no", "poly.yes"))
  ) %>%
  unite(col = 'name', boxcox, pca, poly) %>%
  mutate(name = paste0(name, '.')) %>%
  deframe()

# check out the first recipe in the list
recipe_list[[1]]
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 102
── Operations
• Box-Cox transformation on: all_numeric_predictors()
• PCA extraction with: all_numeric_predictors()
• Orthogonal polynomials on: all_numeric_predictors()
Model specifications
We’ll use three modeling strategies and basic tuning where we can.
model_list <- list(
  glmnet = linear_reg(penalty = tune(), mixture = tune()) %>%
    set_engine("glmnet"),
  aorsf = rand_forest(min_n = tune()) %>%
    set_engine("aorsf") %>%
    set_mode("regression"),
  svm = svm_rbf(cost = tune(),
                rbf_sigma = tune(),
                margin = tune()) %>%
    set_mode('regression')
)
Workflows
We’ll start by making a data frame containing all the modeling workflows.
metrics <- metric_set(rsq, rmse)

wf <- recipe_list %>%
  workflow_set(models = model_list) %>%
  option_add(metrics = metrics)

wf
Now, how do we make wf compatible with mirai? Well, there are probably many ways to do that, but this works fine for me:
wf_split <- split(wf, f = seq(nrow(wf)))

# check out the first two workflows
wf_split[1:2]
$`1`
# A workflow set/tibble: 1 × 4
wflow_id info option result
<chr> <list> <list> <list>
1 box.yes_pca.yes_poly.yes._glmnet <tibble [1 × 4]> <opts[1]> <list [0]>
$`2`
# A workflow set/tibble: 1 × 4
wflow_id info option result
<chr> <list> <list> <list>
1 box.yes_pca.yes_poly.yes._aorsf <tibble [1 × 4]> <opts[1]> <list [0]>
Why use split? Because I plan on sending each workflow object into mirai_map, which can set up asynchronous workers for each item in a list of inputs.
# tell mirai to initiate 10 workers
# (use a suitable number depending on your resources)
daemons(n = 10)

# this seems unnecessary, but bonsai's parsnip
# wrappers will only be available in the workers'
# environments if we explicitly call library(bonsai)
# there.
everywhere(library(bonsai))

# run tune_grid for each workflow, separately
res <- mirai_map(
  .x = wf_split,
  .f = workflow_map,
  .args = list(fn = 'tune_grid',
               resamples = data_resamples,
               metrics = metrics)
)
So what’s so good about asynchronous parallel programming? Well, while res is running over 10 workers, you still have access to your current R session. You can even print res and check its progress like so:
res
< mirai map [0/24] >

(waits 10 seconds)

res
< mirai map [6/24] >

(waits some more)

res
< mirai map [24/24] >
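Printing is handy interactively, but you can also query progress programmatically. A small sketch, assuming mirai’s unresolved() helper (which reports whether a task has finished yet):

# TRUE while at least one task is still running (non-blocking)
unresolved(res)

# or count the unfinished tasks individually
sum(map_lgl(res, unresolved))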
The lesson here is that mirai_map can substantially speed up your code and also leave you free to keep working while it runs. That’s two complementary ways to speed up your work. When all the tasks are done, put the results back together like so:
# put all the workflows back together.
wf_done <- map_dfr(res, 'data')

wf_done
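One bit of housekeeping that isn’t strictly required but is good practice: once the results are collected, you can shut the background workers down by resetting the daemon count to zero.

# shut down the 10 workers now that tuning is done
daemons(0)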
Visualize
Now that we’ve covered computing, let’s talk science. We tuned and evaluated three learners paired with eight unique pre-processing recipes. Let’s put together a visual to illustrate how the prediction accuracy of learners varied over these recipes.
data_gg <- wf_done %>%
  rank_results("rsq", select_best = TRUE) %>%
  filter(.metric == "rsq") %>%
  separate(col = wflow_id,
           into = c(".preproc", ".model"),
           sep = '\\.\\_',
           remove = FALSE)

data_gg <- data_gg %>%
  arrange(.preproc, desc(rank)) %>%
  mutate(text = NA_character_) %>%
  split(.$.preproc) %>%
  map2_dfr(names(.), ~ add_row(.x, text = .y, .before = 1)) %>%
  mutate(x = rev(seq(n())))
top_score <- max(data_gg$mean, na.rm = TRUE)
fig <- ggplot(data_gg, aes(x = x,
                           y = mean,
                           ymin = mean - 1.96 * std_err,
                           ymax = mean + 1.96 * std_err,
                           fill = .preproc)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  geom_text(aes(label = .model, y = mean, hjust = -0.1)) +
  geom_text(aes(label = text, y = 0, hjust = 0)) +
  geom_hline(yintercept = top_score,
             color = 'grey',
             linetype = 2) +
  theme_minimal() +
  scale_fill_viridis_d() +
  theme(axis.text.y = element_blank(),
        panel.grid = element_blank(),
        axis.line.x = element_line()) +
  labs(y = "R-squared",
       x = "",
       fill = "Pre-processing")

fig
From the figure, we see:

- aorsf obtained the best overall prediction accuracy (i.e., \(R^2\) statistic in testing data) when it was given the plain, no-extra-steps recipe. Thanks, aorsf, for making my experiment boring.
- The svm learner had very different prediction accuracy across recipes, and it even beat the aorsf learner in some recipes where principal component analysis was applied to numeric predictors.
- glmnet does pretty well everywhere, but it seems like using polynomials is only beneficial for glmnet if we also use principal component analysis.
Tabulate
In case you prefer looking at results grouped by model instead of grouped by recipe, here’s a little table that does so:
data_rsq <- wf_done %>%
  collect_metrics(summarize = TRUE) %>%
  separate(wflow_id,
           into = c("box", "pca", "poly", "model"),
           sep = '_') %>%
  mutate(
    poly = str_remove(poly, "\\.$"),
    across(c(box, pca, poly), ~ str_remove(.x, ".*\\."))
  ) %>%
  filter(.metric == 'rsq') %>%
  group_by(box, pca, poly, model) %>%
  arrange(desc(mean)) %>%
  slice(1) %>%
  select(box:model, mean, std_err) %>%
  arrange(model, desc(mean))
library(gt)
gt(data_rsq, groupname_col = 'model') %>%
  tab_options(table.width = pct(80)) %>%
  tab_spanner(label = "Pre-processing options",
              columns = c("box", "pca", "poly")) %>%
  tab_spanner(label = "Prediction accuracy",
              columns = c("mean", "std_err")) %>%
  cols_align('center') %>%
  cols_align('left', columns = 'box') %>%
  cols_label(box = "Box-Cox",
             pca = "Principal components",
             poly = "Polynomials",
             mean = "R-squared",
             std_err = "Standard error")
| Box-Cox | Principal components | Polynomials | R-squared | Standard error |
|---|---|---|---|---|
| **aorsf** |  |  |  |  |
| no | no | no | 0.9419690 | 0.009308679 |
| yes | no | no | 0.9413249 | 0.009697220 |
| no | no | yes | 0.9224564 | 0.014067223 |
| yes | no | yes | 0.9189269 | 0.013127552 |
| yes | yes | no | 0.9186974 | 0.009955760 |
| yes | yes | yes | 0.9003848 | 0.013711507 |
| no | yes | no | 0.8820053 | 0.015680558 |
| no | yes | yes | 0.8608096 | 0.028205456 |
| **glmnet** |  |  |  |  |
| yes | yes | yes | 0.8854123 | 0.015113051 |
| no | yes | yes | 0.8846210 | 0.020529055 |
| yes | no | no | 0.8834284 | 0.012616637 |
| yes | yes | no | 0.8759954 | 0.016490965 |
| no | no | no | 0.8693738 | 0.023352998 |
| no | yes | no | 0.8641152 | 0.023903995 |
| yes | no | yes | 0.8206020 | 0.020318719 |
| no | no | yes | 0.8166270 | 0.025821975 |
| **svm** |  |  |  |  |
| yes | yes | yes | 0.9074204 | 0.015053010 |
| yes | yes | no | 0.8926469 | 0.027590826 |
| no | yes | no | 0.8876902 | 0.024842583 |
| no | yes | yes | 0.8039769 | 0.029382297 |
| yes | no | yes | 0.7624643 | 0.048258825 |
| no | no | yes | 0.7619207 | 0.057404793 |
| no | no | no | 0.7514309 | 0.051145069 |
| yes | no | no | 0.6936020 | 0.043528774 |
Wrapping up
Use mirai: it’s a game changer. It works well with tidymodels, and I’m betting it works well with just about any task that can be organized in a list.
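For instance, here’s a hypothetical sketch of the same pattern outside of tidymodels (the "data" directory and its csv files are made up):

library(mirai)

daemons(4)

# summarize a batch of csv files in parallel
# (the "data" directory is hypothetical)
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

summaries <- mirai_map(files, function(path) {
  dat <- read.csv(path)
  summary(dat)
})

# collect the results when they're ready
summaries[]

daemons(0)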