Tech news 7: tidymodels

Categories: experience, R, package, technews

Published: January 26, 2024

This week’s email is about tidymodels.

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.

Tidymodels offers a consistent and flexible framework for your data science and programming needs. The suite is designed to streamline the process of building, tuning, and evaluating models, making it easier for researchers to extract meaningful insights from their data.

To illustrate the key packages in tidymodels, here is how they fit together in a typical modeling workflow (see the code sketch below):

workflows: assembles the individual components of your modeling process into a cohesive whole. You define the order of operations and bundle the preprocessing steps, the model specification, and the fitting together in one object.

recipes: handles data preprocessing and feature engineering. A recipe defines steps such as encoding categorical variables, scaling numeric variables, and creating new features.

dials: defines tuning parameters for models. You can create a set of candidate values for each parameter, for example with the grid_*() functions.

parsnip: lets you specify the type of model you want to build; for example, a linear regression with the linear_reg() function.

tune: provides functions for hyperparameter tuning. The tune_*() functions automatically search for the best combination of hyperparameters for your model over a specified tuning grid.

By utilising these packages in sequence, you can build a comprehensive modeling pipeline that includes specifying the model type, preprocessing data, tuning hyperparameters, and evaluating model performance.
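
To make this concrete, here is a minimal sketch of such a pipeline (not from the tidymodels documentation). It assumes a hypothetical data frame my_data with a numeric outcome y; the preprocessing steps, the glmnet engine, the 5-fold cross-validation, and the grid size are arbitrary choices for illustration only:

library(tidymodels)

# Split a hypothetical data frame `my_data` and create resamples for tuning
splits <- rsample::initial_split(my_data)
folds  <- rsample::vfold_cv(rsample::training(splits), v = 5)

# recipes: preprocessing and feature engineering
rec <- recipes::recipe(y ~ ., data = rsample::training(splits)) %>% 
  recipes::step_dummy(recipes::all_nominal_predictors()) %>% 
  recipes::step_normalize(recipes::all_numeric_predictors())

# parsnip: model specification, with one parameter marked for tuning
mod <- parsnip::linear_reg(penalty = tune::tune(), mixture = 1) %>% 
  parsnip::set_engine("glmnet")

# workflows: bundle the recipe and the model
wflow <- workflows::workflow() %>% 
  workflows::add_recipe(rec) %>% 
  workflows::add_model(mod)

# dials: candidate values for the tuned parameter
grid <- dials::grid_regular(dials::penalty(), levels = 10)

# tune: search the grid over the resamples and inspect the best results
res <- tune::tune_grid(wflow, resamples = folds, grid = grid)
tune::show_best(res, metric = "rmse")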

Here is a code example from the tidymodels “Get started” page:

urchins <- # Loading the data
  readr::read_csv("https://tidymodels.org/start/models/urchins.csv") %>% 
  stats::setNames(c("food_regime", "initial_volume", "width")) %>% 
  dplyr::mutate(food_regime = factor(food_regime, levels = c("Initial", "Low", "High")))

# Then preparing the model: a linear regression
parsnip::linear_reg() %>%          # This is the model type
  parsnip::set_engine("keras")     # This is the engine / fitting method (shown only to illustrate alternative engines)

lm_mod <- parsnip::linear_reg()    # Without set_engine(), the default engine "lm" is used

# And now we fit the model:
lm_fit <- 
  lm_mod %>% 
  fit(width ~ initial_volume * food_regime, data = urchins)
lm_fit
#> parsnip model object
#> 
#> 
#> Call:
#> stats::lm(formula = width ~ initial_volume * food_regime, data = data)
#> 
#> Coefficients:
#>                    (Intercept)                  initial_volume  
#>                      0.0331216                       0.0015546  
#>                 food_regimeLow                 food_regimeHigh  
#>                      0.0197824                       0.0214111  
#>  initial_volume:food_regimeLow  initial_volume:food_regimeHigh  
#>                     -0.0012594                       0.0005254

# broom::tidy() offers a clean output:
broom::tidy(lm_fit)
#> # A tibble: 6 × 5
#>   term                            estimate std.error statistic  p.value
#>   <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)                     0.0331    0.00962      3.44  0.00100 
#> 2 initial_volume                  0.00155   0.000398     3.91  0.000222
#> 3 food_regimeLow                  0.0198    0.0130       1.52  0.133   
#> 4 food_regimeHigh                 0.0214    0.0145       1.47  0.145   
#> 5 initial_volume:food_regimeLow  -0.00126   0.000510    -2.47  0.0162  
#> 6 initial_volume:food_regimeHigh  0.000525  0.000702     0.748 0.457
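
Because tidy() returns a regular tibble, you can manipulate the results with the usual dplyr verbs. For example (a small illustration, not part of the original tutorial):

# Keep only the coefficients with p-values below 0.05, sorted by significance
broom::tidy(lm_fit) %>% 
  dplyr::filter(p.value < 0.05) %>% 
  dplyr::arrange(p.value)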

Now what if we wanted something different, like a Bayesian model? There is no need to change everything: tidymodels functions are generic, even though the underlying packages use very different syntax:

# set the prior distribution
prior_dist <- rstanarm::student_t(df = 1)

# make the parsnip model
bayes_mod <-   
  parsnip::linear_reg() %>% 
  parsnip::set_engine("stan", 
             prior_intercept = prior_dist, 
             prior = prior_dist) 

# train the model
bayes_fit <- 
  bayes_mod %>% 
  fit(width ~ initial_volume * food_regime, data = urchins)
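
The payoff of this generic interface is that the downstream code does not change. For example (a small sketch; the new_points data frame below is defined here just for illustration), predictions are obtained the same way from both fits:

# New observations to predict on (defined here for illustration)
new_points <- expand.grid(initial_volume = 20,
                          food_regime = c("Initial", "Low", "High"))

predict(lm_fit, new_data = new_points)     # ordinary least squares fit
predict(bayes_fit, new_data = new_points)  # Bayesian fit, same interface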

Easily create training and testing data sets with rsample:

urchins_split <- rsample::initial_split(urchins, strata = food_regime)
urchins_train <- rsample::training(urchins_split)
urchins_test  <- rsample::testing(urchins_split)
Create a workflow for even more reproducibility and reliability (here with a minimal recipe defined on the fly, since no recipe was created above):

# A minimal recipe: just the model formula, no extra preprocessing steps
urchins_rec <- recipes::recipe(width ~ initial_volume + food_regime, data = urchins_train)

urchins_wflow <- 
  workflows::workflow() %>% 
  workflows::add_model(lm_mod) %>% 
  workflows::add_recipe(urchins_rec)
Fit the model on the training data:

urchins_fit <- 
  urchins_wflow %>% 
  fit(data = urchins_train)
And easily predict on the test data!

predict(urchins_fit, new_data = urchins_test)
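
To close the loop with the evaluation step mentioned at the start, here is a small sketch (not from the original tutorial) that scores the predictions against the observed widths with yardstick:

# Bind the predictions to the observed outcome and compute regression metrics
urchins_results <- 
  predict(urchins_fit, new_data = urchins_test) %>% 
  dplyr::bind_cols(urchins_test %>% dplyr::select(width))

yardstick::metrics(urchins_results, truth = width, estimate = .pred)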

I hope this piqued your curiosity and that you see the potential for more efficient and reliable analytical work!

I relied heavily on the tidymodels “Get started” page (https://www.tidymodels.org/start/) to prepare this email, and I encourage anyone interested to consult it too, or even to follow the introductory and advanced workshops the authors have published online!

Resources

The tidyverse blog (https://www.tidyverse.org/blog/) is a great resource for staying up to date with the latest tidymodels news and developments.

Happy to discuss it and collaborate!

Best wishes,

Alban