Tech news 7: tidymodels
This week’s email is about tidymodels
.
The
tidymodels
framework is a collection of packages for modeling and machine learning usingtidyverse
principles.
Tidymodels
offers a consistent and flexible framework for your data science and data programming needs. This tool suite is designed to streamline and simplify the process of building and tuning models, making it easier for researchers to extract meaningful insights from their data.
To illustrate the use of key packages in tidymodels
, here’s a short example workflow:
Workflow (
The workflow package enables you to assemble the individual components of your modeling process into a cohesive workflow. You can define the order of operations and specify the preprocessing steps, model fitting, and model evaluation) {
Recipes (
The recipes package helps with data preprocessing and feature engineering. You can create a recipe to define the steps for encoding categorical variables, scaling numeric variables, and creating new features) %>%
Dials (
The dials package provides a way to define tuning parameters for models. You can create a set of possible values for each parameter using the grid_*()
functions) %>%
Parsnip (
The parsnip package allows you to specify the type of model you want to build. For example, you can create a linear regression model using linear_reg()
function) %>%
Tune (
The tune package provides functions for hyperparameter tuning. You can use the ’tune_*()’ functions to automatically search for the best combination of hyperparameters for your model using a specified tuning grid)
}
By utilising these packages in sequence, you can build a comprehensive modeling pipeline that includes specifying the model type, preprocessing data, tuning hyperparameters, and evaluating model performance.
Here is a code example from the tidymodels “Get started” page:
<- # Loading the data
urchins ::read_csv("https://tidymodels.org/start/models/urchins.csv") %>%
readr::setNames(c("food_regime", "initial_volume", "width")) %>%
stats::mutate(food_regime = factor(food_regime, levels = c("Initial", "Low", "High")))
dplyr
# Then preparing the model: a linear regression
::linear_reg() %>% # This is the model type
parsnip::set_engine("keras") # This is the engine / fitting method
parsnip<- parsnip::linear_reg() # The default engine: “lm” is used
lm_mod
# And now we fit the model:
<-
lm_fit %>%
lm_mod fit(width ~ initial_volume * food_regime, data = urchins)
lm_fit#> parsnip model object
#>
#>
#> Call:
#> stats::lm(formula = width ~ initial_volume * food_regime, data = data)
#>
#> Coefficients:
#> (Intercept) initial_volume
#> 0.0331216 0.0015546
#> food_regimeLow food_regimeHigh
#> 0.0197824 0.0214111
#> initial_volume:food_regimeLow initial_volume:food_regimeHigh
#> -0.0012594 0.0005254
# broom::tidy() offers a clean output:
::tidy(lm_fit)
broom#> # A tibble: 6 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.0331 0.00962 3.44 0.00100
#> 2 initial_volume 0.00155 0.000398 3.91 0.000222
#> 3 food_regimeLow 0.0198 0.0130 1.52 0.133
#> 4 food_regimeHigh 0.0214 0.0145 1.47 0.145
#> 5 initial_volume:food_regimeLow -0.00126 0.000510 -2.47 0.0162
#> 6 initial_volume:food_regimeHigh 0.000525 0.000702 0.748 0.457
Now what if we wanted something different like a Bayesian model? No need to change everything, tidymodels functions are generic even though the underlying packages use very different syntax:
# set the prior distribution
<- rstanarm::student_t(df = 1)
prior_dist
# make the parsnip model
<-
bayes_mod ::linear_reg() %>%
parsnip::set_engine("stan",
parsnipprior_intercept = prior_dist,
prior = prior_dist)
# train the model
<-
bayes_fit %>%
bayes_mod fit(width ~ initial_volume * food_regime, data = urchins)
:
Easily create training and testing data sets with rsamples<- rsamples::initial_split(urchins %>% select(-width),
urchins_split strata = class)
<- rsamples::training(urchins_split)
urchins_train <- rsamples::testing(urchins_split)
urchins_test for even more reproducibility and reliability:
Create a Workflow <-
urchins_wflow ::workflow() %>%
workflow::add_model(lm_mod) %>%
workflow::add_recipe(urchins_rec)
workflow:
Fit the model on the training data<-
urchins_fit %>%
urchins_wflow fit(data = train_data)
!
And easily fit and predict on the test datapredict(urchins_fit, test_data)
I hope this triggered your curiosity and you see potential for an increase in efficiency and reliability in your analytical work!
I relied heavily on the tidymodels
“Get started” page to prepare this email and I encourage anyone interested to consult it too or even follow the introductory or advanced workshops the authors published online!
Resources
The tidyverse blog is a great resource for staying up to date with the latest tidymodels news and developments:
- Tidymodels website: https://www.tidymodels.org/
- Tidymodels get started: https://www.tidymodels.org/start/
- Tidymodels workshops: https://workshops.tidymodels.org/
Happy to discuss it and collaborate!
Best wishes,
Alban