---
title: "04_model_building"
output: html_document
---
layout: false
class: middle, center, inverse
# 4.3 `parsnip`:<br><br>A Common API to Modeling and Analysis Functions
---
background-image: url(https://www.tidymodels.org/images/parsnip.png)
background-position: 97.5% 2.5%
background-size: 7%
layout: true
---
## 4.3 `parsnip`: A Unified Modeling API
**Different models, different packages**
The `R` ecosystem offers a plethora of different packages for implementing machine learning models: `stats::lm`, `stats::glm`, `MASS::lda`, `class::knn`, `glmnet::glmnet`, `rpart::rpart`, `randomForest::randomForest`, `gbm::gbm`, `e1071::svm`, etc.
You will likely struggle with the varying naming conventions, function interfaces, and syntactic intricacies of each package.
--
```{r, echo=F, out.width='35%', out.extra='style="float:right; padding:10px"'}
knitr::include_graphics("https://tenor.com/view/ballin-juggling-talent-juggle-wow-gif-16262578.gif")
```
**Same models, different packages**
The same issue persists if you implement the very same model using alternative packages (see the panels and the sketch below).
.panelset[
.panel[.panel-name[randomForest]
- **Number of predictors:** mtry
- **Number of trees:** ntree
- **Minimum node size:** nodesize
]
.panel[.panel-name[ranger]
- **Number of predictors:** mtry
- **Number of trees:** num.trees
- **Minimum node size:** min.node.size
]
.panel[.panel-name[sparklyr]
- **Number of predictors:** feature_subset_strategy
- **Number of trees:** num_trees
- **Minimum node size:** min_instances_per_node
]
]
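For illustration, here is a minimal sketch (not evaluated) of fitting the very same forest with two of these packages, assuming the `train_set` and binary outcome `died` used later in this section:
```{r, eval=F}
library(randomForest)
library(ranger)

# identical model, two interfaces: `ntree`/`nodesize` vs. `num.trees`/`min.node.size`
randomForest(died ~ ., data = train_set, ntree = 1000, nodesize = 10)
ranger(died ~ ., data = train_set, num.trees = 1000, min.node.size = 10)
```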
???
- note: this heterogeneity can be observed across the whole modeling landscape in R (e.g., each package also ships its own `predict()` method with slightly differing naming conventions)
---
## 4.3 `parsnip`: A Unified Modeling API
```{r, echo=F, out.width='50%', out.extra='style="float:right; padding:10px"'}
knitr::include_graphics("https://tenor.com/view/balls-rolling-racing-rolling-on-ball-yoga-balls-gif-15365855.gif")
```
`parsnip` provides a unified interface and syntax for modeling, which streamlines your overall modeling workflow. The goals of `parsnip` are twofold:
1. Decoupling model definition from model fitting and model evaluation<br><br>
2. Harmonizing function arguments (e.g., `ntree`, `num.trees` and `num_trees` become `trees`, or `k` becomes `neighbors`), as the sketch below illustrates
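As a brief sketch of the second goal, the harmonized `trees` argument is translated back into each engine's native argument name:
```{r, eval=F}
# one harmonized argument, two engines: `translate()` reveals the mapping
# to `ntree` (randomForest) and `num.trees` (ranger)
rand_forest(trees = 1000, mode = "classification") %>%
  set_engine("randomForest") %>%
  translate()

rand_forest(trees = 1000, mode = "classification") %>%
  set_engine("ranger") %>%
  translate()
```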
???
- the goal is to make function arguments more expressive (`neighbors` instead of `k`, `penalty` instead of `lambda`)
- in `parsnip`, the number of trees is simply `trees`
---
## 4.3 `parsnip`: A Unified Modeling API
```{r, echo=F, out.height='60%', out.width='60%', out.extra='style="float:right; padding:10px"'}
knitr::include_graphics("https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/parsnip.png")
```
A `parsnip` model specification consists of three individual components:
- **Type:** The model type that is about to be fitted (e.g., linear/logit regression, random forest or SVM).<br><br>
- **Mode:** The mode of prediction, i.e. regression or classification.<br><br>
- **Engine:** The computational engine implemented in `R` which usually corresponds to a certain modeling function (`lm`, `glm`), package (e.g., `rpart`, `glmnet`, `randomForest`) or computing framework (e.g., `Stan`, `sparklyr`).
.footnote[
*Note: Check all models and engines supported by `parsnip` on the [`tidymodels` website](https://www.tidymodels.org/find/parsnip/) or using the RStudio Addin.*
]
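You can also query the registered engines for a given model type directly from the console (assuming a recent `parsnip` version):
```{r, eval=F}
# returns a tibble with one row per engine-mode combination
show_engines("logistic_reg")
```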
---
## 4.3 `parsnip`: A Unified Modeling API
**Logistic classifier:**
```{r}
log_cls <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# equivalent: logistic_reg(mode = "classification", engine = "glm")
log_cls
```
???
- note that some model families support both modes, while others support only one (e.g., LDA only classification, ARIMA only regression)
- note that we have not referenced the data in any way so far (variable roles are entirely specified by our recipe)
- also, we have not yet trained or validated our model; we have only defined it
---
## 4.3 `parsnip`: A Unified Modeling API
**Regularized logistic classifier:**
```{r}
lasso_cls <- logistic_reg() %>%
  set_args(penalty = 0.1, mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet", family = "binomial")

lasso_cls
```
.footnote[
_Note: `parsnip` distinguishes between **model arguments** and **engine arguments**. The former reflect hyperparameters that are frequently used across various model packages (i.e. engines) whereas the latter reflect arguments that are usually engine-specific. Model arguments are harmonized across modeling packages whereas engine arguments are not._
]
???
- the function arguments could also be specified directly in the model function, but this way it is more transparent and sequential
- `mixture` reflects the proportion of the L1 penalty (1 = pure lasso, 0 = pure ridge)
---
## 4.3 `parsnip`: A Unified Modeling API
**Decision tree classifier:**
```{r}
dt_cls <- decision_tree() %>%
  set_args(cost_complexity = 0.01, tree_depth = 30, min_n = 20) %>%
  set_mode("classification") %>%
  set_engine("rpart")

dt_cls
```
.footnote[
*Note: If not explicitly specified, `parsnip` adopts the model's default parameters (i.e. function arguments) defined by the underlying engine (here `rpart`).*
]
---
## 4.3 `parsnip`: A Unified Modeling API
**Tree bagging classifier:**
```{r}
rand_forest() %>%
  set_args(trees = 1000, mtry = .cols()) %>%
  set_mode("classification") %>%
  set_engine("randomForest")
```
.footnote[
*Note: You can use data set characteristics as placeholder arguments: `.preds()` and `.cols()` capture the number of predictors in your data before and after preprocessing (e.g., one-hot encoding), respectively.*
]
---
## 4.3 `parsnip`: A Unified Modeling API
**Random forest classifier:**
```{r}
rand_forest() %>%
  set_args(trees = 1000, mtry = floor(sqrt(.cols()))) %>%
  set_mode("classification") %>%
  set_engine("randomForest")
```
.footnote[
*Note: Generally, the square root of the number of available predictors is a good starting point for `mtry`. From there on, you could double or halve the number of predictors sampled at each split.*
]
---
## 4.3 `parsnip`: A Unified Modeling API
**k-nearest-neighbor classifier:**
```{r}
nearest_neighbor() %>%
  set_args(neighbors = 5, dist_power = 2) %>%
  set_mode("classification") %>%
  set_engine("kknn")
```
???
- `dist_power`: 1 (Manhattan distance), 2 (Euclidean distance)
---
## 4.3 `parsnip`: A Unified Modeling API
**SVM classifier:**
```{r}
svm_rbf() %>%
  set_args(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")
```
.footnote[
*Note: Use the `tune()` placeholder as a model argument when the parameter is supposed to be specified later on in the workflow (e.g., during hyperparameter tuning).*
]
---
## 4.3 `parsnip`: A Unified Modeling API
Finally, it is time to train our specified model! Since some modeling functions expect a formula as input (e.g., `lm()`) while others expect a vector, a matrix (e.g., `glmnet()`) or a data frame, `parsnip` offers two fitting interfaces.
.panelset[
.panel[.panel-name[Formula interface]
```{r, results='hide'}
dt_cls_fit <- dt_cls %>%
  fit(formula = died ~ ., data = train_set)

dt_cls_fit
```
]
.panel[.panel-name[Matrix interface]
```{r, eval=F}
dt_cls_fit <- dt_cls %>%
  fit_xy(x = train_set %>% select(-died), y = train_set$died)

dt_cls_fit
```
]
.panel[.panel-name[Translate]
```{r}
dt_cls_fit$spec %>%
  translate()
```
]
.panel[.panel-name[A Warning]
<br>
`r emo::ji("warning")` **Notice that we did not apply any of our predefined preprocessing steps yet!** `r emo::ji("warning")`
- The code will throw an error if we try to fit any of our logit models due to the absence of dummies.
- The Lasso model would likely perform poorly due to the differently scaled predictors.
- Our models will likely always predict the negative class due to the severe class imbalance.
]
]
.footnote[
*Note: Only the formula notation automatically creates dummies whereas `fit_xy()` takes the data as-is.*
]
???
- Apply `translate()` to investigate how `parsnip` translates the specification into the underlying computational engine.
---
## 4.3 `parsnip`: A Unified Modeling API
After fitting the model, we can finally predict the response in the test data.
```{r}
dt_cls_fit %>%
  predict(new_data = test_set, type = "prob") %>%
  glimpse()
```
--
**`tidymodels` prediction rules:**
1. Predictions are returned as a `tibble` (no need to extract predictions from an object).<br><br>
2. Column names are predictable (`.pred`, `.pred_class`, `.pred_lower`/`.pred_upper`, etc. depending on the prediction `type`).<br><br>
3. The number of predictions equals the number of data points in `new_data` (and is in the same order).
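For example, requesting hard class predictions instead of probabilities follows the same naming scheme:
```{r, eval=F}
# returns a tibble with a single factor column `.pred_class`
dt_cls_fit %>%
  predict(new_data = test_set, type = "class")
```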
???
- the leading dot protects against name collisions when merging predictions with existing columns
---
## 4.3 `parsnip`: A Unified Modeling API
Thanks to these rules, we can directly combine the predictions with the `test_set`.
```{r}
test_set %>%
  dplyr::bind_cols(predict(dt_cls_fit, new_data = ., type = "prob"))
```
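Alternatively (assuming a reasonably recent `parsnip` version), `augment()` performs the predict-and-bind step in one call:
```{r, eval=F}
# appends `.pred_class` and the `.pred_*` probability columns to `test_set`
augment(dt_cls_fit, new_data = test_set)
```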