---
title: "04_model_building"
output: html_document
---
layout: false
class: middle, center, inverse
# 4.3 `parsnip`:<br><br>A Common API to Modeling and Analysis Functions
---
background-image: url(https://www.tidymodels.org/images/parsnip.png)
background-position: 97.5% 2.5%
background-size: 7%
layout: true
---
## 4.3 `parsnip`: A Unified Modeling API
**Different models, different packages**
The `R` ecosystem offers a plethora of different packages for implementing machine learning models: `stats::lm`, `stats::glm`, `MASS::lda`, `class::knn`, `glmnet::glmnet`, `rpart::rpart`, `randomForest::randomForest`, `gbm::gbm`, `e1071::svm`, etc.
You will likely struggle with the varying naming conventions, function interfaces, and syntactic intricacies of each package.
--
```{r, echo=F, out.width='35%', out.extra='style="float:right; padding:10px"'}
knitr::include_graphics("https://tenor.com/view/ballin-juggling-talent-juggle-wow-gif-16262578.gif")
```
**Same models, different packages**
The same issue persists if you implement the very same model using alternative packages (see the panels and the sketch below).
.panelset[
.panel[.panel-name[randomForest]
- **Number of predictors:** mtry
- **Number of trees:** ntree
- **Minimum node size:** nodesize
]
.panel[.panel-name[ranger]
- **Number of predictors:** mtry
- **Number of trees:** num.trees
- **Minimum node size:** min.node.size
]
.panel[.panel-name[sparklyr]
- **Number of predictors:** feature_subset_strategy
- **Number of trees:** num_trees
- **Minimum node size:** min_instances_per_node
]
]
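For illustration, here is a minimal sketch (not evaluated) of fitting the very same forest with two of these packages, assuming the `train_set` and binary outcome `died` used later in this section:
```{r, eval=F}
library(randomForest)
library(ranger)

# identical model, two interfaces: `ntree`/`nodesize` vs. `num.trees`/`min.node.size`
randomForest(died ~ ., data = train_set, ntree = 1000, nodesize = 10)
ranger(died ~ ., data = train_set, num.trees = 1000, min.node.size = 10)
```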
???
- note: this heterogeneity can be observed across the whole modeling landscape in R (e.g., each package also ships its own `predict()` method with slightly differing naming conventions)
---
## 4.3 `parsnip`: A Unified Modeling API
```{r, echo=F, out.width='50%', out.extra='style="float:right; padding:10px"'}
knitr::include_graphics("https://tenor.com/view/balls-rolling-racing-rolling-on-ball-yoga-balls-gif-15365855.gif")
```
`parsnip` provides a unified interface and syntax for modeling, which streamlines your overall modeling workflow. The goals of `parsnip` are twofold:
1. Decoupling model definition from model fitting and model evaluation<br><br>
2. Harmonizing function arguments (e.g., `ntree`, `num.trees` and `num_trees` become `trees`, or `k` becomes `neighbors`), as the sketch below illustrates
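As a brief sketch of the second goal, the harmonized `trees` argument is translated back into each engine's native argument name:
```{r, eval=F}
# one harmonized argument, two engines: `translate()` reveals the mapping
# to `ntree` (randomForest) and `num.trees` (ranger)
rand_forest(trees = 1000, mode = "classification") %>%
  set_engine("randomForest") %>%
  translate()

rand_forest(trees = 1000, mode = "classification") %>%
  set_engine("ranger") %>%
  translate()
```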
???
- the goal is to make function arguments more expressive (`neighbors` instead of `k`, `penalty` instead of `lambda`)
- in `parsnip`, the number of trees is simply `trees`
---
## 4.3 `parsnip`: A Unified Modeling API
```{r, echo=F, out.height='60%', out.width='60%', out.extra='style="float:right; padding:10px"'}
knitr::include_graphics("https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/parsnip.png")
```
A `parsnip` model specification consists of three individual components:
- **Type:** The model type that is about to be fitted (e.g., linear/logit regression, random forest or SVM).<br><br>
- **Mode:** The mode of prediction, i.e. regression or classification.<br><br>
- **Engine:** The computational engine implemented in `R` which usually corresponds to a certain modeling function (`lm`, `glm`), package (e.g., `rpart`, `glmnet`, `randomForest`) or computing framework (e.g., `Stan`, `sparklyr`).
.footnote[
*Note: Check all models and engines supported by `parsnip` on the [`tidymodels` website](https://www.tidymodels.org/find/parsnip/) or using the RStudio Addin.*
]
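You can also query the registered engines for a given model type directly from the console (assuming a recent `parsnip` version):
```{r, eval=F}
# returns a tibble with one row per engine-mode combination
show_engines("logistic_reg")
```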
---
## 4.3 `parsnip`: A Unified Modeling API
**Logistic classifier:**
```{r}
log_cls <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# equivalent: logistic_reg(mode = "classification", engine = "glm")
log_cls
```
???
- note that some model families support both modes, while others support only one (e.g., LDA only classification, ARIMA only regression)
- note that we have not referenced the data in any way so far (variable roles are entirely specified by our recipe)
- also, we have not yet trained or validated our model; we have only defined it
---
## 4.3 `parsnip`: A Unified Modeling API
**Regularized logistic classifier:**
```{r}
lasso_cls <- logistic_reg() %>%
  set_args(penalty = 0.1, mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet", family = "binomial")

lasso_cls
```
.footnote[
_Note: `parsnip` distinguishes between **model arguments** and **engine arguments**. The former reflect hyperparameters that are frequently used across various model packages (i.e. engines) whereas the latter reflect arguments that are usually engine-specific. Model arguments are harmonized across modeling packages whereas engine arguments are not._
]
???
- the function arguments could also be specified directly in the model function, but this way it is more transparent and sequential
- `mixture` reflects the proportion of the L1 penalty (1 = pure lasso, 0 = pure ridge)
---
## 4.3 `parsnip`: A Unified Modeling API
**Decision tree classifier:**
```{r}
dt_cls <- decision_tree() %>%
  set_args(cost_complexity = 0.01, tree_depth = 30, min_n = 20) %>%
  set_mode("classification") %>%
  set_engine("rpart")

dt_cls
```
.footnote[
*Note: If not explicitly specified, `parsnip` adopts the model's default parameters (i.e. function arguments) defined by the underlying engine (here `rpart`).*
]
---
## 4.3 `parsnip`: A Unified Modeling API
**Tree bagging classifier:**
```{r}
rand_forest() %>%
  set_args(trees = 1000, mtry = .cols()) %>%
  set_mode("classification") %>%
  set_engine("randomForest")
```
.footnote[
*Note: You can use data set characteristics as placeholder arguments: `.preds()` and `.cols()` capture the number of predictors in your data before and after preprocessing (e.g., one-hot encoding), respectively.*
]
---
## 4.3 `parsnip`: A Unified Modeling API
**Random forest classifier:**
```{r}
rand_forest() %>%
  set_args(trees = 1000, mtry = floor(sqrt(.cols()))) %>%
  set_mode("classification") %>%
  set_engine("randomForest")
```
.footnote[
*Note: Generally, the square root of the number of available predictors is a good starting point for `mtry`. From there on, you could double or halve the number of predictors sampled at each split.*
]
---
## 4.3 `parsnip`: A Unified Modeling API
**k-nearest-neighbor classifier:**
```{r}
nearest_neighbor() %>%
  set_args(neighbors = 5, dist_power = 2) %>%
  set_mode("classification") %>%
  set_engine("kknn")
```
???
- `dist_power`: 1 (Manhattan distance), 2 (Euclidean distance)
---
## 4.3 `parsnip`: A Unified Modeling API
**SVM classifier:**
```{r}
svm_rbf() %>%
  set_args(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")
```
.footnote[
*Note: Use the `tune()` placeholder as a model argument when the parameter is supposed to be specified later on in the workflow (e.g., during hyperparameter tuning).*
]
---
## 4.3 `parsnip`: A Unified Modeling API
Finally, it is time to train our specified model! Since some modeling functions expect a formula as input (e.g., `lm()`) while others expect a vector, a matrix (e.g., `glmnet()`) or a data frame, `parsnip` offers two fitting interfaces.
.panelset[
.panel[.panel-name[Formula interface]
```{r, results='hide'}
dt_cls_fit <- dt_cls %>%
  fit(formula = died ~ ., data = train_set)

dt_cls_fit
```
]
.panel[.panel-name[Matrix interface]
```{r, eval=F}
dt_cls_fit <- dt_cls %>%
  fit_xy(x = train_set %>% select(-died), y = train_set$died)

dt_cls_fit
```
]
.panel[.panel-name[Translate]
```{r}
dt_cls_fit$spec %>%
  translate()
```
]
.panel[.panel-name[A Warning]
<br>
`r emo::ji("warning")` **Notice that we did not apply any of our predefined preprocessing steps yet!** `r emo::ji("warning")`
- The code will throw an error if we try to fit any of our logit models due to the absence of dummies.
- The Lasso model would likely perform poorly due to the differently scaled predictors.
- Our models will likely always predict the negative class due to the severe class imbalance.
]
]
.footnote[
*Note: Only the formula notation automatically creates dummies whereas `fit_xy()` takes the data as-is.*
]
???
- Apply `translate()` to investigate how `parsnip` translates the specification into the underlying computational engine.
---
## 4.3 `parsnip`: A Unified Modeling API
After fitting the model, we can finally predict the response in the test data.
```{r}
dt_cls_fit %>%
  predict(new_data = test_set, type = "prob") %>%
  glimpse()
```
--
**`tidymodels` prediction rules:**
1. Predictions are returned as a `tibble` (no need to extract predictions from an object).<br><br>
2. Column names are predictable (`.pred`, `.pred_class`, `.pred_lower`/`.pred_upper`, etc. depending on the prediction `type`).<br><br>
3. The number of predictions equals the number of data points in `new_data` (and is in the same order).
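For example, requesting hard class predictions instead of probabilities follows the same naming scheme:
```{r, eval=F}
# returns a tibble with a single factor column `.pred_class`
dt_cls_fit %>%
  predict(new_data = test_set, type = "class")
```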
???
- the leading dot protects against name collisions when merging predictions with existing columns
---
## 4.3 `parsnip`: A Unified Modeling API
Thanks to these rules, we can directly combine the predictions with the `test_set`.
```{r}
test_set %>%
  dplyr::bind_cols(predict(dt_cls_fit, new_data = ., type = "prob"))
```
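Alternatively (assuming a reasonably recent `parsnip` version), `augment()` performs the predict-and-bind step in one call:
```{r, eval=F}
# appends `.pred_class` and the `.pred_*` probability columns to `test_set`
augment(dt_cls_fit, new_data = test_set)
```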