-
Notifications
You must be signed in to change notification settings - Fork 292
/
Copy path13-grid-search.Rmd
956 lines (717 loc) · 48.3 KB
/
13-grid-search.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
```{r grid-setup, include = FALSE}
knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
library(finetune)
library(ggforce)
library(stringr)
library(av)
library(lme4)
data(cells)
theme_set(theme_bw())
library(doMC)
registerDoMC(cores = parallel::detectCores(logical = TRUE))
library(kableExtra)
tidymodels_prefer()
## -----------------------------------------------------------------------------
load("RData/mlp_times.RData")
```
# Grid Search {#grid-search}
In Chapter \@ref(tuning) we demonstrated how users can mark or tag arguments in preprocessing recipes and/or model specifications for optimization using the `tune()` function. Once we know what to optimize, it's time to address the question of how to optimize the parameters. This chapter describes *grid search* methods that specify the possible values of the parameters _a priori_. (Chapter \@ref(iterative-search) will continue the discussion by describing iterative search methods.)
Let's start by looking at two main approaches for assembling a grid.
## Regular and Nonregular Grids {#grids}
There are two main types of grids. A regular grid combines each parameter (with its corresponding set of possible values) factorially, i.e., by using all combinations of the sets. Alternatively, a nonregular grid is one where the parameter combinations are not formed from a small set of points.
Before we look at each type in more detail, let's consider an example model: the multilayer perceptron model (a.k.a. single layer artificial neural network). The parameters marked for tuning are:
* the number of hidden units
* the number of fitting epochs/iterations in model training
* the amount of weight decay penalization
:::rmdnote
Historically, the number of epochs was determined by early stopping; a separate validation set determined the length of training based on the error rate, since re-predicting the training set led to overfitting. In our case, the use of a weight decay penalty should prohibit overfitting, and there is little harm in tuning the penalty and the number of epochs.
:::
Using `r pkg(parsnip)`, the specification for a classification model fit using the `r pkg(nnet)` package is:
```{r grid-mlp}
library(tidymodels)
tidymodels_prefer()
mlp_spec <-
mlp(hidden_units = tune(), penalty = tune(), epochs = tune()) %>%
set_engine("nnet", trace = 0) %>%
set_mode("classification")
```
The argument `trace = 0` prevents extra logging of the training process. As shown in Section \@ref(tuning-params-tidymodels), the `extract_parameter_set_dials()` function can extract the set of arguments with unknown values and sets their `r pkg(dials)` objects:
```{r grid-mlp-param}
mlp_param <- extract_parameter_set_dials(mlp_spec)
mlp_param %>% extract_parameter_dials("hidden_units")
mlp_param %>% extract_parameter_dials("penalty")
mlp_param %>% extract_parameter_dials("epochs")
```
This output indicates that the parameter objects are complete and prints their default ranges. These values will be used to demonstrate how to create different types of parameter grids.
### Regular grids {-}
Regular grids are combinations of separate sets of parameter values. First, the user creates a distinct set of values for each parameter. The number of possible values need not be the same for each parameter. The `r pkg(tidyr)` function `crossing()` is one way to create a regular grid:
```{r grid-crossing}
crossing(
hidden_units = 1:3,
penalty = c(0.0, 0.1),
epochs = c(100, 200)
)
```
The parameter object knows the ranges of the parameters. The `r pkg(dials)` package contains a set of `grid_*()` functions that take the parameter object as input to produce different types of grids. For example:
```{r grid-reg}
grid_regular(mlp_param, levels = 2)
```
The `levels` argument is the number of levels per parameter to create. It can also take a named vector of values:
```{r grid-reg-lvls}
mlp_param %>%
grid_regular(levels = c(hidden_units = 3, penalty = 2, epochs = 2))
```
There are techniques for creating regular grids that do not use all possible values of each parameter set. These _fractional factorial designs_ [@BHH] could also be used. To learn more, consult the CRAN Task View for experimental design.^[<https://CRAN.R-project.org/view=ExperimentalDesign>]
:::rmdwarning
Regular grids can be computationally expensive to use, especially when there are a medium-to-large number of tuning parameters. This is true for many models but not all. As discussed in Section \@ref(efficient-grids) below, there are many models whose tuning time _decreases_ with a regular grid!
:::
One advantage to using a regular grid is that the relationships and patterns between the tuning parameters and the model metrics are easily understood. The factorial nature of these designs allows for examination of each parameter separately with little confounding between parameters.
### Irregular grids {-}
There are several options for creating non-regular grids. The first is to use random sampling across the range of parameters. The `grid_random()` function generates independent uniform random numbers across the parameter ranges. If the parameter object has an associated transformation (such as we have for `penalty`), the random numbers are generated on the transformed scale. Let's create a random grid for the parameters from our example neural network:
```{r grid-rand}
set.seed(1301)
mlp_param %>%
grid_random(size = 1000) %>% # 'size' is the number of combinations
summary()
```
For `penalty`, the random numbers are uniform on the log (base-10) scale but the values in the grid are in the natural units.
The issue with random grids is that, with small-to-medium grids, random values can result in overlapping parameter combinations. Also, the random grid needs to cover the whole parameter space, but the likelihood of good coverage increases with the number of grid values. Even for a sample of 15 candidate points, Figure \@ref(fig:random-grid) shows some overlap between points for our example multilayer perceptron.
```{r grid-random-matrix, eval = FALSE}
library(ggforce)
set.seed(1302)
mlp_param %>%
# The 'original = FALSE' option keeps penalty in log10 units
grid_random(size = 20, original = FALSE) %>%
ggplot(aes(x = .panel_x, y = .panel_y)) +
geom_point() +
geom_blank() +
facet_matrix(vars(hidden_units, penalty, epochs), layer.diag = 2) +
labs(title = "Random design with 20 candidates")
```
```{r random-grid, ref.label = "grid-random-matrix"}
#| fig.height = 6,
#| fig.width = 6,
#| echo = FALSE,
#| fig.cap = "Three tuning parameters with 15 points generated at random",
#| fig.alt = "A scatter plot matrix for three tuning parameters with 20 points generated at random. There are significant gaps in the parameter space."
```
A much better approach is to use a set of experimental designs called _space-filling designs_. While different design methods have slightly different goals, they generally find a configuration of points that cover the parameter space with the smallest chance of overlapping or redundant values. Examples of such designs are Latin hypercubes [@lhd], maximum entropy designs [@maxent], maximum projection designs [@maxproj], and others. See @santner2003design for an overview.
The `r pkg(dials)` package contains functions for Latin hypercube and maximum entropy designs. As with `grid_random()`, the primary inputs are the number of parameter combinations and a parameter object. Let's compare a random design with a Latin hypercube design for 20 candidate parameter values in Figure \@ref(fig:space-filling-design).
```{r grid-sfd-compare, eval = FALSE}
set.seed(1303)
mlp_param %>%
grid_latin_hypercube(size = 20, original = FALSE) %>%
ggplot(aes(x = .panel_x, y = .panel_y)) +
geom_point() +
geom_blank() +
facet_matrix(vars(hidden_units, penalty, epochs), layer.diag = 2) +
labs(title = "Latin Hypercube design with 20 candidates")
```
```{r space-filling-design, ref.label = "grid-sfd-compare"}
#| fig.height = 6,
#| fig.width = 6,
#| echo = FALSE,
#| fig.cap = "Three tuning parameters with 20 points generated using a space-filling design",
#| fig.alt = "A scatter plot matrix for three tuning parameters with 15 points generated using a space-filling design. There are fewer gaps in the parameter space when compared to the random grid."
```
While not perfect, this Latin hypercube design spaces the points farther away from one another and allows a better exploration of the hyperparameter space.
Space-filling designs can be very effective at representing the parameter space. The default design used by the `r pkg(tune)` package is the maximum entropy design. These tend to produce grids that cover the candidate space well and drastically increase the chances of finding good results.
## Evaluating the Grid {#evaluating-grid}
To choose the best tuning parameter combination, each candidate set is assessed using data that were not used to train that model. Resampling methods or a single validation set work well for this purpose. The process (and syntax) closely resembles the approach in Section \@ref(resampling-performance) that used the `fit_resamples()` function from the `r pkg(tune)` package.
After resampling, the user selects the most appropriate candidate parameter set. It might make sense to choose the empirically best parameter combination or bias the choice towards other aspects of the model fit, such as simplicity.
We use a classification data set to demonstrate model tuning in this and the next chapter. The data come from @Hill, who developed an automated microscopy laboratory tool for cancer research. The data consists of 56 imaging measurements on 2019 human breast cancer cells. These predictors represent shape and intensity characteristics of different parts of the cells (e.g., the nucleus, the cell boundary, etc.). There is a high degree of correlation between the predictors. For example, there are several different predictors that measure the size and shape of the nucleus and cell boundary. Also, individually, many predictors have skewed distributions.
Each cell belongs to one of two classes. Since this is part of an automated lab test, the focus was on prediction capability rather than inference.
The data are included in the `r pkg(modeldata)` package. Let's remove one column not needed for analysis (`case`):
```{r grid-cells}
library(tidymodels)
data(cells)
cells <- cells %>% select(-case)
```
Given the dimensions of the data, we can compute performance metrics using 10-fold cross-validation:
```{r grid-cells-folds}
set.seed(1304)
cell_folds <- vfold_cv(cells)
```
Because of the high degree of correlation between predictors, it makes sense to use PCA feature extraction to decorrelate the predictors. The following recipe contains steps to transform the predictors to increase symmetry, normalize them to be on the same scale, then conduct feature extraction. The number of PCA components to retain is also tuned, along with the model parameters.
:::rmdwarning
While the resulting PCA components are technically on the same scale, the lower-rank components tend to have a wider range than the higher-rank components. For this reason, we normalize again to coerce the predictors to have the same mean and variance.
:::
Many of the predictors have skewed distributions. Since PCA is variance based, extreme values can have a detrimental effect on these calculations. To counter this, let's add a recipe step estimating a Yeo-Johnson transformation for each predictor [@yeo2000new]. While originally intended as a transformation of the outcome, it can also be used to estimate transformations that encourage more symmetric distributions. This step `step_YeoJohnson()` occurs in the recipe just prior to the initial normalization via `step_normalize()`. Then, let's combine this feature engineering recipe with our neural network model specification `mlp_spec`.
```{r grid-cells-objects}
mlp_rec <-
recipe(class ~ ., data = cells) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
step_pca(all_numeric_predictors(), num_comp = tune()) %>%
step_normalize(all_numeric_predictors())
mlp_wflow <-
workflow() %>%
add_model(mlp_spec) %>%
add_recipe(mlp_rec)
```
Let's create a parameter object `mlp_param` to adjust a few of the default ranges. We can change the number of epochs to have a smaller range (50 to 200 epochs). Also, the default range for `num_comp()` defaults to a very narrow range (one to four components); we can increase the range to 40 components and set the minimum value to zero:
```{r grid-cells-workflow}
mlp_param <-
mlp_wflow %>%
extract_parameter_set_dials() %>%
update(
epochs = epochs(c(50, 200)),
num_comp = num_comp(c(0, 40))
)
```
:::rmdnote
In `step_pca()`, using zero PCA components is a shortcut to skip the feature extraction. In this way, the original predictors can be directly compared to the results that include PCA components.
:::
The `tune_grid()` function is the primary function for conducting grid search. Its functionality is very similar to `fit_resamples()` from Section \@ref(resampling-performance), although it has additional arguments related to the grid:
* `grid`: An integer or data frame. When an integer is used, the function creates a space-filling design with `grid` number of candidate parameter combinations. If specific parameter combinations exist, the `grid` parameter is used to pass them to the function.
* `param_info`: An optional argument for defining the parameter ranges. The argument is most useful when `grid` is an integer.
Otherwise, the interface to `tune_grid()` is the same as `fit_resamples()`. The first argument is either a model specification or workflow. When a model is given, the second argument can be either a recipe or formula. The other required argument is an `r pkg(rsample)` resampling object (such as `cell_folds`). The following call also passes a metric set so that the area under the ROC curve is measured during resampling.
To start, let's evaluate a regular grid with three levels across the resamples:
```{r grid-cells-regular}
roc_res <- metric_set(roc_auc)
set.seed(1305)
mlp_reg_tune <-
mlp_wflow %>%
tune_grid(
cell_folds,
grid = mlp_param %>% grid_regular(levels = 3),
metrics = roc_res
)
mlp_reg_tune
```
There are high-level convenience functions we can use to understand the results. First, the `autoplot()` method for regular grids shows the performance profiles across tuning parameters in Figure \@ref(fig:regular-grid-plot).
```{r grid-cells-reg-plot, eval = FALSE}
autoplot(mlp_reg_tune) +
scale_color_viridis_d(direction = -1) +
theme(legend.position = "top")
```
```{r regular-grid-plot, ref.label = "grid-cells-reg-plot"}
#| fig.height = 7,
#| echo = FALSE,
#| fig.cap = "The regular grid results",
#| fig.alt = "A line plot of the regular grid results. The x axis shows the number of hidden units and the y axis is the resampled ROC AUC. There are separate lines for the amount of regularization. There are nine panels for three values for the number of PCA components and the number of epochs. On average, the amount of regularization is important where more is better. Also, on average, the increasing the number of hidden units decreases model effectiveness."
```
For these data, the amount of penalization has the largest impact on the area under the ROC curve. The number of epochs doesn't appear to have a pronounced effect on performance. The change in the number of hidden units appears to matter most when the amount of regularization is low (and harms performance). There are several parameter configurations that have roughly equivalent performance, as seen using the function `show_best()`:
```{r grid-cells-reg-best}
show_best(mlp_reg_tune) %>% select(-.estimator)
```
Based on these results, it would make sense to conduct another run of grid search with larger values of the weight decay penalty.
To use a space-filling design, either the `grid` argument can be given an integer or one of the `grid_*()` functions can produce a data frame. To evaluate the same range using a maximum entropy design with 20 candidate values:
```{r grid-cells-sdf}
set.seed(1306)
mlp_sfd_tune <-
mlp_wflow %>%
tune_grid(
cell_folds,
grid = 20,
# Pass in the parameter object to use the appropriate range:
param_info = mlp_param,
metrics = roc_res
)
mlp_sfd_tune
```
The `autoplot()` method will also work with these designs, although the format of the results will be different. Figure \@ref(fig:sfd-plot) was produced using `autoplot(mlp_sfd_tune)`.
```{r sfd-plot}
#| echo = FALSE,
#| fig.cap = "The `autoplot()` method results when used with a space-filling design",
#| fig.alt = "The `autoplot()` method results when used with a space-filling design. The trends show decreasing performance with the number of PCA components as well as the number of hidden units."
autoplot(mlp_sfd_tune)
```
This marginal effects plot (Figure \@ref(fig:sfd-plot)) shows the relationship of each parameter with the performance metric.
:::rmdwarning
Take care when examining this plot; since a regular grid is not used, the values of the other tuning parameters can affect each panel.
:::
The penalty parameter appears to result in better performance with smaller amounts of weight decay. This is the opposite of the results from the regular grid. Since each point in each panel is shared with the other three tuning parameters, the trends in one panel can be affected by the others. Using a regular grid, each point in each panel is equally averaged over the other parameters. For this reason, the effect of each parameter is better isolated with regular grids.
As with the regular grid, `show_best()` can report on the numerically best results:
```{r grid-cells-sdf-best}
show_best(mlp_sfd_tune) %>% select(-.estimator)
```
Generally, it is a good idea to evaluate the models over multiple metrics so that different aspects of the model fit are taken into account. Also, it often makes sense to choose a slightly suboptimal parameter combination that is associated with a simpler model. For this model, simplicity corresponds to larger penalty values and/or fewer hidden units.
As with the results from `fit_resamples()`, there is usually no value in retaining the intermediary model fits across the resamples and tuning parameters. However, as before, the `extract` option to `control_grid()` allows the retention of the fitted models and/or recipes. Also, setting the `save_pred` option to `TRUE` retains the assessment set predictions and these can be accessed using `collect_predictions()`.
## Finalizing the Model
If one of the sets of possible model parameters found via `show_best()` were an attractive final option for these data, we might wish to evaluate how well it does on the test set. However, the results of `tune_grid()` only provide the substrate to choose appropriate tuning parameters. The function _does not fit_ a final model.
To fit a final model, a final set of parameter values must be determined. There are two methods to do so:
- manually pick values that appear appropriate or
- use a `select_*()` function.
For example, `select_best()` will choose the parameters with the numerically best results. Let's go back to our regular grid results and see which one is best:
```{r grid-select}
select_best(mlp_reg_tune, metric = "roc_auc")
```
Looking back at Figure \@ref(fig:regular-grid-plot), we can see that a model with a single hidden unit trained for 125 epochs on the original predictors with a large amount of penalization has performance competitive with this option, and is simpler. This is basically penalized logistic regression! To manually specify these parameters, we can create a tibble with these values and then use a _finalization_ function to splice the values back into the workflow:
```{r grid-finalize-manual}
logistic_param <-
tibble(
num_comp = 0,
epochs = 125,
hidden_units = 1,
penalty = 1
)
final_mlp_wflow <-
mlp_wflow %>%
finalize_workflow(logistic_param)
final_mlp_wflow
```
No more values of `tune()` are included in this finalized workflow. Now the model can be fit to the entire training set:
```{r grid-final-model}
final_mlp_fit <-
final_mlp_wflow %>%
fit(cells)
```
This object can now be used to make future predictions on new data.
If you did not use a workflow, finalization of a model and/or recipe is done using `finalize_model()` and `finalize_recipe()`.
## Tools for Creating Tuning Specifications {#tuning-usemodels}
The `r pkg(usemodels)` package can take a data frame and model formula, then write out R code for tuning the model. The code also creates an appropriate recipe whose steps depend on the requested model as well as the predictor data.
For example, for the Ames housing data, `xgboost` modeling code could be created with:
```{r models-use-models-code, eval = FALSE}
library(usemodels)
use_xgboost(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude,
data = ames_train,
# Add comments explaining some of the code:
verbose = TRUE)
```
The resulting code is as follows:
```{r models-use-models-res, eval = FALSE}
xgboost_recipe <-
recipe(formula = Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train) %>%
step_novel(all_nominal_predictors()) %>%
## This model requires the predictors to be numeric. The most common
## method to convert qualitative predictors to numeric is to create
## binary indicator variables (aka dummy variables) from these
## predictors. However, for this model, binary indicator variables can be
## made for each of the levels of the factors (known as 'one-hot
## encoding').
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
step_zv(all_predictors())
xgboost_spec <-
boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(),
loss_reduction = tune(), sample_size = tune()) %>%
set_mode("regression") %>%
set_engine("xgboost")
xgboost_workflow <-
workflow() %>%
add_recipe(xgboost_recipe) %>%
add_model(xgboost_spec)
set.seed(69305)
xgboost_tune <-
tune_grid(xgboost_workflow,
resamples = stop("add your rsample object"),
grid = stop("add number of candidate points"))
```
Based on what `r pkg(usemodels)` understands about the data, this code is the minimal preprocessing required. For other models, operations like `step_normalize()` are added to fulfill the basic needs of the model. Notice that it is our responsibility, as the modeling practitioner, to choose what `resamples` to use for tuning, as well as what kind of `grid`.
:::rmdnote
The `r pkg(usemodels)` package can also be used to create model fitting code with no tuning by setting the argument `tune = FALSE`.
:::
## Tools for Efficient Grid Search {#efficient-grids}
It is possible to make grid search more computationally efficient by applying a few different tricks and optimizations. This section describes several techniques.
### Submodel optimization {#submodel-trick}
There are types of models where, from a single model fit, multiple tuning parameters can be evaluated without refitting.
For example, partial least squares (PLS) is a supervised version of principal component analysis [@Geladi:1986]. It creates components that maximize the variation in the predictors (like PCA) but simultaneously tries to maximize the correlation between these predictors and the outcome. We'll explore PLS more in Chapter \@ref(dimensionality). One tuning parameter is the number of PLS components to retain. Suppose that a data set with 100 predictors is fit using PLS. The number of possible components to retain can range from one to fifty. However, in many implementations, a single model fit can compute predicted values across many values of `num_comp`. As a result, a PLS model created with 100 components can also make predictions for any `num_comp <= 100`. This saves time since, instead of creating redundant model fits, a single fit can be used to evaluate many submodels.
While not all models can exploit this feature, many broadly used ones do.
* Boosting models can typically make predictions across multiple values for the number of boosting iterations.
* Regularization methods, such as the `r pkg(glmnet)` model, can make simultaneous predictions across the amount of regularization used to fit the model.
* Multivariate adaptive regression splines (MARS) adds a set of nonlinear features to linear regression models [@Friedman:1991p109]. The number of terms to retain is a tuning parameter, and it is computationally fast to make predictions across many values of this parameter from a single model fit.
The `r pkg(tune)` package automatically applies this type of optimization whenever an applicable model is tuned.
For example, if a boosted C5.0 classification model [@apm] was fit to the cell data, we can tune the number of boosting iterations (`trees`). With all other parameters set at their default values, we can evaluate iterations from 1 to 100 on the same resamples as used previously:
```{r grid-c5, eval = FALSE}
c5_spec <-
boost_tree(trees = tune()) %>%
set_engine("C5.0") %>%
set_mode("classification")
set.seed(1307)
c5_spec %>%
tune_grid(
class ~ .,
resamples = cell_folds,
grid = data.frame(trees = 1:100),
metrics = roc_res
)
```
Without the submodel optimization, the call to `tune_grid()` used `r round(3734.249/60, 1)` minutes to resample 100 submodels. With the optimization, the same call took 100 _seconds_ (a `r round(3734.249/100.147, 0)`-fold speed-up). The reduced time is the difference in `tune_grid()` fitting 1000 models versus 10 models.
:::rmdnote
Even though we fit the model with and without the submodel prediction trick, this optimization is automatically applied by `r pkg(parsnip)`.
:::
### Parallel processing
As previously mentioned in Section \@ref(parallel), parallel processing is an effective method for decreasing execution time when resampling models. This advantage conveys to model tuning via grid search, although there are additional considerations.
Let's consider two different parallel processing schemes.
When tuning models via grid search, there are two distinct loops: one over resamples and another over the unique tuning parameter combinations. In pseudocode, this process would look like:
```{r grid-resamples-algo, eval = FALSE}
for (rs in resamples) {
# Create analysis and assessment sets
# Preprocess data (e.g. formula or recipe)
for (mod in configurations) {
# Fit model {mod} to the {rs} analysis set
# Predict the {rs} assessment set
}
}
```
By default, the `r pkg(tune)` package parallelizes only over resamples (the outer loop), as opposed to both the outer and inner loops.
This is the optimal scenario when the preprocessing method is expensive. However, there are two potential downsides to this approach:
* It limits the achievable speed-ups when the preprocessing is not expensive.
* The number of parallel workers is limited by the number of resamples. For example, with 10-fold cross-validation you can use only 10 parallel workers even when the computer has more than 10 cores.
To illustrate how the parallel processing works, we'll use a case where there are 7 model tuning parameter values, with 5-fold cross-validation. Figure \@ref(fig:one-resample-per-worker) shows how the tasks are allocated to the worker processes.
```{r one-resample-per-worker}
#| echo = FALSE,
#| message = FALSE,
#| fig.width = 5,
#| fig.height = 4,
#| out.width = "50%",
#| fig.cap = "Worker processes when parallel processing matches resamples to a specific worker process",
#| fig.alt = "A diagram of the worker processes when parallel processing matches resamples to a specific worker process. After the preprocess operations are finished, each model fit is executed on the same worker process."
load("extras/parallel_times/resamples_times.RData")
resamples_times %>%
dplyr::rename(operation = label) %>%
ggplot(aes(y = id_alt, x = duration, fill = operation)) +
geom_bar(stat = "identity", color = "black") +
labs(y = NULL, x = "Elapsed Time") +
scale_fill_brewer(palette = "Paired") +
theme(legend.position = "top")
```
Note that each fold is assigned to its own worker process and, since only model parameters are being tuned, the preprocessing is conducted once per fold/worker. If fewer than five worker processes were used, some workers would receive multiple folds.
In the control functions for the `tune_*()` functions, the argument `parallel_over` controls how the process is executed. To use the previous parallelization strategy, the argument is `parallel_over = "resamples"`.
Instead of parallel processing the resamples, an alternate scheme combines the loops over resamples and models into a single loop. In pseudocode, this process would look like:
```{r grid-everything-algo, eval = FALSE}
all_tasks <- crossing(resamples, configurations)
for (iter in all_tasks) {
# Create analysis and assessment sets for {iter}
# Preprocess data (e.g. formula or recipe)
# Fit model {iter} to the {iter} analysis set
# Predict the {iter} assessment set
}
```
In this case, parallelization now occurs over the single loop. For example, if we use 5-fold cross-validation with $M$ tuning parameter values, the loop is executed over $5\times M$ iterations. This increases the number of potential workers that can be used. However, the work related to data preprocessing is repeated multiple times. If those steps are expensive, this approach will be inefficient.
In tidymodels, validation sets are treated as a single resample. In these cases, this parallelization scheme would be best.
Figure \@ref(fig:distributed-tasks) illustrates the delegation of tasks to the workers in this scheme; the same example is used but with 10 workers.
```{r distributed-tasks, echo = FALSE, message = FALSE, fig.height = 7}
#| echo = FALSE,
#| message = FALSE,
#| fig.height = 7,
#| out.width = "70%",
#| fig.cap = "Worker processes when preprocessing and modeling tasks are distributed to many workers",
#| fig.alt = "A diagram of the worker processes when preprocessing and modeling tasks are distributed to many workers. In this instance, more comprehensive parallelization is used but some preprocessing tasks are repeated across worker processes."
load("extras/parallel_times/everything_times.RData")
repeats <-
everything_times %>%
dplyr::filter(id == "Fold1" & event == "preproc") %>%
nrow()
everything_times <-
everything_times %>%
dplyr::rename(operation = label) %>%
mutate(operation = as.character(operation)) %>%
arrange(pid, id, operation) %>%
select(pid, id, operation, duration)
start_stop <-
everything_times %>%
pivot_wider(
id_cols = c(pid, id),
names_from = "operation",
values_from = "duration"
) %>%
group_by(pid) %>%
mutate(
total = model + preprocess,
.stop_mod = cumsum(total),
prev = dplyr::lag(total, 1),
prev = ifelse(is.na(prev), 0, prev),
.start_pre = cumsum(prev),
.stop_pre = .start_pre + preprocess,
.start_mod = .stop_pre
) %>%
ungroup() %>%
select(pid, id, .start_pre, .stop_pre, .start_mod, .stop_mod)
starts <-
start_stop %>%
select(pid, id, contains("start")) %>%
pivot_longer(
cols = c(.start_pre, .start_mod),
values_to = ".start"
) %>%
mutate(
operation = ifelse(grepl("mod$", name), "model", "preprocess")
) %>%
select(-name)
stops <-
start_stop %>%
select(pid, id, contains("stop")) %>%
pivot_longer(
cols = c(.stop_pre, .stop_mod),
values_to = ".stop"
) %>%
mutate(
operation = ifelse(grepl("mod$", name), "model", "preprocess")
) %>%
select(-name)
id_offset <- 0.4
start_stop_dat <-
full_join(starts, stops, by = c("pid", "id", "operation")) %>%
mutate(
id_num = as.numeric(factor(id)),
id_start = id_num - id_offset,
id_stop = id_num + id_offset
)
start_stop_dat %>%
ggplot(aes(y = id_num, x = .start)) +
geom_rect(
aes(
xmin = .start,
xmax = .stop,
ymin = id_start,
ymax = id_stop,
fill = operation
),
color = "black"
) +
facet_wrap(~ pid, nrow = 2) +
labs(y = NULL, x = "Elapsed Time") +
scale_fill_brewer(palette = "Paired") +
scale_y_continuous(breaks = 1:5, labels = paste("Fold", 1:5)) +
theme_bw() +
theme(legend.position = "top", panel.grid.minor = element_blank())
```
Here, each worker process handles multiple folds, and the preprocessing is needlessly repeated. For example, for the first fold, the preprocessing was computed `r xfun::numbers_to_words(repeats)` times instead of once.
For this scheme, the control function argument is `parallel_over = "everything"`.
### Benchmarking boosted trees
To compare different possible parallelization schemes, we tuned a boosted tree with the `r pkg(xgboost)` engine using a data set of 4,000 samples, with 5-fold cross-validation and 10 candidate models. These data required some baseline preprocessing that did not require any estimation. The preprocessing was handled three different ways:
1. Preprocess the data prior to modeling using a `r pkg(dplyr)` pipeline (labeled as "none" in the later plots).
2. Conduct the same preprocessing via a recipe (shown as "light" preprocessing).
3. With a recipe, add an additional step that has a high computational cost (labeled as "expensive").
The first and second preprocessing options are designed for comparison, to measure the computational cost of the recipe in the second option. The third option measures the cost of performing redundant computations with `parallel_over = "everything"`.
We evaluated this process using variable numbers of worker processes and using the two `parallel_over` options, on a computer with 10 physical cores and 20 virtual cores (via hyper-threading).
First, let's consider the raw execution times in Figure \@ref(fig:parallel-times).
```{r parallel-times}
#| echo = FALSE,
#| message = FALSE,
#| fig.height =4,
#| out.width = "70%",
#| fig.cap = "Execution times for model tuning versus the number of workers using different delegation schemes",
#| fig.alt = "Execution times for model tuning versus the number of workers using different delegation schemes."
load("extras/parallel_times/xgb_times.RData")
ggplot(times, aes(x = num_cores, y = elapsed, color = parallel_over, shape = parallel_over)) +
geom_point(size = 2) +
geom_line() +
facet_wrap(~ preprocessing) +
labs(x = "Number of Workers", y = "Execution Time (s)") +
scale_y_log10() +
scale_color_manual(values = c("#7FC97F", "#386CB0")) +
theme_bw() +
theme(legend.position = "top")
```
Since there were only five resamples, the number of cores used when `parallel_over = "resamples"` is limited to five.
Comparing the curves in the first two panels for "none" and "light":
* There is little difference in the execution times between the panels. This indicates, for these data, there is no real computational penalty for doing the preprocessing steps in a recipe.
* There is some benefit for using `parallel_over = "everything"` with many cores. However, as shown in the figure, the majority of the benefit of parallel processing occurs in the first five workers.
With the expensive preprocessing step, there is a considerable difference in execution times. Using `parallel_over = "everything"` is problematic since, even using all cores, it never achieves the execution time that `parallel_over = "resamples"` attains with just five cores. This is because the costly preprocessing step is unnecessarily repeated in the computational scheme.
We can also view these data in terms of speed-ups in Figure \@ref(fig:parallel-speedups).
```{r parallel-speedups}
#| echo = FALSE,
#| message = FALSE,
#| fig.height = 4,
#| out.width = "70%",
#| fig.cap = "Speed-ups for model tuning versus the number of workers using different delegation schemes. The diagonal black line indicates a linear speedup where the addition of a new worker process has maximal effect.",
#| fig.alt = "Speed-ups for model tuning versus the number of workers using different delegation schemes. The diagonal black line indicates a linear speedup where the addition of a new worker process has maximal effect. The 'everything' scheme shows that the benefits decrease after three or four workers, especially when there is expensive preprocessing. The 'resamples' scheme has almost linear speedups across all tasks."
ggplot(times, aes(x = num_cores, y = speed_up, color = parallel_over, shape = parallel_over)) +
geom_abline(lty = 1) +
geom_point(size = 2) +
geom_line() +
facet_wrap(~ preprocessing) +
coord_obs_pred() +
scale_color_manual(values = c("#7FC97F", "#386CB0")) +
labs(x = "Number of Workers", y = "Speed-up") +
theme(legend.position = "top")
```
The best speed-ups, for these data, occur when `parallel_over = "resamples"` and when the computations are expensive. However, in the latter case, remember that the previous analysis indicates that the overall model fits are slower.
What is the benefit of using the submodel optimization method in conjunction with parallel processing? The C5.0 classification model shown in Section \@ref(submodel-trick) was also run in parallel with ten workers. The parallel computations took 13.3 seconds for a `r round(100.147/13.265, 1)`-fold speed-up (both runs used the submodel optimization trick). Between the submodel optimization trick and parallel processing, there was a total `r round(3734.249/13.265, 0)`-fold speed-up over the most basic grid search code.
:::rmdwarning
Overall, note that the increased computational savings will vary from model to model and are also affected by the size of the grid, the number of resamples, etc. A very computationally efficient model may not benefit as much from parallel processing.
:::
### Access to global variables
When using tidymodels, it is possible to use values in your local environment (usually the global environment) in model objects.
:::rmdnote
What do we mean by "environment" here? Think of an environment in R as a place to store variables that you can work with. See the "Environments" chapter of @wickham2019advanced to learn more.
:::
If we define a variable to use as a model parameter and then pass it to a function like `linear_reg()`, the variable is typically defined in the global environment.
```{r grid-spec-global}
coef_penalty <- 0.1
spec <- linear_reg(penalty = coef_penalty) %>% set_engine("glmnet")
spec
```
Models created with the parsnip package save arguments like these as _quosures_; these are objects that track both the name of the object as well as the environment where it lives:
```{r grid-spec-quo}
spec$args$penalty
```
Notice that we have `env: global` because this variable was created in the global environment. The model specification defined by `spec` works correctly when run in a user's regular session because that session is also using the global environment; R can easily find the object `coef_penalty`.
:::rmdwarning
When such a model is evaluated with parallel workers, it may fail. Depending on the particular technology that is used for parallel processing, the workers may not have access to the global environment.
:::
When writing code that will be run in parallel, it is a good idea to insert the actual data into the objects rather than the reference to the object. The `r pkg(rlang)` and `r pkg(dplyr)` packages can be very helpful for this. For example, the `!!` operator can splice a single value into an object:
```{r grid-spec-bang-bang}
spec <- linear_reg(penalty = !!coef_penalty) %>% set_engine("glmnet")
spec$args$penalty
```
Now the output is `^0.1`, indicating that the value is there instead of the reference to the object. When you have multiple external values to insert into an object, the `!!!` operator can help:
```{r grid-spec-bang-bang-bang}
mcmc_args <- list(chains = 3, iter = 1000, cores = 3)
linear_reg() %>% set_engine("stan", !!!mcmc_args)
```
Recipe selectors are another place where you might want access to global variables. Suppose you have a recipe step that should use all of the predictors in the cell data that were measured using the second optical channel. We can create a vector of these column names:
```{r grid-ch-2}
library(stringr)
ch_2_vars <- str_subset(names(cells), "ch_2")
ch_2_vars
```
We could hard-code these into a recipe step but it would be better to reference them programmatically in case the data change. Two ways to do this are:
```{r grid-ch-2-rec}
# Still uses a reference to global data (~_~;)
recipe(class ~ ., data = cells) %>%
step_spatialsign(all_of(ch_2_vars))
# Inserts the values into the step ヽ(•‿•)ノ
recipe(class ~ ., data = cells) %>%
step_spatialsign(!!!ch_2_vars)
```
The latter is better for parallel processing because all of the needed information is embedded in the recipe object.
### Racing methods {#racing}
One issue with grid search is that all models need to be fit across all resamples before any tuning parameters can be evaluated. It would be helpful if instead, at some point during tuning, an interim analysis could be conducted to eliminate any truly awful parameter candidates. This would be akin to _futility analysis_ in clinical trials. If a new drug is performing excessively poorly (or well), it is potentially unethical to wait until the trial finishes to make a decision.
In machine learning, the set of techniques called _racing methods_ provide a similar function [@maron1994hoeffding]. Here, the tuning process evaluates all models on an initial subset of resamples. Based on their current performance metrics, some parameter sets are not considered in subsequent resamples.
```{r grid-cells-race, include = FALSE}
set.seed(1308)
mlp_sfd_race <-
mlp_wflow %>%
tune_race_anova(
cell_folds,
grid = 20,
# Pass in the parameter object to use the appropriate range:
param_info = mlp_param,
metrics = roc_res
)
remaining <-
mlp_sfd_race %>%
collect_metrics() %>%
dplyr::filter(n == 10)
```
As an example, in the multilayer perceptron tuning process with a regular grid explored in this chapter, what would the results look like after only the first three folds? Using techniques similar to those shown in Chapter \@ref(compare), we can fit a model where the outcome is the resampled area under the ROC curve and the predictor is an indicator for the parameter combination. The model takes the resample-to-resample effect into account and produces point and interval estimates for each parameter setting. The results of the model are one-sided 95% confidence intervals that measure the loss of the ROC value relative to the currently best performing parameters, as shown in Figure \@ref(fig:racing-process).
```{r racing-process}
#| echo = FALSE,
#| message = FALSE,
#| warning = FALSE,
#| fig.height = 5,
#| out.width = "80%",
#| fig.cap = "The racing process for 20 tuning parameters and 10 resamples",
#| fig.alt = "An illustration of the racing process for 20 tuning parameters and 10 resamples. The analysis is conducted at the first, third, and last resample. As the number of resamples increases, the confidence intervals show some model configurations that do not have confidence intervals that overlap with zero. These are excluded from subsequent resamples."
full_att <- attributes(mlp_sfd_race)
race_details <- NULL
for(iter in 1:10) {
tmp <- mlp_sfd_race %>% filter(.order <= iter)
tmp_att <- full_att
tmp_att$row.names <- attr(tmp, "row.names")
attributes(tmp) <- tmp_att
if (nrow(show_best(tmp)) == 1) {
break()
}
race_details <-
bind_rows(
race_details,
finetune:::test_parameters_gls(tmp) %>% mutate(iter = iter))
}
race_details <-
race_details %>%
mutate(
lower = ifelse(iter < 3, NA, lower),
upper = ifelse(iter < 3, NA, upper),
pass = ifelse(iter < 3, TRUE, pass),
decision = ifelse(pass, "retain", "discard"),
decision = ifelse(pass & estimate == 0, "best", decision)
) %>%
mutate(
.config = factor(.config),
.config = reorder(.config, estimate),
decision = factor(decision, levels = c("best", "retain", "discard"))
)
race_cols <- c(best = "blue", retain = "black", discard = "grey")
iter_three <- race_details %>% dplyr::filter(iter == 3)
iter_three %>%
ggplot(aes(x = -estimate, y = .config)) +
geom_vline(xintercept = 0, lty = 2, color = "green") +
geom_point(size = 2, aes(color = decision)) +
geom_errorbarh(aes(xmin = -estimate, xmax = -upper, color = decision), height = .3, show.legend = FALSE) +
labs(x = "Loss of ROC AUC", y = NULL) +
scale_colour_manual(values = race_cols)
```
Any parameter set whose confidence interval includes zero would lack evidence that its performance is statistically different from the best results. We retain `r sum(iter_three$upper >= 0)` settings; these are resampled more. The remaining `r sum(iter_three$upper < 0)` submodels are no longer considered.
```{r grid-mlp-racing-anim, include = FALSE, dev = 'png'}
race_ci_plots <- function(x, iters = max(x$iter)) {
x_rng <- extendrange(c(-x$estimate, -x$upper))
for (i in 1:iters) {
if (i < 3) {
ttl <- paste0("Iteration ", i, ": burn-in")
} else {
ttl <- paste0("Iteration ", i, ": testing")
}
p <-
x %>%
dplyr::filter(iter == i) %>%
ggplot(aes(x = -estimate, y = .config, color = decision)) +
geom_vline(xintercept = 0, color = "green", lty = 2) +
geom_point(size = 2) +
labs(title = ttl, y = "", x = "Loss of ROC AUC") +
scale_color_manual(values = c(best = "blue", retain = "black", discard = "grey"),
drop = FALSE) +
scale_y_discrete(drop = FALSE) +
xlim(x_rng) +
theme_bw() +
theme(legend.position = "top")
if (i >= 3) {
p <- p + geom_errorbar(aes(xmin = -estimate, xmax = -upper), width = .3)
}
print(p)
}
invisible(NULL)
}
av_capture_graphics(
race_ci_plots(race_details),
output = "race_results.mp4",
width = 720,
height = 720,
res = 120,
framerate = 1/3
)
```
<video width="720" height="720" controls>
<source src="race_results.mp4" type="video/mp4">
</video>
The process continues for each resample; after the next set of performance metrics, a new model is fit to these statistics, and more submodels are potentially discarded.^[See @kuhn2014futility for more details on the computational aspects of this approach.]
:::rmdwarning
Racing methods can be more efficient than basic grid search as long as the interim analysis is fast and some parameter settings have poor performance. It also is most helpful when the model does _not_ have the ability to exploit submodel predictions.
:::
The `r pkg(finetune)` package contains functions for racing. The `tune_race_anova()` function conducts an ANOVA model to test for statistical significance of the different model configurations. The syntax to reproduce the filtering shown previously is:
```{r grid-cells-race-code, eval = FALSE}
library(finetune)
set.seed(1308)
mlp_sfd_race <-
mlp_wflow %>%
tune_race_anova(
cell_folds,
grid = 20,
param_info = mlp_param,
metrics = roc_res,
control = control_race(verbose_elim = TRUE)
)
```
The arguments mirror those of `tune_grid()`. The function `control_race()` has options for the elimination procedure.
As shown in the animation above, there were `r xfun::numbers_to_words(nrow(remaining))` tuning parameter combinations under consideration once the full set of resamples were evaluated. `show_best()` returns the best models (ranked by performance) but returns only the configurations that were never eliminated:
```{r grid-race-best}
show_best(mlp_sfd_race, n = 10)
```
There are other interim analysis techniques for discarding settings. For example, @krueger15a use traditional sequential analysis methods whereas @kuhn2014futility treats the data as a sports competition and uses the Bradley-Terry model [@bradley1952rank] to measure the winning ability of parameter settings.
## Chapter Summary {#grid-summary}
This chapter discussed the two main classes of grid search (regular and non-regular) that can be used for model tuning and demonstrated how to construct these grids, either manually or using the family of `grid_*()` functions. The `tune_grid()` function can evaluate these candidate sets of model parameters using resampling. The chapter also showed how to finalize a model, recipe, or workflow to update the parameter values for the final fit. Grid search can be computationally expensive, but thoughtful choices in the experimental design of such searches can make them tractable.
The data analysis code that will be reused in the next chapter is:
```{r resampling-summary, eval = FALSE}
library(tidymodels)
data(cells)
cells <- cells %>% select(-case)
set.seed(1304)
cell_folds <- vfold_cv(cells)
roc_res <- metric_set(roc_auc)
```