-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy path04-formulation.Rmd
63 lines (48 loc) · 4.04 KB
/
04-formulation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
# Model formulation
The aim of supervised ML is to find a model $\hat{Y} = f(X)$ so that $\hat{Y}$ agrees well with observations $Y$. We typically start with a research question where $Y$ is given - naturally - by the problem we are addressing and we have a data set at hand where one or multiple predictors (or features) $X$ are recorded along with $Y$. From our data, we have information about how GPP (ecosystem-level photosynthesis) depends on set of abiotic factors, mostly meteorological measurements.
## Formula notation
In R, it is common to use the *formula* notation to specify the target and predictor variables. You have probably encountered formulas before, e.g., for a linear regression using the `lm()` function. To specify a linear regression model for `GPP_NT_VUT_REF` with three predictors `SW_F_IN`, `VPD_F`, and `TA_F`, we write:
```{r eval=F}
lm(GPP_NT_VUT_REF ~ SW_F_IN + VPD_F + TA_F, data = ddf)
```
## The generic `train()`
Actually, the way we formulate a model is independent of the algorithm, or *engine* that takes care of fitting $f(X)$. As mentioned in Chapter \@ref(preprocessing) the R package [**caret**](https://topepo.github.io/caret/) provides a unified interface for using different ML algorithms implemented in separate packages. In other words, it acts as a *wrapper* for multiple different model fitting, or ML algorithms. This has the advantage that it unifies the interface (the way arguments are provided). caret also provides implementations for a set of commonly used tools for data processing, model training, and evaluation. We'll use caret for model training with the function `train()` (more on model training in Chapter \@ref(training)). Note however, that using a specific algorithm, which is implemented in a specific package outside caret, also requires that the respective package be installed and loaded. Using caret for specifying the same linear regression model as above, the base-R `lm()` function, can be done with caret in a generalized form as:
```{r message=F, warning=F}
library(caret)
train(
form = GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F,
data = ddf %>% drop_na(),
method = "lm"
)
```
Of course, this is an overkill compared to just writing `lm(...)`. But the advantage of the unified interface is that we can simply replace the `method` argument to use a different ML algorithm. For example, to use a random forest model implemented by the **ranger** package, we can write:
```{r eval=F}
## do not run
train(
form = GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F,
data = ddf,
method = "ranger",
...
)
```
The `...` hints at the fact that there are a few more arguments to be specified for training a random forest model with ranger. More on that in Chapter \@ref(training).
## Recipes
The [**recipes**](https://recipes.tidymodels.org/) package provides another way to specify the *formula* and pre-processing steps in one go and is compatible with caret's `train()` function. For the same formula as above, and an example where the data `ddf_train` is to be centered and scaled, we can specify the "recipe" using the *tidyverse*-style pipe operator as:
```{r}
library(recipes)
pp <- recipe(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F, data = ddf_train) %>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes())
```
The first line assigns *roles* to the different variables. `GPP_NT_VUT_REF` is an *outcome* (in "recipes speak"). Then, we used selectors to apply the recipe step to several variables at once. The first selector, `all_numeric()`, selects all variables that are either integers or real values. The second selector, `-all_outcomes()` removes any outcome (target) variables from this recipe step.
The object `pp` can then be supplied to `train()` as its first argument:
```{r, eval=FALSE}
## do not run
train(
pp,
data = ddf_train,
method = "ranger",
...
)
```
As seen above for the pre-processing example, this does not return a standardized version of the data frame `ddf_train`, but rather the information that allows us to apply the same standardization also to other data sets.