04-formulation.Rmd

# Model formulation

The aim of supervised ML is to find a model $\hat{Y} = f(X)$ so that $\hat{Y}$ agrees well with observations $Y$. We typically start with a research question where $Y$ is given - naturally - by the problem we are addressing and we have a data set at hand where one or multiple predictors (or features) $X$ are recorded along with $Y$. From our data, we have information about how GPP (ecosystem-level photosynthesis) depends on set of abiotic factors, mostly meteorological measurements. 

## Formula notation

In R, it is common to use the *formula* notation to specify the target and predictor variables. You have probably encountered formulas before, e.g., for a linear regression using the `lm()` function. To specify a linear regression model for `GPP_NT_VUT_REF` with three predictors `SW_F_IN`, `VPD_F`, and `TA_F`, we write:

```{r eval=F}
lm(GPP_NT_VUT_REF ~ SW_F_IN + VPD_F + TA_F, data = ddf)
```

## The generic `train()`

Actually, the way we formulate a model is independent of the algorithm, or *engine* that takes care of fitting $f(X)$. As mentioned in Chapter \@ref(preprocessing) the R package [**caret**](https://topepo.github.io/caret/) provides a unified interface for using different ML algorithms implemented in separate packages. In other words, it acts as a *wrapper* for multiple different model fitting, or ML algorithms. This has the advantage that it unifies the interface (the way arguments are provided). caret also provides implementations for a set of commonly used tools for data processing, model training, and evaluation. We'll use caret for model training with the function `train()` (more on model training in Chapter \@ref(training)). Note however, that using a specific algorithm, which is implemented in a specific package outside caret, also requires that the respective package be installed and loaded. Using caret for specifying the same linear regression model as above, the base-R `lm()` function, can be done with caret in a generalized form as:

```{r message=F, warning=F}
library(caret)
train(
  form = GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F, 
  data = ddf %>% drop_na(), 
  method = "lm"
)
```

Of course, this is an overkill compared to just writing `lm(...)`. But the advantage of the unified interface is that we can simply replace the `method` argument to use a different ML algorithm. For example, to use a random forest model implemented by the **ranger** package, we can write:

```{r eval=F}
## do not run
train(
  form = GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F, 
  data = ddf, 
  method = "ranger",
  ...
)
```

The `...` hints at the fact that there are a few more arguments to be specified for training a random forest model with ranger. More on that in Chapter \@ref(training).

## Recipes

The [**recipes**](https://recipes.tidymodels.org/) package provides another way to specify the *formula* and pre-processing steps in one go and is compatible with caret's `train()` function. For the same formula as above, and an example where the data `ddf_train` is to be centered and scaled, we can specify the "recipe" using the *tidyverse*-style pipe operator as:
```{r}
library(recipes)
pp <- recipe(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F, data = ddf_train) %>% 
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())
```

The first line assigns *roles* to the different variables. `GPP_NT_VUT_REF` is an *outcome* (in "recipes speak"). Then, we used selectors to apply the recipe step to several variables at once. The first selector, `all_numeric()`, selects all variables that are either integers or real values. The second selector, `-all_outcomes()` removes any outcome (target) variables from this recipe step.

The object `pp` can then be supplied to `train()` as its first argument:
```{r, eval=FALSE}
## do not run
train(
  pp, 
  data = ddf_train, 
  method = "ranger",
  ...
)
```

As seen above for the pre-processing example, this does not return a standardized version of the data frame `ddf_train`, but rather the information that allows us to apply the same standardization also to other data sets.