---
title: Forecasting Swiss Milk Prices with Machine Learning
format:
  gfm:
    output-file: README.md
link-citations: true
bibliography: resources/references.bib
---
The Milk Price Forecasting Model is a pilot project of the Federal Office for Agriculture. Implemented as an R script, it integrates various statistical and machine learning models and offers a detailed analysis of milk price trends and predictions.
## User Guide
In order to run the milk price forecasting model, a few conditions must be met:
- The Excel file `milk-price-data.xlsx` with the relevant data must be located in the same directory as the R script.
- Additional values may be added directly to the Excel file, but the header names must remain the same.
- Before running the R script, the working directory must be set correctly.
Running the R script will trigger a series of processes including data preprocessing, model training and evaluation, and the generation of visualizations and reports.
The output from the script is multi-faceted -- but everything will be written into a directory called `output`. Two PDF reports are produced: the file `time-series-decomposition.pdf` summarizes the data and visualizes the seasonal decomposition of each variable, while `machine-learning-report.pdf` summarizes the models applied and their respective forecasts. In addition, various tables with numerical results are written to the same directory.
::: {#fig-output layout-ncol=3 layout-nrow=2}
![`correlations.xlsx`](resources/xlsx.svg){width=80px}
![`correlations.xlsx`](resources/xlsx.svg){width=80px}
![`correlations.xlsx`](resources/xlsx.svg){width=80px}
![`machine-learning-report.pdf`](resources/pdf.svg){width=60px}
![`time-series-decomposition.pdf`](resources/pdf.svg){width=60px}
Overview of the different files written into the directory `output` by the R script `script.R`.
:::
This documentation describes the script's capabilities and underlying principles, and offers guidance for users to leverage its functionalities effectively.
# How does the script work?
## Setting up the work space
The script begins by setting up the necessary data environment, using `milk-price-data.xlsx` as its primary data source.
By default, it targets `CH_Milchpreis` for prediction (this can be changed by adjusting the variable `target`) and reserves the final 18 months of data for testing.
This number can be adjusted via the variable `test_set_size`.
The models incorporate a broad spectrum of features, encompassing various milk price types and related economic indicators, allowing for a comprehensive analysis.
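A minimal sketch of this setup step is shown below; the variable names `target` and `test_set_size` appear in the script, while the use of the `readxl` package and the object name `data` are illustrative assumptions.

```r
# Read the source data; milk-price-data.xlsx must sit next to the script
library(readxl)                                   # assumed package for reading .xlsx files

data <- as.data.frame(read_excel("milk-price-data.xlsx"))

# Variable to forecast and number of months held back for testing
target        <- "CH_Milchpreis"                  # can be set to any other column name
test_set_size <- 18                               # the final 18 observations form the test set
```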
## Exploratory data analysis
The first part of the code generates a PDF file called `time-series-decomposition.pdf`. As the name suggests, this file mainly contains seasonal decompositions, but also some additional exploratory analyses.
### Correlation matrix and missing values
The PDF report starts by showing a visualization of all missing values in the data later used to train the forecasting models. This simply serves as an overview -- for the purpose of forecasting, `NA` values are replaced by zero.
Next, the report produces a correlation matrix of all features in the data set (@fig-correlations). Positively correlated features are highlighted in red, negatively correlated features in blue. As can be seen, most features are either positively or negatively correlated with each other -- there are relatively few uncorrelated variables. This means we are dealing with a high degree of multicollinearity in the feature matrix.
![Visualization of the correlation matrix of the feature data set. As can be seen, there are many positively and negatively correlated features -- as well as some that are seemingly independent of the others.](resources/correlations.png){#fig-correlations}
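This overview could be produced along the following lines; the `corrplot` package and the column handling are assumptions for illustration, not necessarily the script's exact calls.

```r
library(corrplot)   # assumed plotting package for the correlation matrix

# Keep only the numeric feature columns
num_features <- data[sapply(data, is.numeric)]

# Number of missing values per feature (for modelling, NAs are later replaced by zero)
colSums(is.na(num_features))

# Correlation matrix over complete pairs of observations
corr <- cor(num_features, use = "pairwise.complete.obs")
corrplot(corr)
```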
### Seasonal decomposition
The next part generates plots of a seasonal decomposition of the data (@fig-seasonal). It breaks down each time series into trend, seasonal, and residual elements, offering insights into the underlying patterns that govern changes over time.
![Example of a seasonal decomposition plot for `CH_Milchpreis`. The milk price is decomposed into a trend (via a moving average), a seasonal effect and the remainder. The bars on the right indicate the relative magnitude of the effects.](resources/decomposition.png){#fig-seasonal}
Time series plots are generated for every single feature. However, if no significant seasonal effect is detected using Ollech and Webel's combined seasonality test, the plot shows the feature value over time only [@seastests2021; @ollech2020].
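Conceptually, the per-feature logic can be sketched as follows; the `seastests` package provides the combined test, while the start date and the use of `decompose()` are illustrative assumptions.

```r
library(seastests)  # provides the Ollech and Webel combined seasonality test

# Turn one feature into a monthly time series (start date assumed for illustration)
x <- ts(data$CH_Milchpreis, start = c(2000, 1), frequency = 12)

if (isSeasonal(x, test = "combined")) {
  plot(decompose(x))  # trend (moving average), seasonal effect and remainder
} else {
  plot(x)             # no significant seasonality: plot the raw series only
}
```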
## Forecasting the Swiss milk price
### Data preparation
First, the data is prepared: the `time` feature is replaced by corresponding sine and cosine transformations. The target variable, i.e. `CH_Milchpreis`, is lagged with different forecast horizons, `h = 1`, `h = 2`, and `h = 3`. All `NA` values in the features are replaced by zero -- except for the ones created by lagging the target variable.
Furthermore, all original features are z-score normalized. This makes model fitting more reliable and gives all variables equal weight.
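These steps can be sketched roughly as follows; the column handling is an illustrative assumption and presumes that `time` is stored as a date and that all remaining features are numeric.

```r
# Cyclic encoding of the month replaces the raw `time` feature
month <- as.integer(format(data$time, "%m"))
data$time_sin <- sin(2 * pi * month / 12)
data$time_cos <- cos(2 * pi * month / 12)
data$time     <- NULL

# Future values of the target serve as responses for the horizons h = 1, 2, 3
for (h in 1:3) {
  data[[paste0(target, "_h", h)]] <- c(data[[target]][-seq_len(h)], rep(NA, h))
}

# Remaining NAs in the features are set to zero, then the features are z-scored
feature_cols <- setdiff(names(data), paste0(target, "_h", 1:3))
data[feature_cols] <- lapply(data[feature_cols], function(col) replace(col, is.na(col), 0))
data[feature_cols] <- scale(data[feature_cols])
```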
### Evaluation
To evaluate the different forecasting models, the root mean squared error is used (RMSE, @eq-rmse). It compares the predicted milk price for a following period ($\hat x_{t+1}$) with the recorded value ($x_{t+1}$).
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( x_{t+1} - \hat x_{t+1} \right)^2}$${#eq-rmse}
One apparent advantage of the RMSE as a performance metric is its interpretability -- it shares the same unit as the variable of interest (i.e., CHF) and roughly corresponds to the typical deviation of the forecasts from the observed values.
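As a small illustration, the metric boils down to a one-line helper (the numbers in the usage example are made up):

```r
# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

rmse(c(71.2, 70.8, 69.5), c(70.9, 71.1, 70.2))  # approx. 0.47
```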
The prepared data is split into training and testing sets. The training set consists of all data up until `test_set_size = 18` months before the last entry. These observations are used to calibrate the model. Correspondingly, the latest `test_set_size = 18` observations are used to test the models. In the `machine-learning-report.pdf` file, the model performance on the test set is reported (@fig-performance).
![Performance metric (RMSE) for the four employed machine learning models -- each for a forecasting horizon of h = 1, 2, and 3 months.](resources/performance.png){#fig-performance}
The machine learning models are trained using cross-validation. Since the data form a time series, the values are heavily autocorrelated. To avoid overfitting, the cross-validation folds are therefore split over time, specifically into six-month periods (@fig-crossvalidation).
```{r}
#| echo: false
#| eval: false
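# Monthly price series and matching time axis (2000 to early 2023), used only to draw this figure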
y <- c(80.7,80.2,79.7,78.3,75.6,78.3,79.2,80.2,81.1,80.3,79.6,79.7,79.5,79.3,76.6,76.2,77.7,80.1,81.6,82.5,83.2,82.1,81.3,81.2,80.5,80.5,78.6,77.7,76.3,78.7,79.4,77.2,78.6,80.6,77.3,75.9,75.4,75.6,73.3,72.5,72.6,74.2,74.5,77,78.2,78.9,77.4,76.8,76.3,75.9,74.5,72.3,70.6,73.1,74.3,76.1,77.1,76.6,75.1,74.9,74.3,72.8,72,71.3,68,70.5,71.9,74.2,74.7,74.7,73,72.7,72.8,71.7,71,70.1,69.9,71.5,72.6,74.7,75.1,74.6,73.7,73.1,70.4,68.2,67.5,66.9,66.5,69.4,71.3,73,73.8,74.4,75.7,75.4,77.2,74.9,73.5,73.3,72.5,76.4,82.1,83.6,84.3,83.9,79.8,79.9,74.2,71.2,68.8,64.9,63.7,66.6,64.5,64.5,65.1,65.7,64.4,64.2,64.1,63,62.4,60.5,60,61.9,63.3,64.5,65.3,65.6,64.3,64.1,62.9,62.9,62,62.6,62.7,64.4,64.9,65,65.2,65.5,63.5,62.8,62.5,62.9,61.5,61,59.4,60.5,61.5,61.4,62.3,63.3,62.9,63,63.2,62.9,62.5,62.1,62.2,66.1,67.7,68.3,69.7,70,70.9,70.2,70.3,69.8,69,67.6,67.7,68.5,69,69.2,69.4,67.3,66.1,64.9,63.4,61,59.1,58.9,59.2,60.6,62.2,62.9,64,64.8,64.3,62.9,62.6,61,59.4,57.4,58.1,59.3,61.2,61.2,61.8,62.8,62,61.7,61.9,60.3,58.6,58.4,59.3,60.9,63.6,63.8,64.8,66.6,66.1,65.3,64.3,63.2,60.9,60.3,60.6,63.4,65,66.1,66.8,66.9,65.8,65,64.3,61.5,60.4,60.4,60.5,63.2,65.3,66.9,68.8,68.1,66.9,66.5,65.8,63.6,63.3,63.1,62.7,65.2,67.5,68.4,69.2,69.6,67.9,67.8,69,67,66.8,66.4,66.6,69.4,71.5,72.2,72.9,73.4,72,71.9,71.6,69.9,69.1,70.3,72.5,75.1,78.6,79.4,80.1,80.9,78.9,79.5,78.5,76.5,74.5,73.8)
x <- seq(2000,2023+3/12,1/12)
par(mar = c(2,1,2,1))
plot.new()
plot.window(xlim = c(2000,2023), ylim = c(55,90), xaxs = "i")
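# Alternating light/dark bands mark the six-month cross-validation folds within the training period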
for (i in 0:21) {
polygon(x = c(2000+i,2000+i,2000+i+0.5,2000+i+0.5), y = c(0,100,100,0), col = "azure2")
polygon(x = c(2000+i+0.5,2000+i+0.5,2000+i+1,2000+i+1), y = c(0,100,100,0), col = "azure3")
}
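# Highlight the final 18 months of data, which form the test set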
polygon(x = c(2023+3/12-18/12,2023+3/12-18/12,2023,2023), y = c(0,100,100,0), col = "firebrick1")
axis(1)
abline(v = seq(2000,2021.5,0.5), lty = 3)
abline(v = 2023+3/12-18/12)
box()
lines(x, y, lwd = 2)
text(x = c(2011,2022.3), y = 90, labels = c("Training data, split into 44 adherent folds for cross validation", "Test data"), pos = 3, xpd = NA, col = c("azure4","firebrick2"), font = 2)
```
![Visualization of the data division into test and training data, as well as the subdivision into folds of the training data for cross validation.](resources/crossvalidation.png){#fig-crossvalidation}
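One simple way to encode this split and fold structure is a chronological cut followed by an index vector that assigns each training observation to a six-month block; the sketch below reuses `data` and `test_set_size` from the setup sketch above, and the resulting `foldid` vector could be passed, e.g., to the `foldid` argument of `cv.glmnet`.

```r
# Chronological split: everything except the last 18 months is used for training
n          <- nrow(data)
train_data <- data[1:(n - test_set_size), ]
test_data  <- data[(n - test_set_size + 1):n, ]

# Assign each training observation (one row per month) to a six-month fold
foldid <- ceiling(seq_len(nrow(train_data)) / 6)  # 1,1,1,1,1,1,2,2,... one fold per half year

table(foldid)  # six observations per fold (the last fold may be shorter)
```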
The pairs panel included in the `machine-learning-report.pdf` file also illustrates the performance of the different models (@fig-pairs). Furthermore, it shows how the different model predictions are correlated -- lasso and ridge regression make similar predictions, and so do the ARIMA and SARIMA models.
![Pair panels of the different model predictions as well as the observed values for a forecasting horizon of h = 1.](resources/pairs.png){#fig-pairs}
For the milk price forecasting, four different models are fit to the data: a lasso regression model, a ridge regression model, an ARIMA model, and a SARIMA model. These models are briefly described below. The file `machine-learning-report.pdf` displays all the forecast values of the Swiss milk price (`CH_Milchpreis`, @fig-forecasting).
![Forecast Swiss milk price for the next three months, shown for each model.](resources/forecasting.png){#fig-forecasting}
### Forecasting with lasso and ridge regression
Generally speaking, a forecasting model $f$ predicts a future value of $X$, say $X_{t+1}$, based on a past value $X_t$. Autoregressive models such as ARIMA and SARIMA recursively predict further values based on the predictions already made. For the ridge and lasso regression models, this is not possible: they only predict the Swiss milk price (denoted as $X_{1,t+h}$), while using many other features (@eq-forecasting). This means that after one round of predictions, these other feature values are missing, so the forecast cannot be continued autoregressively.
$$f: X_{1,t+h} = \beta_0 + \sum_{i=1}^p \beta_i X_{i,t}$${#eq-forecasting}
For the purpose of predicting milk prices more than one month into the future, three different models are trained in the script -- each with a different forecast horizon $h$. Consequently, three instances per model can be compared with each other in the end.[^arima]
[^arima]: The autoregressive models (ARIMA and SARIMA) don't need to be retrained for specific forecasting horizons; they simply forecast the next values based on the ones already forecast. Thus, the ARIMA and SARIMA models are only trained once.
Ridge and Lasso regression are especially effective in handling multicollinearity and preventing overfitting. Both models are linear in nature, i.e. the response variable $\mathbf y$ can be expressed as a linear combination of predictors $\mathbf X \boldsymbol \beta$ plus some residual $\boldsymbol \varepsilon$.
$$\mathbf y = \mathbf X \boldsymbol \beta + \boldsymbol \varepsilon$${#eq-linear}
Ridge regression introduces an $\ell_2$-penalty proportional to the square of the coefficient magnitudes. In contrast, lasso regression employs an $\ell_1$-penalty, encouraging sparser solutions -- only a few predictors are selected.
```math
\boldsymbol {\hat \beta}_{\text{ridge}} = \arg\min_{\boldsymbol \beta} \left\{ \| \mathbf y - \mathbf X \boldsymbol \beta \|_2^2 + \lambda \| \boldsymbol \beta\|_2^2 \right\} \qquad \boldsymbol {\hat \beta}_{\text{lasso}} = \arg\min_{\boldsymbol \beta} \left\{ \| \mathbf y - \mathbf X \boldsymbol \beta \|_2^2 + \lambda \| \boldsymbol \beta\|_1 \right\}
```
The mathematical formulations for these regressions are centered around minimizing the sum of squared residuals, with added regularization terms ($\ell_2$-norm for Ridge and $\ell_1$-norm for Lasso).
The hyperparameter $\lambda$, which penalizes large coefficients, is selected via cross-validation with the function `cv.glmnet` from the `glmnet` package. @fig-coefficients shows the magnitude of the coefficients under the $\ell_1$ penalty (lasso regression) and the $\ell_2$ penalty (ridge regression).
![Visualization of the coefficient magnitude under $\ell_1$ penalty (lasso regression), and $\ell_2$ penalty (ridge regression).](resources/coefficients.png){#fig-coefficients}
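A sketch of this fitting step is given below; the design matrix `X_train`, the response `y_train` and the fold vector `foldid` are assumed to have been derived from the prepared training data as described above, and `X_test` denotes the corresponding test features.

```r
library(glmnet)

# alpha = 1 gives the lasso penalty, alpha = 0 the ridge penalty
lasso_fit <- cv.glmnet(X_train, y_train, alpha = 1, foldid = foldid)
ridge_fit <- cv.glmnet(X_train, y_train, alpha = 0, foldid = foldid)

# lambda.min is the penalty with the lowest cross-validated error
predict(lasso_fit, newx = X_test, s = "lambda.min")
```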
@fig-regularization shows both the cross-validated error and the coefficient magnitudes as a response to an increasing penalty term $\lambda$. As can be seen, lasso regression drives the coefficients to zero one by one, whereas with ridge regression they only approach zero (but never reach it).
![Cross-validated mean squared error as a response to an increasing penalty term $\lambda$ (top) as well as the magnitude of the different coefficients as a response to an increasing penalty term $\lambda$ (bottom).](resources/regularization.png){#fig-regularization}
### Forecasting with (seasonal) ARIMA
Additionally, the script employs an autoregressive integrated moving average (ARIMA) and its seasonal variant (SARIMA). ARIMA is most suitable for non-seasonal data and combines autoregressive and moving average components. In contrast, SARIMA extends this to accommodate seasonal fluctuations in data, making it more robust for datasets with seasonal trends.
The two models mainly serve as a benchmark for the ridge and lasso regression models.
Both models are fit using the `arima` function from the R package `stats`. In both cases, the order ($p$, $d$, $q$) needs to be specified: $p$ is the autoregressive order, $d$ the degree of differencing, and $q$ the moving average order. In both models, as well as for the seasonal component of the SARIMA model, the order is set to $p = 1$, $d = 1$ and $q = 1$.
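A sketch of this step with `stats::arima`; the training series `y_train` and the monthly seasonal period of 12 are assumptions consistent with the data described above.

```r
# Non-seasonal ARIMA(1,1,1)
arima_fit  <- arima(y_train, order = c(1, 1, 1))

# SARIMA(1,1,1)(1,1,1)[12]: same order plus a yearly seasonal component
sarima_fit <- arima(y_train, order = c(1, 1, 1),
                    seasonal = list(order = c(1, 1, 1), period = 12))

# Forecast the next three months recursively
predict(sarima_fit, n.ahead = 3)$pred
```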
# References {-}