First working version
Nils authored and Nils committed Oct 10, 2024
1 parent d797661 commit 2f2b8af
Showing 83 changed files with 1,878 additions and 334 deletions.
17 changes: 17 additions & 0 deletions book/_freeze/code_management/execute-results/html.json

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions book/_freeze/data_variety/execute-results/html.json

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions book/_freeze/data_vis/execute-results/html.json

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions book/_freeze/data_wrangling/execute-results/html.json

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions book/_freeze/interpretable-ml/execute-results/html.json
@@ -0,0 +1,17 @@
{
"hash": "c3dce2af639231f5d4173ee1ddb4c2be",
"result": {
"engine": "knitr",
"markdown": "# Interpretable Machine Learning {#interpretableml}\n\nA great advantage of machine learning models is that they can capture non-linear relationships and interactions between predictors, and that they are effective at making use of large data volumes for learning even faint but relevant patterns thanks to their flexibility (high variance). However, their flexibility, and thus complexity, comes with the trade-off that models are hard to interpret. They are essentially black-box models - we know what goes in and we know what comes out and we can make sure that predictions are reliable (as described in previous chapters). However, we don't understand what the model learned. In contrast, a linear regression model can be easily interpreted by looking at the fitted coefficients and their statistics. \n\nThis motivates *interpretable machine learning*. There are two types of model interpretation methods: model-specific and model-agnostic interpretation. A simple example for a model-specific interpretation method is to compare the *t*-values of the fitted coefficients in a least squares linear regression model. Here, we will focus on the model-agnostic machine learning model interpretation and cover two types of model interpretations: quantifying variable importance, and determining partial dependencies (functional relationships between the target variable and a single predictor, while all other predictors are held constant).\n\nWe re-use the Random Forest model object which we created in Chapter \\@ref(randomforest). As a reminder, we predicted GPP from different environmental variables such as temperature, short-wave radiation, vapor pressure deficit, and others.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# The Random Forest model requires the following models to be loaded:\nrequire(caret)\nrequire(ranger)\n\nrf_mod <- readRDS(\"data/tutorials/rf_mod.rds\")\nrf_mod\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nRandom Forest \n\n1910 samples\n 8 predictor\n\nRecipe steps: center, scale \nResampling: Cross-Validated (5 fold) \nSummary of sample sizes: 1528, 1528, 1529, 1527, 1528 \nResampling results:\n\n RMSE Rsquared MAE \n 1.412345 0.7042977 1.070494\n\nTuning parameter 'mtry' was held constant at a value of 2\nTuning\n parameter 'splitrule' was held constant at a value of variance\n\nTuning parameter 'min.node.size' was held constant at a value of 5\n```\n\n\n:::\n:::\n\n\n## Setup\nIn this Chapter, we will need the following libraries\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\nlibrary(ggplot2)\nlibrary(readr)\nlibrary(tidyr)\nlibrary(caret)\nlibrary(recipes)\n```\n:::\n\n\n## Variable importance\n\nA model-agnostic way to quantify variable importance is to permute (shuffle) the values of an individual predictor, re-train the model, and measure by how much the skill of the re-trained model has degraded in comparison to the model trained on the un-manipulated data. The metric, or loss function, for quantifying the model degradation can be any suitable metric for the respective model type. For a model predicting a continuous variable, we may use the RMSE. The algorithm works as follows (taken from [Boehmke & Greenwell (2019)](https://bradleyboehmke.github.io/HOML/iml.html#partial-dependence)):\n\n<!-- Permuting an important variable with random values will destroy any relationship between that variable and the response variable. The model's performance given by a loss function, e.g. 
its RMSE, will be compared between the non-permuted and permuted model to assess how influential the permuted variable is. A variable is considered to be important, when its permutation increases the model error relative to other variables. Vice versa, permuting an unimportant variable does not lead to a (strong) increase in model error. -->\n\n<!-- The PDPs discussed above give us a general feeling of how important a variable is in our model but they do not quantify this importance directly (but see measures for the \"flatness\" of a PDP [here](https://arxiv.org/abs/1805.04755)). However, we can measure variable importance directly through a permutation procedure. Put simply, this means that we replace values in our training dataset with random values (i.e., we permute the dataset) and assess how this permutation affects the model's performance. -->\n\n``` \n1. Compute loss function L for model trained on un-manipulated data\n2. For predictor variable i in {1,...,p} do\n | Permute values of variable i.\n | Fit model.\n | Estimate loss function Li.\n | Compute variable importance as Ii = Li/L or Ii = Li - L0.\n End\n3. Sort variables by descending values of Ii.\n```\n\nThis is implemented by the {vip} package. Note that the {vip} package has model-specific algorithms implemented but also takes model-agnostic arguments as done below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvip::vip(rf_mod, # Fitted model object\n train = rf_mod$trainingData |> \n dplyr::select(-TIMESTAMP), # Training data used in the model\n method = \"permute\", # VIP method\n target = \"GPP_NT_VUT_REF\", # Target variable\n nsim = 5, # Number of simulations\n metric = \"RMSE\", # Metric to assess quantify permutation\n sample_frac = 0.75, # Fraction of training data to use\n pred_wrapper = predict # Prediction function to use\n )\n```\n\n::: {.cell-output-display}\n![](interpretable-ml_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\nThis indicates that shortwave radiation ('SW_IN_F') is the most important variable for modelling GPP here. I.e., the model performance degrades most (the RMSE increases most) if the information in shortwave radiation is lost. On the other extreme, atmospheric pressure adds practically no information to the model. This variable may therefore well be dropped from the model.\n\n## Partial dependence plots\n\nWe may not only want to know how important a certain variable is for modelling, but also how it influences the predictions. Is the relationship positive or negative? Is the sensitivity of predictions equal across the full range of the predictor? Again, model-agnostic approaches exist for determining the functional relationships (or partial dependencies) for predictors in a model. Partial dependence plots (PDP) give insight on the marginal effect of a single predictor variable on the response - all else equal. The algorithm to create PDPs goes as follows (adapted from [Boehmke & Greenwell (2019)](https://bradleyboehmke.github.io/HOML/iml.html#partial-dependence)):\n\n``` \nFor a selected predictor (x)\n1. Construct a grid of N evenly spaced values across the range of x: {x1, x2, ..., xN}\n2. For i in {1,...,N} do\n | Copy the training data and replace the original values of x with the constant xi\n | Apply the fitted ML model to obtain vector of predictions for each data point.\n | Average predictions across all data points.\n End\n3. 
Plot the averaged predictions against x1, x2, ..., xj\n```\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Visualisation of Partial Dependence Plot algorithm from [Boehmke & Greenwell (2019)](https://bradleyboehmke.github.io/HOML/index.html#acknowledgments). Here, `Gr_Liv_Area` is the variable of interest $x$.](figures/pdp-illustration.png){width=948}\n:::\n:::\n\n\nThis algorithm is implemented by the {pdp} package:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# The predictor variables are saved in our model's recipe\npreds <- \n rf_mod$recipe$var_info |> \n dplyr::filter(role == \"predictor\") |> \n dplyr::pull(variable)\n\n# The partial() function can take n=3 predictors at max and will try to create\n# a n-dimensional visulaisation to show interactive effects. However, \n# this is computational intensive, so we only look at the simple \n# response-predictor plots\nall_plots <- purrr::map(\n preds,\n ~pdp::partial(\n rf_mod, # Model to use\n ., # Predictor to assess\n plot = TRUE, # Whether output should be a plot or dataframe\n plot.engine = \"ggplot2\" # to return ggplot objects\n )\n)\n\npdps <- cowplot::plot_grid(all_plots[[1]], all_plots[[2]], all_plots[[3]], \n all_plots[[4]], all_plots[[5]], all_plots[[6]])\n\npdps\n```\n\n::: {.cell-output-display}\n![](interpretable-ml_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\nThese PDPs show that the variables `TA_F`, `SW_IN_F`, and `LW_IN_F` have a strong effect, while `VPD_F`, `P_F`, and `WS_F` have a relatively small marginal effect as indicated by the small range in `yhat` - in line with the variable importance analysis shown above. In addition to the variable importance analysis, here we also see the *direction* of the effect and that how the sensitivity varies across the range of the respective predictor. For example, GPP is positively influenced by temperature (`TA_F`), but the effect really only starts to be expressed for temperatures above about -5$^\\circ$C, and the positive effect disappears above about 10$^\\circ$C. The pattern is relatively similar for `LW_IN_F`, which is sensible because long-wave radiation is highly correlated with temperature. For the short-wave radiation `SW_IN_F`, we see the saturating effect of light on GPP that we saw in previous chapters.\n\n<!--# TODO: Why does VPD have no negative effect on GPP at high values? Maybe this could be discussed in terms of a model not necessarily being able to capture physical processes.-->\n\n<!--# Should we include ICE? -->\n\n\n",
"supporting": [
"interpretable-ml_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
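The frozen chapter above spells out the PDP algorithm step by step. As a minimal hand-rolled sketch of that loop (illustrative only, not the {pdp} implementation; `train_df` is an assumed stand-in for the model's training data):

```r
# Hand-rolled partial dependence for one predictor (sketch; `rf_mod` is the
# fitted model from the chapter, `train_df` a hypothetical training data frame).
partial_dependence <- function(model, df, predictor, n = 20) {
  grid <- seq(min(df[[predictor]], na.rm = TRUE),
              max(df[[predictor]], na.rm = TRUE),
              length.out = n)
  yhat <- sapply(grid, function(x) {
    df_x <- df
    df_x[[predictor]] <- x                  # hold the predictor constant at x
    mean(predict(model, newdata = df_x))    # average predictions over all rows
  })
  data.frame(grid = grid, yhat = yhat)
}

# e.g. pd <- partial_dependence(rf_mod, train_df, "SW_IN_F")
```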
17 changes: 17 additions & 0 deletions book/_freeze/open_science/execute-results/html.json

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions book/_freeze/randomforest/execute-results/html.json

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions book/_freeze/regression_classification/execute-results/html.json

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions book/_freeze/supervised_ml_I/execute-results/html.json

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions book/_freeze/supervised_ml_II/execute-results/html.json

Large diffs are not rendered by default.

2 changes: 0 additions & 2 deletions book/_quarto.yml
@@ -26,8 +26,6 @@ book:
- supervised_ml_II.qmd
- randomforest.qmd
- interpretable-ml.qmd
- basicr.qmd
- ggplot.qmd
- references.qmd
favicon: "figures/favicon.ico"
twitter-card: true
2 changes: 2 additions & 0 deletions book/code_management.qmd
@@ -13,6 +13,8 @@ You will learn how to:
- Collaborate with others
- Ensure reproducibility of your project by openly sharing your work and progress.



## Tutorial

Code management is key to any data science project, especially when collaborating. Proper code management limits mistakes, such as code loss, and increases efficiency by structuring projects.
9 changes: 9 additions & 0 deletions book/data_variety.qmd
@@ -17,6 +17,15 @@ In this chapter you will learn:
- how to read and/or write data in a particular file format
- how to query an API and store its data locally

## Setup
In this chapter, we will need the following libraries:
```{r results=TRUE, message=FALSE, warning=FALSE}
library(ggplot2)
library(readr)
library(lubridate)
library(dplyr)
```
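A minimal sketch of the file reading and writing named in the learning goals above; the file paths are hypothetical placeholders:

```r
# Read a comma-separated file with {readr} and write it back out (sketch;
# "data/example.csv" is a hypothetical path).
library(readr)

df <- read_csv("data/example.csv")       # parse a CSV into a data frame
write_csv(df, "data/example_copy.csv")   # write the data frame back to disk
```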

## Tutorial

### Files and file formats
9 changes: 9 additions & 0 deletions book/data_vis.qmd
@@ -12,6 +12,15 @@ You will learn, among other things:
- grammar of graphics, i.e., using the {ggplot2} library
- the proper use of colours in visualization

## Setup
In this chapter, we will need the following libraries:
```{r results=TRUE, message=FALSE, warning=FALSE}
library(ggplot2)
library(readr)
library(lubridate)
library(dplyr)
```
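As a first taste of the grammar of graphics listed in the goals above, a minimal sketch using the built-in `mtcars` dataset (not the chapter's data):

```r
# Grammar of graphics in one expression: data, aesthetic mapping, geometry layer.
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
```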

## Tutorial

Visualizations often take center stage and are the main vehicles for conveying information in scientific publications and (ever more often) in the media. Visualizations communicate data and its patterns in visual form. Visualizing data is also an integral part of the exploratory data analysis cycle. Visually understanding the data guides its transformation and the identification of suitable models and analysis methods.
12 changes: 12 additions & 0 deletions book/data_wrangling.qmd
@@ -14,6 +14,17 @@ You will learn how to:
- Aggregate data
- Handle bad and/or missing data


## Setup
In this chapter, we will need the following libraries:
```{r results=TRUE, message=FALSE, warning=FALSE}
library(dplyr)
library(lubridate)
library(tidyr)
library(readr)
library(stringr)
library(purrr)
```
## Tutorial

Exploratory data analysis - the transformation, visualization, and modelling of data - is the central part of any (geo-) data science workflow and typically takes up a majority of the time we spend on a research project. The transformation of data often turns out to be particularly (and often surprisingly) time-demanding. Therefore, it is key to master typical steps of data transformation, and to implement them in a transparent fashion and efficiently - both in terms of robustness against coding errors ("bugs") and in terms of code execution speed.
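As an appetizer for the aggregation steps covered below, a minimal sketch of a typical transformation pipeline (the column names `TIMESTAMP_START` and `GPP_NT_VUT_REF` are assumptions here, and `TIMESTAMP_START` is assumed to be an already-parsed datetime column):

```r
# Aggregate half-hourly fluxes to daily means (sketch; column names assumed).
library(dplyr)
library(lubridate)

daily_fluxes <- half_hourly_fluxes |>
  mutate(date = as_date(TIMESTAMP_START)) |>   # datetime -> calendar date
  group_by(date) |>
  summarise(gpp = mean(GPP_NT_VUT_REF, na.rm = TRUE))
```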
@@ -86,6 +97,7 @@ For our further data exploration, we will reduce the data frame we are working with
This is implemented by:

```{r}
half_hourly_fluxes <- select(
half_hourly_fluxes,
starts_with("TIMESTAMP"),
11 changes: 11 additions & 0 deletions book/interpretable-ml.qmd
@@ -15,6 +15,17 @@ rf_mod <- readRDS("data/tutorials/rf_mod.rds")
rf_mod
```

## Setup
In this chapter, we will need the following libraries:
```{r results=TRUE, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(caret)
library(recipes)
```

## Variable importance

A model-agnostic way to quantify variable importance is to permute (shuffle) the values of an individual predictor, re-train the model, and measure by how much the skill of the re-trained model has degraded in comparison to the model trained on the un-manipulated data. The metric, or loss function, for quantifying the model degradation can be any suitable metric for the respective model type. For a model predicting a continuous variable, we may use the RMSE. The algorithm works as follows (taken from [Boehmke & Greenwell (2019)](https://bradleyboehmke.github.io/HOML/iml.html#partial-dependence)):
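The permutation idea is compact enough to sketch by hand. The following is an illustrative sketch only, not the {vip} implementation, and it skips the re-fitting step for brevity (it only re-predicts on permuted data); `train_df` is an assumed stand-in for the training data:

```r
# Minimal permutation-importance sketch (re-predicts rather than re-fits).
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2, na.rm = TRUE))

permutation_importance <- function(model, df, target) {
  base_loss <- rmse(df[[target]], predict(model, newdata = df))
  predictors <- setdiff(names(df), target)
  sapply(predictors, function(p) {
    df_perm <- df
    df_perm[[p]] <- sample(df_perm[[p]])   # shuffle a single predictor
    rmse(df[[target]], predict(model, newdata = df_perm)) - base_loss
  })
}

# e.g. imp <- permutation_importance(rf_mod, train_df, "GPP_NT_VUT_REF")
```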
1 change: 1 addition & 0 deletions book/open_science.qmd
@@ -14,6 +14,7 @@ In this chapter you will learn how to:
- use dynamic reporting
- ensure data and code retention


## Tutorial

The scientific method relies on repeated testing of a hypothesis. When dealing with data and formal analysis, one can reduce this problem to the question: could an independent scientist attain the same results given the described methodology, data and code?
12 changes: 12 additions & 0 deletions book/randomforest.qmd
@@ -19,6 +19,18 @@ You will learn:
- the principles of a *decision tree*, and the purpose of *bagging*
- how decision trees make up a Random Forest

## Setup
In this chapter, we will need the following libraries:
```{r results=TRUE, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(caret)
library(recipes)
library(lubridate)
```

## Tutorial

### Decision trees
7 changes: 7 additions & 0 deletions book/regression_classification.qmd
@@ -15,6 +15,13 @@ After completing this tutorial, you will be able to:

Contents of this chapter are inspired by and partly adapted from the excellent book by [Boehmke and Greenwell](https://bradleyboehmke.github.io/HOML/).

## Setup
In this chapter, we will need the following libraries:
```{r results=TRUE, message=FALSE, warning=FALSE}
library(ggplot2)
library(dplyr)
```

## Tutorial

### Types of models
Expand Down
11 changes: 11 additions & 0 deletions book/supervised_ml_I.qmd
@@ -10,6 +10,17 @@ Basic steps of the implementation of supervised machine learning are introduced,

Contents of this chapter are inspired by and partly adapted from the excellent book [Hands-On Machine Learning in R by Boehmke & Greenwell](https://bradleyboehmke.github.io/HOML/).

## Setup
In this chapter, we will need the following libraries:
```{r results=TRUE, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(caret)
library(recipes)
```

## Tutorial

### What is supervised machine learning?
Expand Down
11 changes: 11 additions & 0 deletions book/supervised_ml_II.qmd
@@ -8,6 +8,17 @@ In Chapter \@ref(supervisedmli), you learned how the data are pre-processed,

In this chapter, you will learn more about the process of model training, the concept of the *loss*, and how we can choose the right level of model complexity for optimal model generalisability as part of the model training step. This completes your set of skills for your first implementation of a supervised machine learning workflow.
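As a preview of the *loss*, a tiny illustrative example using the RMSE (the numbers are made up; the chapter's own loss choice may differ):

```r
# RMSE as a loss function: the quantity minimised during model training.
loss_rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

loss_rmse(obs = c(1, 2, 3), pred = c(1.1, 1.9, 3.2))  # ~0.14
```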

## Setup
In this chapter, we will need the following libraries:
```{r results=TRUE, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(caret)
library(recipes)
```

## Tutorial

### Data and the modelling challenge
Expand Down