-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmodelling.Rmd
94 lines (71 loc) · 2.82 KB
/
modelling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
---
title: "Modelling"
author: "Guillem Hurault"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
html_document:
toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
message = FALSE,
warning = FALSE,
fig.height = 5,
fig.width = 12,
dpi = 200)
```
We considered two models:
- a Poisson regression with Elastic Net regularisation ("glmnet")
- a gradient boosting machine ("gbm") with a Poisson distribution
The targets pipeline has alread fitted the models, computed feature importance and generated predictions.
In this report, we just inspect the results.
```{r initialisation}
library(tidyverse)
glmnet_fit <- targets::tar_read(glmnet_fit)
gbm_fit <- targets::tar_read(gbm_fit)
imp <- targets::tar_read(feature_importance)
perf <- targets::tar_read(performance)
```
## Check convergence
First, we inspect whether the two models have converged.
```{r check-fit}
par(mfrow = c(1, 2))
plot(glmnet_fit, main = "glmnet fit\n")
# print(gbm_fit)
gbm::gbm.perf(gbm_fit, method = "cv")
title(main = "gbm fit")
par(mfrow = c(1, 1))
```
## Predictive performance
We assess the predictive performance of the model by generating predictions for the test set and comparing the predictions to the true values using the RMSE and MAE (the lower the better).
We also compute the performance of an intercept-only model to serve as a reference.
```{r predictive-performance}
perf %>%
pivot_longer(!Model, names_to = "Metric", values_to = "Value") %>%
ggplot(aes(x = Model, y = Value)) +
facet_grid(rows = vars(Metric)) +
geom_col() +
scale_y_continuous(expand = expansion(mult = c(0, .05))) +
coord_flip() +
labs(x = "", y = "") +
theme_bw(base_size = 15)
```
Here, the glmnet model performs better than the gbm model.
## Feature identification
We define the importance of features by:
- absolute value of coefficients for the glmnet model
- relative importance for the gbm model
We assess the identification of importance features by computing the Spearman rank correlation coefficient between the importance of the features given by the model and the true importance of the features (absolute value of coefficients used to generate the data).
The higher the correlation the better.
```{r feature-importance}
tibble(Model = c("glmnet", "gbm"),
Correlation = c(cor(imp$True, imp$glmnet, method = "spearman"),
cor(imp$True, imp$gbm, method = "spearman"))) %>%
ggplot(aes(x = Model, y = Correlation)) +
geom_col() +
scale_y_continuous(expand = expansion(mult = c(0, .05))) +
coord_flip() +
labs(x = "", y = "Spearman correlation with true importance of parameters") +
theme_bw(base_size = 15)
```
Here, the glmnet model better identifies the most important features than the gbm model.