-
Notifications
You must be signed in to change notification settings - Fork 62
/
Copy path25-elnet.Rmd
222 lines (156 loc) · 7.88 KB
/
25-elnet.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
# Elastic Net
```{r elnet_opts, include = FALSE}
knitr::opts_chunk$set(cache = TRUE, autodep = TRUE, fig.align = "center")
options(digits = 4)
```
We again use the `Hitters` dataset from the `ISLR` package to explore another shrinkage method, **elastic net**, which combines the *ridge* and *lasso* methods from the previous chapter.
```{r}
data(Hitters, package = "ISLR")
Hitters = na.omit(Hitters)
```
We again remove the missing data, which was all in the response variable, `Salary`.
```{r}
tibble::as_tibble(Hitters)
```
```{r}
dim(Hitters)
```
Because this dataset isn't particularly large, we will forego a test-train split, and simply use all of the data as training data.
```{r, message = FALSE, warning = FALSE}
library(caret)
library(glmnet)
```
Since he have loaded `caret`, we also have access to the `lattice` package which has a nice histogram function.
```{r}
histogram(Hitters$Salary, xlab = "Salary, $1000s",
main = "Baseball Salaries, 1986 - 1987")
```
## Regression
Like ridge and lasso, we again attempt to minimize the residual sum of squares plus some penalty term.
$$
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda\left[(1-\alpha)||\beta||_2^2/2 + \alpha ||\beta||_1\right]
$$
Here, $||\beta||_1$ is called the $l_1$ norm.
$$
||\beta||_1 = \sum_{j=1}^{p} |\beta_j|
$$
Similarly, $||\beta||_2$ is called the $l_2$, or Euclidean norm.
$$
||\beta||_2 = \sqrt{\sum_{j=1}^{p} \beta_j^2}
$$
These both quantify how "large" the coefficients are. Like lasso and ridge, the intercept is not penalized and `glmnet` takes care of standardization internally. Also reported coefficients are on the original scale.
The new penalty is $\frac{\lambda \cdot (1-\alpha)}{2}$ times the ridge penalty plus $\lambda \cdot \alpha$ times the lasso lasso penalty. (Dividing the ridge penalty by 2 is a mathematical convenience for optimization.) Essentially, with the correct choice of $\lambda$ and $\alpha$ these two "penalty coefficients" can be any positive numbers.
Often it is more useful to simply think of $\alpha$ as controlling the mixing between the two penalties and $\lambda$ controlling the amount of penalization. $\alpha$ takes values between 0 and 1. Using $\alpha = 1$ gives the lasso that we have seen before. Similarly, $\alpha = 0$ gives ridge. We used these two before with `glmnet()` to specify which to method we wanted. Now we also allow for $\alpha$ values in between.
```{r}
set.seed(42)
cv_5 = trainControl(method = "cv", number = 5)
```
We first setup our cross-validation strategy, which will be 5 fold. We then use `train()` with `method = "glmnet"` which is actually fitting the elastic net.
```{r}
hit_elnet = train(
Salary ~ ., data = Hitters,
method = "glmnet",
trControl = cv_5
)
```
First, note that since we are using `caret()` directly, it is taking care of dummy variable creation. So unlike before when we used `glmnet()`, we do not need to manually create a model matrix.
Also note that we have allowed `caret` to choose the tuning parameters for us.
```{r}
hit_elnet
```
Notice a few things with these results. First, we have tried three $\alpha$ values, `0.10`, `0.55`, and `1`. It is not entirely clear why `caret` doesn't use `0`. It likely uses `0.10` to fit a model close to ridge, but with some potential for sparsity.
Here, the best result uses $\alpha = 0.10$, so this result is somewhere between ridge and lasso, but closer to ridge.
```{r}
hit_elnet_int = train(
Salary ~ . ^ 2, data = Hitters,
method = "glmnet",
trControl = cv_5,
tuneLength = 10
)
```
Now we try a much larger model search. First, we're expanding the feature space to include all interactions. Since we are using penalized regression, we don't have to worry as much about overfitting. If many of the added variables are not useful, we will likely use a model close to lasso which makes many of them 0.
We're also using a larger tuning grid. By setting `tuneLength = 10`, we will search 10 $\alpha$ values and 10 $\lambda$ values for each. Because of this larger tuning grid, the results will be very large.
To deal with this, we write a quick helper function to extract the row with the best tuning parameters.
```{r}
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
```
We then call this function on the trained object.
```{r}
get_best_result(hit_elnet_int)
```
We see that the best result uses $\alpha = 1$, which makes since. With $\alpha = 1$, many of the added interaction coefficients are likely set to zero. (Unfortunately, obtaining this information after using `caret` with `glmnet` isn't easy. The two don't actually play very nice together. We'll use `cv.glmnet()` with the expanded feature space to explore this.)
Also, this CV-RMSE is better than the lasso and ridge from the previous chapter that did not use the expanded feature space.
We also perform a quick analysis using `cv.glmnet()` instead. Due in part to randomness in cross validation, and differences in how `cv.glmnet()` and `train()` search for $\lambda$, the results are slightly different.
```{r}
set.seed(42)
X = model.matrix(Salary ~ . ^ 2, Hitters)[, -1]
y = Hitters$Salary
fit_lasso_cv = cv.glmnet(X, y, alpha = 1)
sqrt(fit_lasso_cv$cvm[fit_lasso_cv$lambda == fit_lasso_cv$lambda.min]) # CV-RMSE minimum
```
The commented line is not run, since it produces a lot of output, but if run, it will show that the fast majority of the coefficients are zero! Also, you'll notice that `cv.glmnet()` does not respect the usual predictor hierarchy. Not a problem for prediction, but a massive interpretation issue!
```{r}
#coef(fit_lasso_cv)
sum(coef(fit_lasso_cv) != 0)
sum(coef(fit_lasso_cv) == 0)
```
## Classification
Above, we have performed a regression task. But like lasso and ridge, elastic net can also be used for classification by using the deviance instead of the residual sum of squares. This essentially happens automatically in `caret` if the response variable is a factor.
We'll test this using the familiar `Default` dataset, which we first test-train split.
```{r}
data(Default, package = "ISLR")
```
```{r}
set.seed(42)
default_idx = createDataPartition(Default$default, p = 0.75, list = FALSE)
default_trn = Default[default_idx, ]
default_tst = Default[-default_idx, ]
```
We then fit an elastic net with a default tuning grid.
```{r}
def_elnet = train(
default ~ ., data = default_trn,
method = "glmnet",
trControl = cv_5
)
def_elnet
```
Since the best model used $\alpha = 1$, this is a lasso model.
We also try an expanded feature space, and a larger tuning grid.
```{r}
def_elnet_int = train(
default ~ . ^ 2, data = default_trn,
method = "glmnet",
trControl = cv_5,
tuneLength = 10
)
```
Since the result here will return 100 models, we again use are helper function to simply extract the best result.
```{r}
get_best_result(def_elnet_int)
```
Here we see $\alpha = 0.1$, which is a mix, but close to ridge.
```{r}
calc_acc = function(actual, predicted) {
mean(actual == predicted)
}
```
Evaluating the test accuracy of this model, we obtain one of the highest accuracies for this dataset of all methods we have tried.
```{r}
# test acc
calc_acc(actual = default_tst$default,
predicted = predict(def_elnet_int, newdata = default_tst))
```
## External Links
- [`glmnet` Web Vingette](https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html) - Details from the package developers.
- [`glmnet` with `caret`](https://github.com/topepo/caret/issues/116) - Some details on Elastic Net tuning in the `caret` package.
## `rmarkdown`
The `rmarkdown` file for this chapter can be found [**here**](25-elnet.Rmd). The file was created using `R` version `r paste0(version$major, "." ,version$minor)`. The following packages (and their dependencies) were loaded when knitting this file:
```{r, echo = FALSE}
names(sessionInfo()$otherPkgs)
```