---
title: "Mixed Model Estimation, <br> a Connection to Additive Models,<br>and Beyond..."
author: |
  | Michael Clark
  | Statistician Lead
  | CSCAR, ARC, U of Michigan
date: '`r Sys.Date()`'
output:
  html_document:
    css: ../other.css
    theme: cosmo
    toc: yes
    toc_float: yes
    highlight: pygments
references:
- author:
  - family: Wood
    given: Simon
  id: Wood
  issued:
    year: 2006
  title: Generalized Additive Models.
- author:
  - family: Fahrmeir
    given: Ludwig
  - family: Kneib
    given: Thomas
  - family: Lang
    given: Stefan
  - family: Marx
    given: Brian
  id: Fahrmeir
  issued:
    year: 2013
  title: Regression.
always_allow_html: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE)
library(magrittr)   # provides the %>% pipe used in later chunks
```
# Introduction
The goal is to demonstrate mixed model estimation, express the tie between generalized additive models and mixed models, and open the door to further modeling extensions. The following is based on the @Wood text on additive models, chapter 6 in particular. It assumes familiarity with standard regression from a matrix perspective and at least passing familiarity with mixed models.
<br>
<br>
# Model Formulation
We can start with a standard linear model (SLiM) expressed as follows:
$$\mathbf{y} = \mathbf{Xb} + \mathbf{\epsilon} $$
Here $\mathbf{y}$ is the target variable, $\mathbf{X}$ is a <span title="first column representing the intercept, the rest are the covariates of interest" style="color:#dd4814">model matrix</span>, $\mathbf{b}$ are the coefficients, and $\mathbf{\epsilon}$ is the error. For cleaner presentation (and less background coding on my part), note that beyond this point I'll largely refrain from using bold to indicate vectors/matrices, or using subscripts for every *i*<sup>th</sup> observation. Let's just assume we are in a typical data situation involving multiple observations, a univariate target variable ($y$), a (design) matrix of predictor variables ($X$), etc. Hopefully the context and code will make things very clear.
Now consider a data situation in which observations are clustered, and so non-independent. For example, we might have data regarding students within schools, and we wish to account for the fact that observations of students within a particular school are likely correlated. We can do this within a mixed model framework. For a mixed model with a single random effect for some grouping factor (e.g. school), the SLiM extends to:
$$y = Xb + Zg + \epsilon$$
where $Z$ is a design matrix pertaining to the grouping structure. Consider a factor $z$ representing group/cluster membership; dummy coding would convert $z$ to the following:
```{r dummycode, echo=FALSE}
z = Z = factor(c('A','A','B','B','C','C'))
Z = model.matrix(~ Z-1)
pander::pander(data.frame(z, Z))
```
The coefficients $g$ are the random effects, assumed $\mathcal{N}(0,\tau^2)$, i.e. normally distributed with zero mean and standard deviation $\tau$. While we are often interested in these specific effects, they do not have to be estimated directly for modeling purposes.
$$y = Xb + Zg + \epsilon$$
$$g \sim \mathcal{N}(0, \psi_\theta)$$
$$\epsilon \sim \mathcal{N}(0, \Lambda\sigma^2)$$
In this depiction, $\psi_\theta$ can reflect some more interesting dependencies, but in the simple case of a random intercepts model it can be a single variance estimate $\tau^2$. $\Lambda$ can be used to model residual covariance but often is just the identity matrix, with the underlying assumption of constant variance $\sigma^2$ across observations.
We can combine the random effects and residuals into a single construct reflecting the covariance structure of the observations:
$$ e = Zg + \epsilon $$
This makes $\mathbf{e}$ a multivariate normal vector with mean 0 and covariance ($I$ is the identity matrix):
$$Z\psi_{\theta}Z^\intercal + I\sigma^2$$
This puts us back to a standard linear model:
$$ y = Xb + e$$
$$e \sim \mathcal{N}(0, \Sigma_\theta\sigma^2)$$
where $\Sigma_\theta\ = \frac{Z\psi_{\theta}Z^\intercal}{\sigma^2} + I$.
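
To make this covariance structure concrete, here is a minimal sketch using the toy $Z$ from above and some assumed (purely hypothetical) variance values. The result is block diagonal: observations in the same cluster share covariance $\tau^2$, and the diagonal is $\tau^2 + \sigma^2$.

```{r toyCovariance, cache=TRUE}
# minimal sketch; tau2 and sigma2 are assumed values, not estimates
tau2   = 4
sigma2 = 1
V      = tcrossprod(Z)*tau2 + diag(nrow(Z))*sigma2   # Z psi Z' + I sigma^2
Sigma  = V/sigma2                                    # Sigma_theta as defined above
V
```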
<br>
<br>
# Maximum Likelihood Estimation
Given where we are now, we can proceed to estimate a mixed model via maximum likelihood. For this we'll use the sleepstudy data from the <span style="color:#1e90ff">lme4</span> package. The data contain reaction times for 18 individuals over 10 days each (see the help file for the sleepstudy object for more details).
## Data
```{r dataSetup, cache=TRUE}
data(sleepstudy, package='lme4')
X = model.matrix(~Days, sleepstudy)
Z = model.matrix(~factor(sleepstudy$Subject)-1)
colnames(Z) = paste0('Subject_', levels(sleepstudy$Subject))  # for cleaner presentation later; levels match the dummy column order
rownames(Z) = paste0('Subject_', sleepstudy$Subject)
y = sleepstudy$Reaction
```
## ML function
The following is based on the code in Wood (6.2.2), with a couple of modifications for consistent nomenclature and presentation. We use <span style="color:#8b0000">optim</span> and a minimizing function, in this case the negative log likelihood, to estimate the parameters of interest, collectively $\theta$ in the code below. The standard deviations (i.e. the square roots of the variances) are estimated on the log scale. As in Wood (6.2.3), the 'fixed effects' for the intercept and Days are simply extracted from an <span style="color:#8b0000">lm</span> fit, and we'll do the same.
```{r mlfunc, cache=TRUE}
llMixed = function(y, X, Z, theta){
  tau   = exp(theta[1])
  sigma = exp(theta[2])
  n     = length(y)

  # evaluate covariance matrix for y
  e = tcrossprod(Z)*tau^2 + diag(n)*sigma^2
  L = chol(e)  # L'L = e

  # transform dependent linear model to independent
  y = backsolve(L, y, transpose=TRUE)
  X = backsolve(L, X, transpose=TRUE)
  b = coef(lm(y~X-1))
  LP = X %*% b

  ll = -n/2*log(2*pi) - sum(log(diag(L))) - crossprod(y-LP)/2
  -ll
}
```
Here is an alternative function using a multivariate normal approach that doesn't use the transformation to independent, and might provide additional perspective.
```{r mlfuncMV, cache=TRUE}
llMixedMV = function(y, X, Z, theta){
  tau   = exp(theta[1])
  sigma = exp(theta[2])
  n     = length(y)

  # evaluate covariance matrix for y
  e  = tcrossprod(Z)*tau^2 + diag(n)*sigma^2
  b  = coef(lm.fit(X, y))
  mu = X %*% b

  # return the negative log likelihood under the multivariate normal
  -mvtnorm::dmvnorm(y, mean=drop(mu), sigma=e, log=TRUE)
}
```
## Results
Now we're ready to use the <span style="color:#8b0000">optim</span> function for estimation. A slight change to the default tolerance is included to get estimates closer to those of <span style="color:#1e90ff">lme4</span>, which we will compare shortly.
```{r optim, cache=TRUE}
paramInit = c(0, 0)
names(paramInit) = c('tau', 'sigma')

modelResults   = optim(par=paramInit, fn=llMixed,   X=X, y=y, Z=Z, control=list(reltol=1e-10))
modelResultsMV = optim(par=paramInit, fn=llMixedMV, X=X, y=y, Z=Z, control=list(reltol=1e-10))

rbind(c(exp(modelResults$par),   negLogLik = modelResults$value,   coef(lm(y~X-1))),
      c(exp(modelResultsMV$par), negLogLik = modelResultsMV$value, coef(lm(y~X-1)))) %>%
  round(2)
```
As we can see, both formulations produce identical results. We can now compare those results to the <span style="color:#1e90ff">lme4</span> output for the same model, and see that we're getting what we should.
```{r lme, cache=TRUE}
library(lme4)
lmeMod = lmer(Reaction ~ Days + (1|Subject), sleepstudy, REML=FALSE)
lmeMod
```
We can also predict the random effects (Wood, 6.2.4), and after doing so again compare the results to the <span style="color:#1e90ff">lme4</span> estimates.
```{r estRanEf, cache=TRUE}
tau     = exp(modelResults$par)[1]
tausq   = tau^2
sigma   = exp(modelResults$par)[2]
sigmasq = sigma^2

# predicted random effects: tau^2 * Z' Sigma^{-1} (y - Xb) / sigma^2
Sigma = tcrossprod(Z)*tausq/sigmasq + diag(length(y))
ranefEstimated = tausq * t(Z) %*% solve(Sigma) %*% resid(lm(y~X-1)) / sigmasq

data.frame(ranefEstimated, lme4 = ranef(lmeMod)$Subject[[1]]) %>% round(2)
```
## Issues with ML estimation
Situations arise in which maximum likelihood for mixed models results in notably biased variance estimates (e.g. small N, many fixed effects), and so it is typically not the default. Standard software usually uses *restricted* maximum likelihood (REML) instead. However, our purpose here has been served, so we will not dwell further on mixed model estimation.
## Link with penalized regression
A link exists between mixed models and a penalized likelihood approach. For a penalized approach with the SLiM, the objective function we want to minimize can be expressed as follows:
$$ \lVert y - X\beta \rVert^2 + \lambda\beta^\intercal\beta $$
The added component to the sum of the squared residuals is the penalty, with $\lambda$ controlling its strength. By penalizing the sum of the squared coefficients, we keep them from getting too big, and this helps to avoid <span title="and in high dimensional settings it makes estimation feasible at all" style="color:#dd4814">overfitting</span>. Another interesting aspect of this approach is that it is equivalent to placing a <span title="normal with zero mean and some constant variance" style="color:#dd4814">specific prior</span> on the coefficients in a Bayesian framework.
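
As a small illustration, here is a minimal sketch of the ridge estimate in closed form, using the sleepstudy $X$ and $y$ from above; the $\lambda$ values are arbitrary and chosen only for illustration (note that, matching the formula as written, the intercept is penalized too).

```{r ridgeSketch, cache=TRUE}
# minimal ridge sketch; lambda values are arbitrary
ridgeCoef = function(X, y, lambda) drop(solve(crossprod(X) + lambda*diag(ncol(X)), crossprod(X, y)))

round(cbind(ols        = coef(lm(y ~ X - 1)),
            ridge_10   = ridgeCoef(X, y, 10),
            ridge_1000 = ridgeCoef(X, y, 1000)), 2)
```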
We can now see mixed models as a penalized technique. If we knew $\sigma$ and $\psi_\theta$, then the predicted random effects $g$ and estimates for the fixed effects $\beta$ are those that minimize the following objective function:
$$ \frac{1}{\sigma^2}\lVert y - X\beta - Zg \rVert^2 + g^\intercal\psi_\theta^{-1}g $$
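
As a rough check on this connection, the following sketch minimizes that objective directly for the sleepstudy model via a simple data-augmentation trick, plugging in the $\hat\tau$ and $\hat\sigma$ estimated earlier (it assumes the previous chunks have been run). The resulting $g$ should closely match the random effects predicted before.

```{r penalizedMixedSketch, cache=TRUE}
# sketch only: reuses X, Z, y, tau, sigma, and ranefEstimated from earlier chunks
# augmenting the data with (sigma/tau) * I rows for g imposes the penalty (sigma^2/tau^2) * g'g
Xaug = rbind(cbind(X, Z),
             cbind(matrix(0, ncol(Z), ncol(X)), diag(ncol(Z)) * sigma/tau))
yaug = c(y, rep(0, ncol(Z)))
coefPen = coef(lm(yaug ~ Xaug - 1))

# compare the penalized solution for g to the earlier predictions
data.frame(penalized = coefPen[-(1:ncol(X))], earlier = ranefEstimated[,1]) %>% round(2)
```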
We'll take this penalized approach explicitly with the generalized additive model below.
<br>
<br>
# Additive model as a mixed model
At this point I'd like to demonstrate some concepts from section 6.6 in Wood. Conceptually, the take home idea is that an additive model, or generalized additive model (GAM), can be seen as a mixed model, which is interesting in and of itself (at least to me), but it also means that GAMs meld nicely with mixed models to form a more general modeling framework. For an introduction to additive models, one can see my [document](https://sites.google.com/a/umich.edu/micl/miscfiles/GAMS.pdf), which is more or less an overview of Wood's text.
## Data set up
The <span title="See Wood 3.2." style="color:#dd4814">data</span> regards motor engine size and wear in 19 Volvos, with the initial assumption that larger capacity engines will wear out less quickly.
```{r gamData, cache=TRUE}
size = c(1.42,1.58,1.78,1.99,1.99,1.99,2.13,2.13,2.13,2.32,2.32,2.32,2.32,2.32,2.43,2.43,2.78,2.98,2.98)
wear = c(4.0,4.2,2.5,2.6,2.8,2.4,3.2,2.4,2.6,4.8,2.9,3.8,3.0,2.7,3.1,3.3,3.0,2.8,1.7)
x = size - min(size)
x = x / max(x)
d = data.frame(wear, x)
```
## Relevant functions
We'll create functions for the cubic spline basis, the model matrix, the penalty matrix, a matrix square root, and finally the fitting function.
```{r gamFuncs, cache=TRUE}
# function for cubic spline on [0,1]; requires data and points within domain for knots
rk <- function(x, z) {
  ((z-0.5)^2 - 1/12) * ((x-0.5)^2 - 1/12) / 4 -
    ((abs(x-z)-0.5)^4 - (abs(x-z)-0.5)^2/2 + 7/240) / 24
}

# create the model matrix
splineX <- function(x, knots) {
  q <- length(knots) + 2                  # number of parameters
  n <- length(x)                          # number of observations
  X <- matrix(1, n, q)                    # initialized model matrix
  X[, 2]   <- x                           # set second column to x
  X[, 3:q] <- outer(x, knots, FUN = rk)   # remaining columns to cubic spline basis
  X
}

# set up the regression spline penalty matrix, given knot sequence knots
Sfunc = function(knots){
  q = length(knots) + 2
  S = matrix(0, q, q)                     # initialize
  S[3:q, 3:q] = outer(knots, knots, FUN = rk)
  S
}

# matrix square root function: returns B such that B'B = S
matSqrt = function(S){
  UDU = eigen(S, symmetric=TRUE)
  U = UDU$vectors
  D = diag(pmax(UDU$values, 0))           # guard against tiny negative eigenvalues
  B = U %*% sqrt(D) %*% t(U)              # symmetric square root
  B
}

# the fitting function with lambda smoothing parameter
prsFit <- function(y, x, knots, lambda) {
  q  = length(knots) + 2                  # dimension of basis
  n  = length(x)                          # number of observations
  Xa = rbind(splineX(x, knots),
             matSqrt(Sfunc(knots)) * sqrt(lambda))   # augmented model matrix
  y[(n + 1):(n + q)] = 0                  # augment the data vector
  lm(y ~ Xa - 1)                          # fit and return penalized regression spline
}
```
## Fit a penalized model
Now we can fit the model and visualize. There does not seem to be a straightforward relationship between engine size and engine wear.
```{r fitCSgam, cache=TRUE, fig.align='center'}
x_knots = 1:7/8 # choose some knots
mod = prsFit(y=wear, x=x, knots=x_knots, lambda=.0001) # fit the penalized spline
x_pred = 0:100/100 # values for prediction
Xp = splineX(x_pred, x_knots) # create design matrix
predictions = Xp %*% coef(mod)
plot(x, wear, xlab='Scaled Engine size', ylab='Wear Index', pch=19,
col="#dd4814", cex=.75, col.axis='gray50', bty='n')
lines(x_pred, predictions, col='#1e90ff')
```
## As a mixed model
One can use the result of eigen decomposition on the penalty matrix to ultimately produce a re-parameterization of the original matrix as fixed and random effects components.
```{r gam2mixed, cache=TRUE}
S = Sfunc(x_knots)
init = eigen(S)
U = init$vectors
D = diag(init$values)

poseigen = which(diag(D) > 0)
Dpos = D[poseigen, poseigen]        # smallest submatrix containing all positive eigenvalues

Xf  = splineX(x, knots = x_knots)   # spline model matrix
U_F = U[, (ncol(U)-1):ncol(U)]      # partition eigenvector matrix: null space of the penalty
U_R = U[, 1:(ncol(U)-ncol(U_F))]
X_F = Xf %*% U_F                    # fixed part with B_F coef to be estimated (not penalized)
X_R = Xf %*% U_R                    # random part with B_R random effects
Z   = X_R %*% solve(sqrt(Dpos))     # rescaled random part, so the random effects are i.i.d.
```
The above operations have effectively split the GAM into fixed and random parts:
$$\mathbf{X}_F\mathbf{\beta}_F + \mathbf{X}_R\mathbf{b}_R$$
$$\mathbf{b}_R\sim \mathcal{N}(\mathbf{0}, \mathbf{D_+^{-1}}/\lambda)$$
Here $\lambda$ is the smoothing parameter, which controls the amount of smoothing. For a penalized spline, the loss function is:
$$\lVert y - X\beta \rVert^2 + \lambda\beta^{\intercal}S\beta ,$$
with the second part the added penalty. As $\lambda \rightarrow \infty$ we essentially get a straight-line fit, while a $\lambda$ of 0 is the same as an unpenalized fit.
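
To see this, here is a quick sketch reusing the objects above and comparing a heavy and a near-zero penalty; the specific $\lambda$ values are arbitrary.

```{r lambdaCompare, cache=TRUE, fig.align='center'}
# arbitrary lambda values chosen only to illustrate the extremes
modSmooth = prsFit(y=wear, x=x, knots=x_knots, lambda=10)     # heavy penalty -> near-linear fit
modRough  = prsFit(y=wear, x=x, knots=x_knots, lambda=1e-8)   # near-zero penalty -> wigglier fit
plot(x, wear, xlab='Scaled Engine size', ylab='Wear Index', pch=19,
     col="#dd4814", cex=.75, col.axis='gray50', bty='n')
lines(x_pred, splineX(x_pred, x_knots) %*% coef(modSmooth), col='#1e90ff', lty=2)
lines(x_pred, splineX(x_pred, x_knots) %*% coef(modRough),  col='#1e90ff')
```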
The $Z$ in the code above plays the role of the mixed model $Z$ we had before in the standard setting. We can now represent our model as follows:
$$\mathbf{X}_F\mathbf{\beta}_F + \mathbf{Zg}$$
$$\mathbf{g} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}/\lambda) $$
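
As a rough check, the following sketch refits the same penalized spline in this fixed-plus-random form, with the same $\lambda$ and data-augmentation trick as before. If the re-parameterization is doing what we claim, the fitted values should closely match those from <span style="color:#8b0000">prsFit</span> (up to numerical error).

```{r gamReparamCheck, cache=TRUE}
# sketch: refit the penalized spline as fixed effects (X_F) plus i.i.d. random effects (Z)
lambda = .0001
Xaug2 = rbind(cbind(X_F, Z),
              cbind(matrix(0, ncol(Z), ncol(X_F)), diag(ncol(Z)) * sqrt(lambda)))
yaug2 = c(wear, rep(0, ncol(Z)))
modReparam = lm(yaug2 ~ Xaug2 - 1)

# fitted values at the data points, compared to the earlier penalized regression spline
round(cbind(reparameterized = fitted(modReparam)[1:length(wear)],
            original        = fitted(mod)[1:length(wear)]), 3) %>% head
```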
To fit a GAMM, i.e. a *generalized additive mixed model*, the $X_F$ above is simply appended to the 'fixed' effects design matrix, while $Z$ is added to the random effects component, and estimation proceeds as it would for any other mixed model.
The <span style="color:#1e90ff">mgcv</span> and associated packages allow us to see this explicitly. Initially we'll just duplicate a standard mixed model using <span style="color:#1e90ff">mgcv</span> and <span style="color:#1e90ff">gamm4</span> packages. One can see that the gamm4 is using lme4 and produces the same output.
```{r gamSleepStudy, message=FALSE, cache=TRUE}
library(mgcv); library(gamm4)
modGam = gamm4(Reaction ~ Days, random=~(1|Subject), data=sleepstudy)
summary(modGam$mer)
```
Now we'll add a cubic spline for the effect of Days, and we can see the smooth term listed as part of the random effects results.
```{r gammSleepStudy, cache=TRUE, fig.align='center', results='hold', fig.show='hold'}
modGamS = gamm4(Reaction ~ s(Days, bs='cs'), random=~(1|Subject), data=sleepstudy)
summary(modGamS$mer)
# summary(modGamS$gam)
plot(modGamS$gam)
```
<br>
<br>
# Wrap Up
So we've seen that a GAM can be decomposed into fixed and random effects components, just as in mixed models, which means we can use smooth terms within the mixed model framework. Another way to look at this is that we have added the capacity to examine nonlinear relationships and other covariance structures to the standard mixed model framework, making a very flexible modeling approach even more so.
## It's full of STARs...
However, this goes even further. **St**ructured **A**dditive **R**egression (STAR) models build upon the notion just demonstrated (nicely illustrated in @Fahrmeir). Not only can we combine mixed and additive models, but other model classes as well. These include interactions and varying coefficient models, spatial effects, Gaussian processes/kriging, Markov random fields, and others. Each model class has a particular design and penalty matrix, as we saw in the examples previously, but otherwise can be added to the current model being considered. We can also use other techniques, like <span title="which itself can be seen as an additive modeling approach!" style="color:#dd4814">boosting</span>, to estimate them. Standard regression, the GLM, the GAM, etc. are thus special cases of the STAR model. In short, we now have a highly flexible modeling environment within which to tackle the questions we seek to answer with our data.
```{r diag, echo=FALSE, fig.align='center'}
pal = RColorBrewer::brewer.pal(4, 'Accent')
DiagrammeR::grViz("
digraph boxes_and_circles {
# graph statement
graph [layout = dot,
rankdir = TB]
# node statements
node [fontname = Helvetica,
fontcolor = white,
width = .5,
style = filled]
node [shape = circle,
color = '#ff5503'] // sets as circles
STAR
node [shape = box,
color = '#1f78b4']
Others
node [color = '#e31a1c']
GAMM; Mixed; GAM
node [color = '#33a02c']
GLM; SLiM
subgraph {
rank = same; Mixed; GAM;
}
# edge statements
STAR -> GAMM [arrowhead=dot, color=gray]
STAR -> Others [arrowhead=dot, color=gray]
GAMM -> Mixed [arrowhead=dot, color=gray]
GAMM -> GAM [arrowhead=dot, color=gray]
GAM -> GLM [arrowhead=dot, color=gray]
Mixed -> GLM [arrowhead=dot, color=gray]
GLM -> SLiM [arrowhead=dot, color=gray]
GAM -> Mixed [arrowhead=none, color=gray]
}
", width=900)
```
# References