---
title: "Linear Mixed Models (LMMs) - Model Comparisons"
author: "Joshua F. Wiley"
date: "`r Sys.Date()`"
output:
  tufte::tufte_html:
    toc: true
    number_sections: true
---
Download the raw `R` markdown code here
[https://jwiley.github.io/MonashHonoursStatistics/LMM_Comparison.rmd](https://jwiley.github.io/MonashHonoursStatistics/LMM_Comparison.rmd).
These are the `R` packages we will use.
```{r setup}
options(digits = 5)
library(data.table)
library(JWileymisc)
library(extraoperators)
library(lme4)
library(lmerTest)
library(multilevelTools)
library(visreg)
library(ggplot2)
library(ggpubr)
library(haven)
## load data collection exercise data
## Merged.sav is a merged long dataset of baseline and daily data
dm <- as.data.table(read_sav("Merged.sav"))
```
# Comparing Models
For many statistical models, including LMMs, it can be informative to
compare different models. Model comparisons can be used in many
different ways. Here are some examples, although they are not
meant to be exhaustive.
* Evaluate which of two (or more) models provides the best fit to the
data
* Evaluate / compare how the results for a particular predictor(s) of
interest change across two (or more) models
* Calculate the significance of multiple predictors
* Calculate effect sizes for one or multiple predictors
We will look at examples of the different uses of model comparisons in
this topic. In fact, we already saw one example in the topic on
Moderation where we compared the results, by eye, from two models that
differed only in whether some extreme values were included or
excluded. This is a type of model comparison, and what we learned in
that example was that the results of the model
**were not sensitive to the two extreme values**.
Just as there are many different kinds of models we can fit, even with
LMMs (e.g., with or without random slopes, etc.), so too there are many
different kinds of, and purposes for, model comparisons.
To begin with, let's imagine we are just trying to compare two models:
Model A and Model B. We can broadly classify the type of comparison
based on whether Model A and B are nested or non-nested models. We
will talk about what these mean next.
## Nested Models
Models are considered nested when one model is a restricted or
constrained version of another model. For example, suppose that
**Model A** predicts mood from stress and age whereas
**Model B** predicts mood from age only. Written as a formula,
these could be:
$$
ModelA: mood_{ij} = b_{0j} + b_{1} * age_j + b_{2} * stress_{ij} + \varepsilon_{ij}
$$
and
$$
ModelB: mood_{ij} = b_{0j} + b_{1} * age_j + 0 * stress_{ij} + \varepsilon_{ij}
$$
In **Model B**, I purposely used $0 * stress_{ij}$ to highlight that
when a predictor is left out of a model, it is the same as fixing
(sometimes also called constraining) the coefficient ($b_{2}$ in this
case) to be exactly 0. In this case, we would say that
**Model B** is nested within **Model A**. In other words,
**Model A** contains every predictor and parameter in **Model B** plus
more.
This idea is similar to the idea of nested data used in LMMs, but the
difference is that we are not talking about observations or
data; rather, we are talking about the parameters of a model.
To summarize, briefly, we say that **Model B** is *nested* within
**Model A** if:
* setting one or more parameters in **Model A** to 0 yields the same
model as **Model B**.
* both models use the same data, have the same number of observations,
etc. Even if the same dataset is used in `R`, this does not
guarantee the same data are used because if additional predictors
are included in **Model A** and these extra predictors have some
missing data, by default `R` would drop those cases, resulting in
fewer observations being used for **Model A** than for **Model B**.
If two models are nested, then we have the most options in terms of
comparing the two models. For example, we can evaluate whether
**Model A** is a statistically significantly better fit than is
**Model B** using a Likelihood Ratio Test (LRT)^[https://en.wikipedia.org/wiki/Likelihood-ratio_test].
We can compare the fit of each model and use the difference in fit to
derive effect sizes. We also can attribute any differences in the two
models to the parameter(s) that have been constrained to 0 in **Model
B** from **Model A**.
A simple definition of the LRT test statistic, $\lambda$, is based on
two times the difference in the log likelihoods.
$$
\lambda = -2 * (LL_B - LL_A)
$$
You may wonder why the likelihood *ratio* test can be conducted by
taking the difference in the log likelihoods. It is because the log of
a ratio is the same as the difference in the logs of the numerator and
denominator.
$$
log_{e}\left(\frac{6}{2}\right) = log_{e}(6) - log_{e}(2)
$$
which we can confirm is true in `R`:
```{r}
## log of ratio
log(6/2)
## difference in logs
log(6) - log(2)
```
If the null hypothesis of no difference is true in the population,
then $\lambda$ will follow a chi-squared distribution with degrees of
freedom equal to the number of parameters constrained to 0 in **Model
B** relative to **Model A** (i.e., the difference in the degrees of
freedom used by each model), that is:
$$
\lambda \sim \chi^2(DF_A - DF_B)
$$
Thus we often use a chi-square distribution in the LRT to look up the
p-value. Finally, note that because LRTs are based on the log
likelihoods (LL) from a model, we need true log likelihoods for the
LRT to be valid. Therefore, we cannot use restricted maximum
likelihood (REML); we need to use maximum likelihood (ML) estimation.
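Note that if a model has already been fit with REML (the default in
`lmer()`), it can be refit with ML before computing a LRT. Below is a
minimal sketch (the object names are just for illustration):
```{r}
## sketch: a model fit with the REML default, then refit with ML
m_reml <- lmer(mood ~ age + (1 | ID), data = dm)  # REML = TRUE by default
m_ml <- refitML(m_reml)                            # same model, ML estimates
logLik(m_reml)  # REML criterion, not a true log likelihood
logLik(m_ml)    # ML log likelihood, usable for a LRT
```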
### Nested Models in `R`
To see the idea of nested models and the LRT in action, let's examine
a concrete example in `R`. Here are two LMMs corresponding to the
**Model A** and **Model B** formulas we wrote previously. We can see in
the `R` code that the models are nested; the only difference is
`stress`. We set `REML = FALSE` to get maximum likelihood estimates so
that we get true log likelihoods. We also need to confirm that the two
models are based on the same number of observations. We can extract
just this information in `R` using the `nobs()` function. This lets
us confirm that the two models are fitted to the same data. For
example, if stress had some missing data on rows that were not missing
age or mood, **Model A** would be based on fewer observations than
**Model B**.
```{r}
modela <- lmer(mood ~ age + stress + (1 | ID), data = dm, REML = FALSE)
modelb <- lmer(mood ~ age + (1 | ID), data = dm, REML = FALSE)
nobs(modela)
nobs(modelb)
```
In this case, we can see that the numbers of observations are
identical. Now we can find the log likelihoods of both models by using
the `logLik()` function.
```{r}
logLik(modela)
logLik(modelb)
```
Now we have the log likelihoods (LL) of each model and the degrees of
freedom from both models. From this, we can calculate $\lambda$ and
then look up the p-value for a chi-square distribution with $\lambda$
and 1 degree of freedom (1 is the difference in degrees of freedom
between Model A and B). To get the p-value from a chi-square, we use
the `pchisq()` function.
```{r}
## lambda
-2 * (-290.86 - -281.71)
## p-value from a chi-square
pchisq(18.3, df = 1, lower.tail = FALSE)
```
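The same values can be pulled directly from the fitted model objects
rather than typed in by hand; a small sketch:
```{r}
## lambda computed directly from the two fitted models
lambda <- as.numeric(-2 * (logLik(modelb) - logLik(modela)))
lambda
## degrees of freedom for the test: the difference in model degrees of freedom
df_diff <- attr(logLik(modela), "df") - attr(logLik(modelb), "df")
## p-value from the chi-square distribution
pchisq(lambda, df = df_diff, lower.tail = FALSE)
```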
In practice, we do not need to do any of these steps by hand; we can
get a test of two nested models in `R` using the `anova()` function
(which in this case is not really analyzing variance). We use the
`anova()` function with `test = "LRT"` to get a likelihood ratio
test.
```{r}
anova(modela, modelb, test = "LRT")
```
If you compare the chi-square value, degrees of freedom and p-value,
you'll see that they basically match what we calculated by hand. The
small differences are due to rounding (`R` will use more decimals of
precision whereas we only used two decimals).
In this simple case, we are only testing a single parameter, the fixed
regression coefficient for stress, because that is the only parameter
that differs between **Model A** and **Model B**. Thus in this case,
it would be easier to rely on the t-test we get from a summary of the
model.
```{r}
summary(modela)
```
In this instance, the LRT and the t-test yield equivalent
p-values, 1.9e-05; however, this does not have to be the case. The LRT
is based on slightly different theory than the t-test, and the t-test
uses approximate degrees of freedom for the t-distribution based on the
Satterthwaite method, whereas the LRT does not directly incorporate
the sample size in the same way. The two methods are
asymptotically equivalent (at very large sample sizes) but can
differ, particularly in smaller samples.
In practice, we wouldn't usually use a LRT to evaluate whether a
single model parameter is statistically significant. The benefit of
(nested) model comparisons is that it allows us to compare two
models. Those models can be quite different.
Here are another two models; this time they differ by two predictors,
energy and female. The LRT now tests whether the addition of energy
and female results in significantly better fit for **Model A** than
**Model B**.
```{r}
modela <- lmer(mood ~ age + stress + energy + female + (1 | ID),
data = dm, REML = FALSE)
modelb <- lmer(mood ~ age + stress + (1 | ID), data = dm,
REML = FALSE)
nobs(modela)
nobs(modelb)
anova(modela, modelb, test = "LRT")
```
In this case, we can see from the significant p-value that **Model A**
is a significantly better fit to the data than is **Model B**.
Note that LRTs are only appropriate when the models are nested. We
cannot use LRTs for non-nested models.
In nested models, the more complex model is often called the "Full"
model and the simpler model the "Reduced" or "Restricted" version of
the Full model.
## Non Nested Models
Models are considered non nested when one model is not strictly a
constrained version of a more complex model. For example, suppose that
**Model A** predicts mood from stress and age whereas
**Model B** predicts mood from age and female. Written as a formula,
these could be:
$$
ModelA: mood_{ij} = b_{0j} + b_{1} * age_j + b_{2} * stress_{ij} + \varepsilon_{ij}
$$
and
$$
ModelB: mood_{ij} = b_{0j} + b_{1} * age_j + b_{3} * female_j + 0 * stress_{ij} + \varepsilon_{ij}
$$
Although **Model B** includes age, which also is present in **Model
A**, and restricts stress to 0, it adds another predictor,
female, which is not in **Model A**. Therefore, the two models are
not nested. Even though we can fit both models on the same number of
observations with the same outcome, the models are still not
nested. Although we can still ask `R` to conduct a LRT, this LRT is
not valid. It is shown here to highlight that you, as the analyst,
are responsible for determining whether your two models are nested
and therefore for deciding whether a LRT is an appropriate way to
evaluate whether the models are significantly different from each
other.
```{r}
modela <- lmer(mood ~ age + stress + (1 | ID),
data = dm, REML = FALSE)
modelb <- lmer(mood ~ age + female + (1 | ID), data = dm,
REML = FALSE)
nobs(modela)
nobs(modelb)
## this is NOT appropriate LRT
anova(modela, modelb, test = "LRT")
```
Although we cannot conduct a LRT on non nested models, it *is* still
useful to compare the fit of non-nested models. For example, if one is
a much better fit than another model, that may suggest one set of
predictors is superior or can be used to evaluate competing hypotheses
(e.g., Theory 1 says that stress and family history are the most
important predictors of mood and Theory 2 says that age and sleep are
the best predictors of mood --- these two theories are competing, not
nested versions of each other).
We cannot use LRTs to compare non nested models, but we can use other
measures, including performance measures such as variance explained or
model accuracy and information criteria. We will talk about
information criteria next.
## Information Criteria
Two common information criteria are:
* AIC: the Akaike Information Criterion
(AIC)^[https://en.wikipedia.org/wiki/Akaike_information_criterion]
* BIC: the Bayesian Information Criterion
(BIC)^[https://en.wikipedia.org/wiki/Bayesian_information_criterion]
also sometimes called the Schwarz Information Criterion
Both the AIC and BIC are calculated based primarily on the log
likelihood, LL, of a model. One way of thinking about these
information criteria is to view models as ways of approximating
reality. Suppose you have two models that both
generate approximations (predictions) of reality. The model whose
predictions are closer to the observed data will have a higher LL.
The LL can be used as a *relative* measure of model fit. LL **is not**
an absolute measure of fit. That is, we do not interpret a specific LL
value as indicating "good" fit; it only tells us which of a set of (all
potentially bad) models fits best.
However, there is a limitation with using LL alone. The LL will always
stay the same or increase as additional predictors / parameters are
added to the model. Thus if we use the LL alone, out of a set of
competing models, we would virtually always pick the more complex
models. To address this, we need to incorporate some penalty for the
complexity of a model, so that a more complex model is chosen over a
simpler model as the "best" only if it improves the LL *enough*.
The AIC and BIC are very similar except that they use different
penalties for model complexity (technically they are derived from
rather different theoretical foundations, but for practical purposes
they are similar other than the complexity penalties).
A common way of defining model complexity is based on the number of
estimated model parameters, $k$. For example, consider the following
simple linear regression for $n$ different observations:
$$
\begin{align}
\eta_{i} &= b_0 + b_1 * x_i\\
y_i &\sim \mathcal{N}(\eta_i, \sigma_{\varepsilon})\\
\end{align}
$$
The parameters are:
$$
b_0, b_1, \sigma_{\varepsilon}
$$
so $k = 3$.
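As a quick check on this way of counting, `R` stores the parameter
count it uses for AIC and BIC as the `df` attribute of the log
likelihood. Here is a small sketch using a simple linear regression on
the built-in `mtcars` data (chosen purely for illustration):
```{r}
## simple linear regression: intercept, one slope, and the residual SD
fit <- lm(mpg ~ hp, data = mtcars)
## the "df" attribute of the log likelihood is the parameter count k
attr(logLik(fit), "df")
```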
The equations for AIC and BIC are quite easy to follow and looking at
them helps understand where they are similar and different.
$$
\begin{align}
AIC &= 2 * k - 2 * LL \\
BIC &= log_{e}(n) * k - 2 * LL \\
\end{align}
$$
$n$ is the number of observations included in the model and $LL$ is
the log likelihood, where higher values indicate a better fitting
model. These equations highlight that the only difference between the
AIC and BIC is whether the number of parameters in the model, $k$, is
multiplied by $2$ (AIC) or $log_{e}(n)$ (BIC). Thus, the AIC and BIC
will be identical if:
$$
\begin{align}
log_{e}(n) &= 2 \\
n &= e^2 \approx 7.39 \\
\end{align}
$$
If $n < e^2 \approx 7.39$ then the BIC will have a weaker penalty
based on the number of parameters, $k$, than the AIC.
If $n > e^2 \approx 7.39$ then the BIC will have a stronger penalty
based on the number of parameters, $k$, than the AIC.
Functionally, a stronger penalty on the number of parameters means
that a particular information criterion will tend to favor more
parsimonious (less complex) models. Thus, for all but the tiniest of
sample sizes ($n > e^2 \approx 7.39$), the BIC will have a stronger penalty on
model complexity and so will favor relatively more parsimonious models
than will AIC.
For both AIC and BIC, the *relatively* better model of those compared
is the model with the lower value (i.e., lower = better for both AIC
and BIC).
**There is no "right" or "wrong" information criterion to use.**
People use both the AIC and/or the BIC. If both AIC and BIC suggest
the same model is the "best" there is no ambiguity. Sometimes the AIC
and BIC disagree regarding which model is the best. In these cases one
must pick which information criterion to go with. Both criteria
require that the number of observations, $n$, be larger than the
number of parameters, $k$, for them to operate well.
A benefit of the AIC and BIC is that both can be used to compare non
nested models and they also can be used to compare nested models. It
is relatively common to use AIC and/or BIC to "choose" amongst a set
of possible models.
In `R` we can usually calculate AIC and BIC using the functions,
`AIC()` and `BIC()`. Here we will calculate the AIC and BIC for our
two models.
```{r}
AIC(modela, modelb)
BIC(modela, modelb)
```
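To see that these are the same quantities defined by the equations
above, we can also reconstruct the AIC and BIC by hand from the log
likelihood; a small sketch using `modela` (the results should match
`AIC()` and `BIC()` up to rounding):
```{r}
ll <- logLik(modela)
k <- attr(ll, "df")  # number of estimated parameters
n <- nobs(modela)    # number of observations
## AIC = 2k - 2LL; BIC = log(n)k - 2LL
c(AIC = 2 * k - 2 * as.numeric(ll),
  BIC = log(n) * k - 2 * as.numeric(ll))
```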
In this case, both the AIC and BIC are lower for **Model A** than for
**Model B**, indicating that **Model A** is the better of those two
models. Again, the relative interpretation is important. We cannot
conclude that **Model A** is a "good" model, only that it is better
than **Model B**.
## Model Selection
Integrating what we have learned about LRT for nested models and
AIC/BIC for nested or non nested models, we can compare across several
models to select the best model.
First, we fit a series of models. In all cases, mood is the
outcome. We start with an intercept only model (m0), an unconditional
model, as a baseline reference. This is helpful because all the fit
indices are relative. Then we fit polynomials of stress with degree 1
(linear; m1), degree 2 (quadratic; m2), and degree 3 (cubic; m3).
**If you are not familiar with polynomials, please
see the extra section at the end of this lecture now.**
We fit different degree polynomials of stress using the `poly()`
function in `R`. Finally we fit a competing model, maybe a linear
effect of energy is a better predictor of mood than stress.
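If you want a quick sense of what `poly()` produces, it returns
orthogonal polynomial terms, one column per degree. A tiny
illustrative sketch on a toy sequence:
```{r}
## orthogonal linear and quadratic terms for a toy predictor
head(poly(1:10, 2))
```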
Next, we check that all the observations are identical across
models.
**If the observations were not the same across models, we would need
to create a new dataset that excludes any rows with missing data on
any variable used in any of the models** (a sketch of this follows the
next code chunk). This is a critical step when needed, because LRTs,
AIC, and BIC are only valid if based on the same data.
```{r}
dm[, stress := as.numeric(stress)]
dm[, mood := as.numeric(mood)]
m0 <- lmer(mood ~ 1 + (1 | ID), data = dm, REML = FALSE)
m1 <- lmer(mood ~ poly(stress, 1) + (1 | ID), data = dm, REML = FALSE)
m2 <- lmer(mood ~ poly(stress, 2) + (1 | ID), data = dm, REML = FALSE)
m3 <- lmer(mood ~ poly(stress, 3) + (1 | ID), data = dm, REML = FALSE)
malt <- lmer(mood ~ energy + (1 | ID), data = dm, REML = FALSE)
## check all the observations are the same
nobs(m0)
nobs(m1)
nobs(m2)
nobs(m3)
nobs(malt)
```
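In this example the numbers of observations match, so nothing more is
needed. If they had differed, one option is listwise deletion across
every variable used in any of the candidate models before refitting; a
minimal sketch (the new object `dmc` is hypothetical and the variable
list is taken from the models above):
```{r}
## keep only rows complete on every variable used in any of the models
dmc <- na.omit(dm, cols = c("mood", "stress", "energy", "ID"))
nrow(dmc)
```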
Now we can see which model is the best fit. For the nested models
(m0 - m3) we can use LRTs. Technically, m0 also is nested in malt, so
we can use a LRT for that too. We cannot use a LRT for, say, m1 and
malt as those models are not nested. We can use AIC and BIC for all
models, though. Note that with multiple models, the `anova()` function
is doing sequential LRT tests (e.g., m0 vs m1; m1 vs m2, etc.) not all
compared to one model. If you want two specific models compared,
specify just two.
```{r}
#### LRTs for nested models ####
## two specific comparisons
anova(m0, m3, test = "LRT")
## sequential comparisons
anova(m0, m1, m2, m3, test = "LRT")
anova(m0, malt, test = "LRT")
## AIC and BIC for nested and non nested models
AIC(m0, m1, m2, m3, malt)
BIC(m0, m1, m2, m3, malt)
```
From the LRT we can see that m3 is a significantly better fit than
m0. Looking at the sequential tests, however, m2 is no better than m1
and m3 is no better than m2, which might suggest m1 is the best
model. If we look at the AIC values, for the models with stress, AIC
selects m2 as the best model (by .05 compared to m1, very close). BIC
favors more parsimonious models and selects m1 over m2.
However, both AIC and BIC indicate that the alternate model (malt) is
the best of all the models evaluated, suggesting that linear energy is a
better predictor of mood than is linear, quadratic, or cubic
polynomials of stress.
If we were picking a stress model only, we would pick between the
linear (m1; LRT, BIC) and quadratic (m2; AIC) model, depending which
method we used. If we were willing to pick energy over stress, we
would clearly go with energy.
In addition to testing fixed effects, these same methods can be used
to test random effects. Here we will fit a new model, m4, that includes
a random slope of stress and the correlation between the random
intercept and stress slope. Then we compare the linear stress with
random slope (m4), linear stress without random slope (m1) and
intercept only model (m0).
```{r}
m4 <- lmer(mood ~ stress + (1 + stress | ID), data = dm, REML = FALSE)
## check all the observations are the same
nobs(m0)
nobs(m1)
nobs(m4)
anova(m0, m1, test = "LRT")
anova(m1, m4, test = "LRT")
anova(m0, m4, test = "LRT")
AIC(m0, m1, m4)
BIC(m0, m1, m4)
```
From these tests we can conclude several things. Firstly, adding a
linear trend of stress as a fixed effect is significantly better than
the intercept only model. Secondly, adding a random slope of stress
and the correlation of the random slope and intercept is not
significantly better than the fixed linear effect of stress
model. Thirdly, the model with stress as both a fixed and random slope
is significantly
better than the intercept only model (m4 vs m0 is kind of an omnibus
test of the effect of adding stress as both a fixed and random slope
into the model). However, this multi degree of freedom test masks the
fact that the effect is really just driven by the fixed slope of
stress, not the random slope.
Finally, both AIC and BIC favor m1 over the other models, BIC by
several points, suggesting that m1 is the best balance of complexity
and model fit.
The main point of this final exercise is to show how you can use LRTs
and AIC and BIC to evaluate whether random effects "help" a model
fit.
# Effect Sizes
As with other statistical models, relying solely on statistical
significance is limiting as it ignores the magnitude of an effect.
Simple $R^2$ measures of variance accounted for are problematic for
LMMs; however, several approaches have recently been put forward.
To understand these measures, let's go back to a relatively simple LMM
with a fixed effect predictor and a random intercept only.
$$
y_{ij} = b_{0j} + b_1 * x_j + \varepsilon_{ij}
$$
Recall that we separate the random intercept as:
$$
b_{0j} = \gamma_{00} + u_{0j}
$$
substituting we get:
$$
y_{ij} = \gamma_{00} + u_{0j} + b_1 * x_j + \varepsilon_{ij}
$$
and re-arranging slightly, we can break the model down into
essentially three "parts":
$$
y_{ij} = \overbrace{\gamma_{00} + b_1 * x_j}^\text{fixed} + \underbrace{u_{0j}}_\text{random} + \overbrace{\varepsilon_{ij}}^\text{residual}
$$
For each of these three parts, we can calculate the variance
associated with them. First the variance of the fixed effects portion
of the model:
$$
\sigma^{2}_{fixed} = Var(\gamma_{00} + b_1 * x_j)
$$
Next the variance of the random effects part of the model, here the
random intercept:
$$
u_{0j} \sim \mathcal{N}(0, \sigma^{2}_{u_{0}})
$$
Finally, the variance of the residuals (the unexplained part):
$$
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2}_{\varepsilon})
$$
Nakagawa and Schielzeth
(2013; https://doi.org/10.1111/j.2041-210x.2012.00261.x)
defined two variance explained measures based on these
components as follows:
$$
R^{2}_{LMM(m)} = \frac{\sigma^{2}_{fixed}}{\sigma^{2}_{fixed} +
\sigma^{2}_{u_{0}} + \sigma^{2}_{\varepsilon}}
$$
and
$$
R^{2}_{LMM(c)} = \frac{\sigma^{2}_{fixed} + \sigma^{2}_{u_{0}}}{\sigma^{2}_{fixed} +
\sigma^{2}_{u_{0}} + \sigma^{2}_{\varepsilon}}
$$
As it is difficult to write $R^{2}_{LMM(m)}$, I refer to this as the
Marginal R2. It captures the proportion of the total variance
attributable to the fixed effects portion of the model.
Likewise, $R^{2}_{LMM(c)}$ is referred to as the Conditional R2. It
captures the proportion of the total variance attributable to both the
fixed and random effects portion of the model (i.e., it only excludes
the residual variance).
The Conditional R2 essentially measures the total proportion of
variability explainable by the model. The Marginal R2 measures the
total proportion of variability explainable only by the fixed effects,
the 'average' or marginal effect in the population. Both are useful in
different circumstances. Because the Conditional R2 incorporates
random effects, it is **not** a measure of how much variance we could
predict in new data. For example, if we had built a LMM and then
applied the model to data on a new patient who came to a clinic,
without their data we will not know what their random effects should
be, so we could only apply the average effects, the fixed
effects. Thus, the Marginal R2 can give us a sense of how much
variance we might explain using only the averages from our model and
how much we might explain if we had data on someone new.
The Conditional R2 uses both the fixed and random effects and can tell
us how much variance, essentially in our own sample, our model is able
to explain.
One caveat: it is technically possible for the Marginal R2 to
**decrease** when more predictors are added to the model. This does
not happen with $R^2$ in linear regression, but it can happen with the
Marginal R2, although it generally does not. This is simply a
limitation of how LMMs are estimated.
Also note that if you have an intercept only model, that is a model
with only a fixed and a random intercept, there will be no variability
associated with the fixed effects (since there is only an intercept
and it does not vary) and in this special case, the Conditional R2 is
identical to the ICC.
Although we looked at the formula for Marginal R2 and Conditional R2
from a LMM with only a random intercept and no other random effects,
Johnson (2014; https://dx.doi.org/10.1111/2041-210X.12225) has shown
they can be extended to random slope models. Although additional
random components are included, the interpretations remain the same:
the Marginal R2 is the proportion of total variance attributable to
the fixed effects only, and the Conditional R2 is the proportion of
total variance attributable to the model as a whole, the fixed and
random effects together.
In `R`, we can calculate these measures using the `R2()` function.
To start with, let's look at an intercept only model and compare the
R2 results to the ICC.
```{r}
m0b <- lmer(mood ~ 1 + (1 | ID), data = dm)
summary(m0b)
R2(m0b)
iccMixed("mood", "ID", data = dm)
```
Here we can see that the Marginal R2 is 0, which makes sense as we
have no fixed effects predictors, only the fixed effect intercept,
which does not vary (because it is *fixed*). The Conditional R2 is
equal to the ICC, because it captures the proportion of total variance
attributable to the fixed and random effects, but the fixed effects
variance is 0, so it is just the variance of the random intercept over
the total variance, which is the equation for the ICC.
Now let's look at a more complex example. Here we use the model with a
linear stress trend predicting mood as a fixed effect with a random
intercept.
```{r}
summary(m1)
R2(m1)
```
Here we can see that the fixed effect of stress explains about 10% of
the variance in mood (Marginal R2). The model overall explains about
37% of the variance in mood (Conditional R2).
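As a check on where these numbers come from, the Marginal and
Conditional R2 can be reconstructed by hand from the variance
components defined above; a minimal sketch for `m1` (the values may
differ slightly from `R2()` depending on implementation details):
```{r}
## variance of the fixed-effects part of the linear predictor
sigma2_fixed <- var(as.vector(model.matrix(m1) %*% fixef(m1)))
## random intercept variance and residual variance
sigma2_u0 <- as.numeric(VarCorr(m1)$ID[1, 1])
sigma2_e <- sigma(m1)^2
total <- sigma2_fixed + sigma2_u0 + sigma2_e
c(MarginalR2 = sigma2_fixed / total,
  ConditionalR2 = (sigma2_fixed + sigma2_u0) / total)
```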
We can also calculate effect sizes that are equivalent to Cohen's
$f^2$ effect size. Recall that:
$$
f^2 = \frac{R^2}{1 - R^2}
$$
So we can define
$$
f^2_{LMM(m)} = \frac{R^2_{LMM(m)}}{1 - R^2_{LMM(m)}}
$$
and
$$
f^2_{LMM(c)} = \frac{R^2_{LMM(c)}}{1 - R^2_{LMM(c)}}
$$
which are the Marginal and Conditional F2 effect sizes. We can get all
of these and other model performance measures, like AIC and BIC, using
the `modelPerformance()` function.
```{r}
modelPerformance(m1)
```
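As a quick arithmetic check, the F2 values can be reproduced from the
approximate R2 values reported above for this model:
```{r}
## Marginal and Conditional F2 from the (rounded) R2 values for m1
r2_marginal <- 0.103
r2_conditional <- 0.371
c(f2_marginal = r2_marginal / (1 - r2_marginal),
  f2_conditional = r2_conditional / (1 - r2_conditional))
```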
# Effect Sizes for Differences
Often we are interested in the effect size for a particular model
parameter or a particular variable in our model. For example, we may
want to know the unique effect size for stress predicting
mood. The overall model effect sizes are not the right answer to this
question. We need to calculate how much the model performance
*changed* when just the parameter(s) of interest changed. That is, we
need to compare two, nested models. For these, it is best to use ML,
*not* REML. Here we will fit two models with m.a nested within m.ab.
This way, we can ensure that the only difference between the two
models is the addition of a fixed effect slope of stress.
```{r}
m.ab <- lmer(mood ~ stress + (1 | ID), data = dm, REML = FALSE)
m.a <- lmer(mood ~ 1 + (1 | ID), data = dm, REML = FALSE)
nobs(m.ab)
nobs(m.a)
```
Now we can define a specific effect size:
$$
f^2 = \frac{R^2_{AB} - R^2_{A}}{1 - R^2_{AB}}
$$
For simplicity, I am just writing the formula once, but we can
calculate a Marginal F2 and Conditional F2 by using Marginal R2 or
Conditional R2. The key here is the numerator: we calculate the
unique variance explained as the difference in variance explained
between our two models. If the two models differ
*only* by one parameter/predictor as in this case, then the effect
size is for one specific predictor. However, we can use this to
calculate an effect size for multiple parameters. Later, we will see
some cases where we would care about this.
Let's calculate this now for stress.
```{r}
R2(m.ab)
R2(m.a)
## Marginal F2 for stress
(.10330 - 0) / (1 - .10330)
## Conditional F2 for stress
(.37064 - .31954) / (1 - .37064)
```
Let's extend this. Suppose we added stress as both a fixed and random
slope and wanted to get the *overall* effect size for stress.
```{r}
m.abc <- lmer(mood ~ stress + (1 + stress | ID), data = dm, REML = FALSE)
R2(m.abc)
R2(m.a)
## Marginal F2 for stress OVERALL (fixed + random slope vs intercept only)
(0.09810 - 0) / (1 - 0.09810)
## Conditional F2 for stress OVERALL (fixed + random slope vs intercept only)
(0.40187 - .31954) / (1 - 0.40187)
```
We could also ask how much the effect size changes when we add
stress as a random slope on top of the model with stress as a fixed
effect slope only.
```{r}
R2(m.abc)
R2(m.ab)
## Marginal F2 for random effect of stress
(0.09810 - 0.10330) / (1 - 0.09810)
## Conditional F2 for random effect of stress
(0.40187 - 0.37064) / (1 - 0.40187)
```
In this case, we get the (seemingly odd) negative value for the
Marginal F2 and a small effect size for the Conditional F2.
According to Cohen's (1988) guidelines, which should be treated as
rough rules of thumb only:
* $f^2 \geq 0.02$ is small
* $f^2 \geq 0.15$ is medium
* $f^2 \geq 0.35$ is large

Whether those cut-offs translate well to LMMs and the Marginal and
Conditional F2 (which certainly was not the context Cohen would have
imagined them being applied to) is unclear.
We could also use LRTs to test the overall (fixed + random) or random
only effect of stress as these are nested models.
```{r}
## overall effect (fixed + random)
anova(m.a, m.abc, test = "LRT")
## random effect only
anova(m.ab, m.abc, test = "LRT")
```
In the case of a variable like stress, which has not been decomposed
into a purely between person and a purely within person variable, it
could explain variability from both the random intercept and the
residual variance, even as a fixed effect. Further, because the
Conditional F2 includes *both* the fixed and random variance from the
model, a fixed effect *can* influence the Conditional F2.
The random effect alone, however, would not be expected to be
associated with any improvement in the Marginal F2, so we'd expect the
Marginal F2 for the random effect alone to be about 0, with the random
effect showing up only in the Conditional F2. We can see that overall adding
a fixed and random effect of stress is statistically significant and
that overall (fixed + random slope of stress; p < .001), we get small
to moderate (by Cohen's guidelines, dubiously applied to LMMs) effect
sizes on both the Marginal F2 (.109) and the Conditional F2 (.138).
We can also see that the addition of a random stress slope beyond the
fixed effect stress slope is not statistically significant (p = .18)
and is associated with only a relatively small Conditional F2 (.052).
Note that I am not commenting on the Marginal F2 for the random effect
only, as we would not expect it to be anything other than 0 anyway, so
it is not interesting to report or interpret.
Another point that is worth noting is that to get all of this
information, we actually had to fit three different models and make
various comparisons between all three of them. That is quite a bit of
effort. If we had a model with multiple predictors, fitting three
models to test each predictor would be a lot of effort.
The `modelTest()` function does essentially all of this for
us. Let's run it and then use `APAStyler()` to get some nicely
formatted output.
```{r}
APAStyler(modelTest(m.abc))
```
Previously, we saw this output and focused only on the fixed and
random effects. Now let's look at some of the other output. We have
the model AIC and BIC provided, which can be helpful for comparing
different models. We have the overall model Marginal and
Conditional R2 and F2. We also have effect sizes and some p-values
labeled "Effect Sizes". Specifically:
* stress (Fixed + Random) - this contrasts the model with stress as
  both a fixed and random effect to a model without stress as either a
fixed or random effect (m.a vs m.abc in our previous work). The
`Est` column lists the Marginal/Conditional F2, here Marginal = 0.11
and Conditional = 0.14. The p-value comes from a LRT comparing the
two models. This p-value, p < .001, is the overall test of whether
adding both the fixed and random slope of stress improves the LL of
the model.
* stress (Random) - this contrasts the model with stress as both a
fixed and random effect to a model without stress as a random slope
but with stress as a fixed effect (m.ab vs m.abc in our previous
  work). The `Est` column lists the Marginal/Conditional F2; here, for
  a random effect only, the Marginal F2 is not interesting but the
  Conditional F2 is useful. The p-value again comes from a LRT testing
  whether the addition of stress as a random slope significantly
  improves the LL. Although caution is always warranted on p-values
for variances of random effects, it can be a helpful guide and the
LRT is about the best possible p-value that can be computed (short
of a bootstrapped one).
Thus, using `modelTest()` we can actually get by default a lot of the
comparisons and effect sizes that we may be interested in and we can
get effect size estimates *for specific predictors*. A test of stress
only as a fixed effect is not included because it does not make sense
to test the addition of a fixed effect onto a model with a random
effect but no fixed effect.
`APAStyler()` can take a list of models as well, so it is possible to
make even more comparisons. For example, we might wonder about the
impact of a covariate, say age.
```{r}
m.abcov <- update(m.abc, . ~ . + age)
APAStyler(list(
Unadjusted = modelTest(m.abc),
Adjusted = modelTest(m.abcov)))
```
Now we get all the estimates for stress as before, repeated for the
model that also adjusts for age. We can see that age has an effect
size of basically 0, whether marginal or conditional. We can also see
that both the AIC and BIC favor the unadjusted model and that the
fixed effect coefficient and the standard deviation for the random
stress slope are basically unchanged with or without age in the model.