---
title: "Linear Mixed Models (LMMs) - Model Comparisons"
author: "Joshua F. Wiley"
date: "`r Sys.Date()`"
output:
  tufte::tufte_html:
    toc: true
    number_sections: true
---
Download the raw `R` markdown code here
[https://jwiley.github.io/MonashHonoursStatistics/LMM_Comparison.rmd](https://jwiley.github.io/MonashHonoursStatistics/LMM_Comparison.rmd).
These are the `R` packages we will use.
```{r setup}
options(digits = 5)
library(data.table)
library(JWileymisc)
library(extraoperators)
library(lme4)
library(lmerTest)
library(multilevelTools)
library(visreg)
library(ggplot2)
library(ggpubr)
library(haven)
## load data collection exercise data
## Merged.sav is a merged long dataset of baseline and daily data
dm <- as.data.table(read_sav("Merged.sav"))
```
# Comparing Models
For many statistical models, including LMMs, it can be informative to
compare different models. Model comparisons can be used in many
different ways. Here are some examples, although they are not
meant to be exhaustive.
* Evaluate which of two (or more) models provides the best fit to the
data
* Evaluate / compare how the results for a particular predictor(s) of
interest change across two (or more) models
* Calculate the significance of multiple predictors
* Calculate effect sizes for one or multiple predictors
We will look at examples of the different uses of model comparisons in
this topic. In fact, we already saw one example in the topic on
Moderation where we compared the results, by eye, from two models that
differed only in whether some extreme values were included or
excluded. This is a type of model comparison, and what we learned in
that example was that the results of the model
**were not sensitive to the two extreme values**.
Just as there are many different kinds of models we can fit, even with
LMMs (e.g., with or without random slopes, etc.), so too there are many
different kinds of, and purposes for, model comparisons.
To begin with, let's imagine we are just trying to compare two models:
Model A and Model B. We can broadly classify the type of comparison
based on whether Model A and B are nested or non-nested models. We
will talk about what these mean next.
## Nested Models
Models are considered nested when one model is a restricted or
constrained version of another model. For example, suppose that
**Model A** predicts mood from stress and age whereas
**Model B** predicts mood from age only. Written as a formula,
these could be:
$$
ModelA: mood_{ij} = b_{0j} + b_{1} * age_j + b_{2} * stress_{ij} + \varepsilon_{ij}
$$
and
$$
ModelB: mood_{ij} = b_{0j} + b_{1} * age_j + 0 * stress_{ij} + \varepsilon_{ij}
$$
In **Model B**, I purposely used $0 * stress_{ij}$ to highlight that
when a predictor is left out of a model, it is the same as fixing
(sometimes also called constraining) the coefficient ($b_{2}$ in this
case) to be exactly 0. In this case, we would say that
**Model B** is nested within **Model A**. In other words,
**Model A** contains every predictor and parameter in **Model B** plus
more.
This idea is similar to the idea of nested data used in LMMs, but the
difference is that we are not talking about observations or
data; rather, we are talking about the parameters of a model.
To summarize, briefly, we say that **Model B** is *nested* within
**Model A** if:
* setting one or more parameters in **Model A** to 0 yields the same
model as **Model B**.
* both models use the same data, have the same number of observations,
etc. Even if the same dataset is used in `R`, this does not
guarantee the same data are used because if additional predictors
are included in **Model A** and these extra predictors have some
missing data, by default `R` would drop those cases, resulting in
fewer observations being used for **Model A** than for **Model B**.
If two models are nested, then we have the most options in terms of
comparing the two models. For example, we can evaluate whether
**Model A** is a statistically significantly better fit than is
**Model B** using a Likelihood Ratio Test (LRT)^[https://en.wikipedia.org/wiki/Likelihood-ratio_test].
We can compare the fit of each model and use the difference in fit to
derive effect sizes. We also can attribute any differences in the two
models to the parameter(s) that have been constrained to 0 in **Model
B** from **Model A**.
A simple definition of the LRT test statistic, $\lambda$, is based on
two times the difference in the log likelihoods.
$$
\lambda = -2 * (LL_B - LL_A)
$$
You may wonder why the likelihood *ratio* test can be conducted by
taking the difference in the log likelihoods. It is because the log of
a ratio is the same as the difference in the logs of the numerator and
denominator.
$$
log_{e}\left(\frac{6}{2}\right) = log_{e}(6) - log_{e}(2)
$$
which we can confirm is true in `R`:
```{r}
## log of ratio
log(6/2)
## difference in logs
log(6) - log(2)
```
If the null hypothesis of no difference is true in the population,
then $\lambda$ will follow a chi-squared distribution with degrees of
freedom equal to the number of parameters constrained to 0 in **Model
B** relative to **Model A** (i.e., the difference in the degrees of
freedom used by each model), that is:
$$
\lambda \sim \chi^2(DF_A - DF_B)
$$
Thus we often use a chi-square distribution in the LRT to look up the
p-value. Finally, note that because LRTs are based on the log
likelihoods (LL) from a model, we need true log likelihoods for the
LRT to be valid. Therefore, we cannot use restricted maximum
likelihood (REML); we need to use maximum likelihood (ML) estimation.
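Note that if a model has already been fit with REML (the default in
`lmer()`), it can be refit with ML before computing a LRT. Below is a
minimal sketch (the object names are just for illustration):
```{r}
## sketch: a model fit with the REML default, then refit with ML
m_reml <- lmer(mood ~ age + (1 | ID), data = dm)  # REML = TRUE by default
m_ml <- refitML(m_reml)                            # same model, ML estimates
logLik(m_reml)  # REML criterion, not a true log likelihood
logLik(m_ml)    # ML log likelihood, usable for a LRT
```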
### Nested Models in `R`
To see the idea of nested models and the LRT in action, let's examine
a concrete example in `R`. Here are two LMMs corresponding to the
**Model A** and **Model B** formulas we wrote previously. We can see in
the `R` code that the models are nested; the only difference is
`stress`. We set `REML = FALSE` to get maximum likelihood estimates so
that we get true log likelihoods. We also need to confirm that the two
models are based on the same number of observations. We can extract
just this information in `R` using the `nobs()` function. This lets
us confirm that the two models are fitted to the same data. For
example, if stress had some missing data on rows that were not missing
age or mood, **Model A** would be based on fewer observations than
**Model B**.
```{r}
modela <- lmer(mood ~ age + stress + (1 | ID), data = dm, REML = FALSE)
modelb <- lmer(mood ~ age + (1 | ID), data = dm, REML = FALSE)
nobs(modela)
nobs(modelb)
```
In this case, we can see that the numbers of observations are
identical. Now we can find the log likelihoods of both models by using
the `logLik()` function.
```{r}
logLik(modela)
logLik(modelb)
```
Now we have the log likelihoods (LL) of each model and the degrees of
freedom from both models. From this, we can calculate $\lambda$ and
then look up the p-value for a chi-square distribution with $\lambda$
and 1 degree of freedom (1 is the difference in degrees of freedom
between Model A and B). To get the p-value from a chi-square, we use
the `pchisq()` function.
```{r}
## lambda
-2 * (-290.86 - -281.71)
## p-value from a chi-square
pchisq(18.3, df = 1, lower.tail = FALSE)
```
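The same values can be pulled directly from the fitted model objects
rather than typed in by hand; a small sketch:
```{r}
## lambda computed directly from the two fitted models
lambda <- as.numeric(-2 * (logLik(modelb) - logLik(modela)))
lambda
## degrees of freedom for the test: the difference in model degrees of freedom
df_diff <- attr(logLik(modela), "df") - attr(logLik(modelb), "df")
## p-value from the chi-square distribution
pchisq(lambda, df = df_diff, lower.tail = FALSE)
```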
In practice, we do not need to do any of these steps by hand; we can
get a test of two nested models in `R` using the `anova()` function
(which in this case is not really analyzing variance). We use the
`anova()` function with `test = "LRT"` to get a likelihood ratio
test.
```{r}
anova(modela, modelb, test = "LRT")
```
If you compare the chi-square value, degrees of freedom and p-value,
you'll see that they basically match what we calculated by hand. The
small differences are due to rounding (`R` will use more decimals of
precision whereas we only used two decimals).
In this simple case, we are only testing a single parameter, the fixed
regression coefficient for stress, because that is the only parameter
that differs between **Model A** and **Model B**. Thus in this case,
it would be easier to rely on the t-test we get from a summary of the
model.
```{r}
summary(modela)
```
In this instance, the LRT and the t-test yield equivalent
p-values, 1.9e-05; however, this does not have to be the case. The LRT
is based on slightly different theory than the t-test, and the t-test
uses approximate degrees of freedom for the t-distribution based on the
Satterthwaite method, whereas the LRT does not directly incorporate
the sample size in the same way. The two methods are
asymptotically equivalent (at very large sample sizes) but can
differ, particularly in smaller samples.
In practice, we wouldn't usually use a LRT to evaluate whether a
single model parameter is statistically significant. The benefit of
(nested) model comparisons is that it allows us to compare two
models. Those models can be quite different.
Here are another two models; this time they differ by two predictors,
energy and female. The LRT now tests whether the addition of energy
and female results in significantly better fit for **Model A** than
**Model B**.
```{r}
modela <- lmer(mood ~ age + stress + energy + female + (1 | ID),
data = dm, REML = FALSE)
modelb <- lmer(mood ~ age + stress + (1 | ID), data = dm,
REML = FALSE)
nobs(modela)
nobs(modelb)
anova(modela, modelb, test = "LRT")
```
In this case, we can see from the significant p-value that **Model A**
is a significantly better fit to the data than is **Model B**.
Note that LRTs are only appropriate when the models are nested. We
cannot use LRTs for non-nested models.
In nested models, the more complex model is often called the "Full"
model and the simpler model the "Reduced" or "Restricted" version of
the Full model.
## Non Nested Models
Models are considered non nested when one model is not strictly a
constrained version of a more complex model. For example, suppose that
**Model A** predicts mood from stress and age whereas
**Model B** predicts mood from age and female. Written as a formula,
these could be:
$$
ModelA: mood_{ij} = b_{0j} + b_{1} * age_j + b_{2} * stress_{ij} + \varepsilon_{ij}
$$
and
$$
ModelB: mood_{ij} = b_{0j} + b_{1} * age_j + b_{3} * female_j + 0 * stress_{ij} + \varepsilon_{ij}
$$
Although **Model B** includes age, which also is present in **Model
A**, and restricts stress to 0, it adds another predictor,
female, which is not in **Model A**. Therefore, the two models are
not nested. Even though we can fit both models on the same number of
observations with the same outcome, the models are still not
nested. Although we can still ask `R` to conduct a LRT, this LRT is
not valid. It is shown here to highlight that you, as the analyst,
are responsible for determining whether your two models are nested
and therefore for deciding whether a LRT is an appropriate way to
evaluate whether the models are significantly different from each
other.
```{r}
modela <- lmer(mood ~ age + stress + (1 | ID),
data = dm, REML = FALSE)
modelb <- lmer(mood ~ age + female + (1 | ID), data = dm,
REML = FALSE)
nobs(modela)
nobs(modelb)
## this is NOT appropriate LRT
anova(modela, modelb, test = "LRT")
```
Although we cannot conduct a LRT on non nested models, it *is* still
useful to compare the fit of non-nested models. For example, if one is
a much better fit than another model, that may suggest one set of
predictors is superior or can be used to evaluate competing hypotheses
(e.g., Theory 1 says that stress and family history are the most
important predictors of mood and Theory 2 says that age and sleep are
the best predictors of mood --- these two theories are competing, not
nested versions of each other).
We cannot use LRTs to compare non nested models, but we can use other
measures, including performance measures such as variance explained or
model accuracy and information criteria. We will talk about
information criteria next.
## Information Criteria
Two common information criteria are:
* AIC: the Akaike Information Criterion
(AIC)^[https://en.wikipedia.org/wiki/Akaike_information_criterion]
* BIC: the Bayesian Information Criterion
(BIC)^[https://en.wikipedia.org/wiki/Bayesian_information_criterion]
also sometimes called the Schwarz Information Criterion
Both the AIC and BIC are calculated based primarily on the log
likelihood, LL, of a model. One way of thinking about these
information criteria is to view models as ways of approximating
reality. Suppose you have two models that both
generate approximations (predictions) of reality. The model whose
predictions are closer to the observed data will have a higher LL.
The LL can be used as a *relative* measure of model fit. LL **is not**
an absolute measure of fit. That is, we do not interpret a specific LL
value as indicating "good" fit; it only tells us which of a set of (all
potentially bad) models fits best.
However, there is a limitation with using LL alone. The LL will always
stay the same or increase as additional predictors / parameters are
added to the model. Thus if we use the LL alone, out of a set of
competing models, we would virtually always pick the more complex
models. To address this, we need to incorporate some penalty for the
complexity of a model, so that a more complex model is chosen over a
simpler model as the "best" only if it improves the LL *enough*.
The AIC and BIC are very similar except that they use different
penalties for model complexity (technically they are derived from
rather different theoretical foundations, but for practical purposes
they are similar other than the complexity penalties).
A common way of defining model complexity is based on the number of
estimated model parameters, $k$. For example, consider the following
simple linear regression for $n$ different observations:
$$
\begin{align}
\eta_{i} &= b_0 + b_1 * x_i\\
y_i &\sim \mathcal{N}(\eta_i, \sigma_{\varepsilon})\\
\end{align}
$$
The parameters are:
$$
b_0, b_1, \sigma_{\varepsilon}
$$
so $k = 3$.
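As a quick check on this way of counting, `R` stores the parameter
count it uses for AIC and BIC as the `df` attribute of the log
likelihood. Here is a small sketch using a simple linear regression on
the built-in `mtcars` data (chosen purely for illustration):
```{r}
## simple linear regression: intercept, one slope, and the residual SD
fit <- lm(mpg ~ hp, data = mtcars)
## the "df" attribute of the log likelihood is the parameter count k
attr(logLik(fit), "df")
```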
The equations for AIC and BIC are quite easy to follow and looking at
them helps understand where they are similar and different.
$$
\begin{align}
AIC &= 2 * k - 2 * LL \\
BIC &= log_{e}(n) * k - 2 * LL \\
\end{align}
$$
$n$ is the number of observations included in the model and $LL$ is
the log likelihood, where higher values indicate a better fitting
model. These equations highlight that the only difference between the
AIC and BIC is whether the number of parameters in the model, $k$, is
multiplied by $2$ (AIC) or $log_{e}(n)$ (BIC). Thus, the AIC and BIC
will be identical if:
$$
\begin{align}
log_{e}(n) &= 2 \\
n &= e^2 \approx 7.39 \\
\end{align}
$$
If $n < e^2 \approx 7.39$ then the BIC will have a weaker penalty
based on the number of parameters, $k$, than the AIC.
If $n > e^2 \approx 7.39$ then the BIC will have a stronger penalty
based on the number of parameters, $k$, than the AIC.
Functionally, a stronger penalty on the number of parameters means
that a particular information criterion will tend to favor more
parsimonious (less complex) models. Thus, for all but the tiniest of
sample sizes ($n > e^2 \approx 7.39$), the BIC will have a stronger penalty on
model complexity and so will favor relatively more parsimonious models
than will AIC.
For both AIC and BIC, the *relatively* better model of those compared
is the model with the lower value (i.e., lower = better for both AIC
and BIC).
**There is no "right" or "wrong" information criterion to use.**
People use both the AIC and/or the BIC. If both AIC and BIC suggest
the same model is the "best" there is no ambiguity. Sometimes the AIC
and BIC disagree regarding which model is the best. In these cases one
must pick which information criterion to go with. Both criteria
require that the number of observations, $n$, be larger than the
number of parameters, $k$, for them to operate well.
A benefit of the AIC and BIC is that both can be used to compare non
nested models and they also can be used to compare nested models. It
is relatively common to use AIC and/or BIC to "choose" amongst a set
of possible models.
In `R` we can usually calculate AIC and BIC using the functions,
`AIC()` and `BIC()`. Here we will calculate the AIC and BIC for our
two models.
```{r}
AIC(modela, modelb)
BIC(modela, modelb)
```
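To see that these are the same quantities defined by the equations
above, we can also reconstruct the AIC and BIC by hand from the log
likelihood; a small sketch using `modela` (the results should match
`AIC()` and `BIC()` up to rounding):
```{r}
ll <- logLik(modela)
k <- attr(ll, "df")  # number of estimated parameters
n <- nobs(modela)    # number of observations
## AIC = 2k - 2LL; BIC = log(n)k - 2LL
c(AIC = 2 * k - 2 * as.numeric(ll),
  BIC = log(n) * k - 2 * as.numeric(ll))
```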
In this case, both the AIC and BIC are lower for **Model A** than for
**Model B**, indicating that **Model A** is the better of those two
models. Again, the relative interpretation is important. We cannot
conclude that **Model A** is a "good" model, only that it is better
than **Model B**.
## Model Selection
Integrating what we have learned about LRT for nested models and
AIC/BIC for nested or non nested models, we can compare across several
models to select the best model.
First, we fit a series of models. In all cases, mood is the
outcome. We start with an intercept only model (m0), an unconditional
model, as a baseline reference. This is helpful because all the fit
indices are relative. Then we fit polynomials of stress with degree 1
(linear; m1), degree 2 (quadratic; m2), and degree 3 (cubic; m3).
**If you are not familiar with polynomials, please
see the extra section at the end of this lecture now.**
We fit different degree polynomials of stress using the `poly()`
function in `R`. Finally we fit a competing model, maybe a linear
effect of energy is a better predictor of mood than stress.
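If you want a quick sense of what `poly()` produces, it returns
orthogonal polynomial terms, one column per degree. A tiny
illustrative sketch on a toy sequence:
```{r}
## orthogonal linear and quadratic terms for a toy predictor
head(poly(1:10, 2))
```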
Next, we check that all the observations are identical across
models.
**If the observations were not the same across models, we would need
to create a new dataset that excludes any rows with missing data on
any variable used in any of the models** (a sketch of this follows the
next code chunk). This is a critical step when needed, because LRTs,
AIC, and BIC are only valid if based on the same data.
```{r}
dm[, stress := as.numeric(stress)]
dm[, mood := as.numeric(mood)]
m0 <- lmer(mood ~ 1 + (1 | ID), data = dm, REML = FALSE)
m1 <- lmer(mood ~ poly(stress, 1) + (1 | ID), data = dm, REML = FALSE)
m2 <- lmer(mood ~ poly(stress, 2) + (1 | ID), data = dm, REML = FALSE)
m3 <- lmer(mood ~ poly(stress, 3) + (1 | ID), data = dm, REML = FALSE)
malt <- lmer(mood ~ energy + (1 | ID), data = dm, REML = FALSE)
## check all the observations are the same
nobs(m0)
nobs(m1)
nobs(m2)
nobs(m3)
nobs(malt)
```
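In this example the numbers of observations match, so nothing more is
needed. If they had differed, one option is listwise deletion across
every variable used in any of the candidate models before refitting; a
minimal sketch (the new object `dmc` is hypothetical and the variable
list is taken from the models above):
```{r}
## keep only rows complete on every variable used in any of the models
dmc <- na.omit(dm, cols = c("mood", "stress", "energy", "ID"))
nrow(dmc)
```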
Now we can see which model is the best fit. For the nested models
(m0 - m3) we can use LRTs. Technically, m0 also is nested in malt, so
we can use a LRT for that too. We cannot use a LRT for, say, m1 and
malt as those models are not nested. We can use AIC and BIC for all
models, though. Note that with multiple models, the `anova()` function
is doing sequential LRT tests (e.g., m0 vs m1; m1 vs m2, etc.) not all
compared to one model. If you want two specific models compared,
specify just two.
```{r}
#### LRTs for nested models ####
## two specific comparisons
anova(m0, m3, test = "LRT")
## sequential comparisons
anova(m0, m1, m2, m3, test = "LRT")
anova(m0, malt, test = "LRT")
## AIC and BIC for nested and non nested models
AIC(m0, m1, m2, m3, malt)
BIC(m0, m1, m2, m3, malt)
```
From the LRT we can see that m3 is a significantly better fit than
m0. Looking at the sequential tests, however, m2 is no better than m1
and m3 is no better than m2, which might suggest m1 is the best
model. If we look at the AIC values, for the models with stress, AIC
selects m2 as the best model (by .05 compared to m1, very close). BIC
favors more parsimonious models and selects m1 over m2.
However, both AIC and BIC indicate that the alternate model (malt) is
the best of all the models evaluated, suggesting that linear energy is a
better predictor of mood than is linear, quadratic, or cubic
polynomials of stress.
If we were picking a stress model only, we would pick between the
linear (m1; LRT, BIC) and quadratic (m2; AIC) model, depending which
method we used. If we were willing to pick energy over stress, we
would clearly go with energy.
In addition to testing fixed effects, these same methods can be used
to test random effects. Here we will fit a new model, m4, that includes
a random slope of stress and the correlation between the random
intercept and stress slope. Then we compare the linear stress with
random slope (m4), linear stress without random slope (m1) and
intercept only model (m0).
```{r}
m4 <- lmer(mood ~ stress + (1 + stress | ID), data = dm, REML = FALSE)
## check all the observations are the same
nobs(m0)
nobs(m1)
nobs(m4)
anova(m0, m1, test = "LRT")
anova(m1, m4, test = "LRT")
anova(m0, m4, test = "LRT")
AIC(m0, m1, m4)
BIC(m0, m1, m4)
```
From these tests we can conclude several things. Firstly, adding a
linear trend of stress as a fixed effect is significantly better than
the intercept only model. Secondly, adding a random slope of stress
and the correlation of the random slope and intercept is not
significantly better than the fixed linear effect of stress
model. Thirdly, the model with stress as both a fixed and random slope
is significantly
better than the intercept only model (m4 vs m0 is kind of an omnibus
test of the effect of adding stress as both a fixed and random slope
into the model). However, this multi degree of freedom test masks the
fact that the effect is really just driven by the fixed slope of
stress, not the random slope.
Finally, both AIC and BIC favor m1 over the other models, BIC by
several points, suggesting that m1 is the best balance of complexity
and model fit.
The main point of this final exercise is to show how you can use LRTs
and AIC and BIC to evaluate whether random effects "help" a model
fit.
# Effect Sizes
As with other statistical models, relying solely on statistical
significance is limiting as it ignores the magnitude of an effect.
Simple $R^2$ measures of variance accounted for are problematic for
LMMs; however, several approaches have recently been put forward.
To understand these measures, let's go back to a relatively simple LMM
with a fixed effect predictor and a random intercept only.
$$
y_{ij} = b_{0j} + b_1 * x_j + \varepsilon_{ij}
$$
Recall that we separate the random intercept as:
$$
b_{0j} = \gamma_{00} + u_{0j}
$$
substituting we get:
$$
y_{ij} = \gamma_{00} + u_{0j} + b_1 * x_j + \varepsilon_{ij}
$$
and re-arranging slightly, we can break the model down into
essentially three "parts":
$$
y_{ij} = \overbrace{\gamma_{00} + b_1 * x_j}^\text{fixed} + \underbrace{u_{0j}}_\text{random} + \overbrace{\varepsilon_{ij}}^\text{residual}
$$
For each of these three parts, we can calculate the variance
associated with them. First the variance of the fixed effects portion
of the model:
$$
\sigma^{2}_{fixed} = Var(\gamma_{00} + b_1 * x_j)
$$
Next the variance of the random effects part of the model, here the
random intercept:
$$
u_{0j} \sim \mathcal{N}(0, \sigma^{2}_{u_{0}})
$$
Finally, the variance of the residuals (the unexplained part):
$$
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2}_{\varepsilon})
$$
Nakagawa and Schielzeth
(2013; https://doi.org/10.1111/j.2041-210x.2012.00261.x)
defined two variance explained measures based on these
components as follows:
$$
R^{2}_{LMM(m)} = \frac{\sigma^{2}_{fixed}}{\sigma^{2}_{fixed} +
\sigma^{2}_{u_{0}} + \sigma^{2}_{\varepsilon}}
$$
and
$$
R^{2}_{LMM(c)} = \frac{\sigma^{2}_{fixed} + \sigma^{2}_{u_{0}}}{\sigma^{2}_{fixed} +
\sigma^{2}_{u_{0}} + \sigma^{2}_{\varepsilon}}
$$
As it is difficult to write $R^{2}_{LMM(m)}$, I refer to this as the
Marginal R2. It captures the proportion of the total variance
attributable to the fixed effects portion of the model.
Likewise, $R^{2}_{LMM(c)}$ is referred to as the Conditional R2. It
captures the proportion of the total variance attributable to both the
fixed and random effects portion of the model (i.e., it only excludes
the residual variance).
The Conditional R2 essentially measures the total proportion of
variability explainable by the model. The Marginal R2 measures the
total proportion of variability explainable only by the fixed effects,
the 'average' or marginal effect in the population. Both are useful in
different circumstances. Because the Conditional R2 incorporates
random effects, it is **not** a measure of how much variance we could
predict in new data. For example, if we had built a LMM and then
applied the model to data on a new patient who came to a clinic,
without their data we will not know what their random effects should
be, so we could only apply the average effects, the fixed
effects. Thus, the Marginal R2 can give us a sense of how much
variance we might explain using only the averages from our model and
how much we might explain if we had data on someone new.
The Conditional R2 uses both the fixed and random effects and can tell
us how much variance, essentially in our own sample, our model is able
to explain.
One caveat: it is technically possible for the Marginal R2 to
**decrease** when more predictors are added to the model. This does
not happen with $R^2$ in linear regression, but it can happen with the
Marginal R2, although it generally does not. This is simply a
limitation of how LMMs are estimated.
Also note that if you have an intercept only model, that is a model
with only a fixed and a random intercept, there will be no variability
associated with the fixed effects (since there is only an intercept
and it does not vary) and in this special case, the Conditional R2 is
identical to the ICC.
Although we looked at the formula for Marginal R2 and Conditional R2
from a LMM with only a random intercept and no other random effects,
Johnson (2014; https://dx.doi.org/10.1111/2041-210X.12225) has shown
they can be extended to random slope models. Although additional
random components are included, the interpretations remain the same:
the Marginal R2 is the proportion of total variance attributable to
the fixed effects only, and the Conditional R2 is the proportion of
total variance attributable to the model as a whole, the fixed and
random effects together.
In `R`, we can calculate these measures using the `R2()` function.
To start with, let's look at an intercept only model and compare the
R2 results to the ICC.
```{r}
m0b <- lmer(mood ~ 1 + (1 | ID), data = dm)
summary(m0b)
R2(m0b)
iccMixed("mood", "ID", data = dm)
```
Here we can see that the Marginal R2 is 0, which makes sense as we
have no fixed effects predictors, only the fixed effect intercept,
which does not vary (because it is *fixed*). The Conditional R2 is
equal to the ICC, because it captures the proportion of total variance
attributable to the fixed and random effects, but the fixed effects
variance is 0, so it is just the variance of the random intercept over
the total variance, which is the equation for the ICC.
Now let's look at a more complex example. Here we use the model with a
linear stress trend predicting mood as a fixed effect with a random
intercept.
```{r}
summary(m1)
R2(m1)
```
Here we can see that the fixed effect of stress explains about 10% of
the variance in mood (Marginal R2). The model overall explains about
37% of the variance in mood (Conditional R2).
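As a check on where these numbers come from, the Marginal and
Conditional R2 can be reconstructed by hand from the variance
components defined above; a minimal sketch for `m1` (the values may
differ slightly from `R2()` depending on implementation details):
```{r}
## variance of the fixed-effects part of the linear predictor
sigma2_fixed <- var(as.vector(model.matrix(m1) %*% fixef(m1)))
## random intercept variance and residual variance
sigma2_u0 <- as.numeric(VarCorr(m1)$ID[1, 1])
sigma2_e <- sigma(m1)^2
total <- sigma2_fixed + sigma2_u0 + sigma2_e
c(MarginalR2 = sigma2_fixed / total,
  ConditionalR2 = (sigma2_fixed + sigma2_u0) / total)
```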
We can also calculate effect sizes that are equivalent to Cohen's
$f^2$ effect size. Recall that:
$$
f^2 = \frac{R^2}{1 - R^2}
$$
So we can define
$$
f^2_{LMM(m)} = \frac{R^2_{LMM(m)}}{1 - R^2_{LMM(m)}}
$$
and
$$
f^2_{LMM(c)} = \frac{R^2_{LMM(c)}}{1 - R^2_{LMM(c)}}
$$
which are the Marginal and Conditional F2 effect sizes. We can get all
of these and other model performance measures, like AIC and BIC, using
the `modelPerformance()` function.
```{r}
modelPerformance(m1)
```
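As a quick arithmetic check, the F2 values can be reproduced from the
approximate R2 values reported above for this model:
```{r}
## Marginal and Conditional F2 from the (rounded) R2 values for m1
r2_marginal <- 0.103
r2_conditional <- 0.371
c(f2_marginal = r2_marginal / (1 - r2_marginal),
  f2_conditional = r2_conditional / (1 - r2_conditional))
```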
# Effect Sizes for Differences
Often we are interested in the effect size for a particular model
parameter or a particular variable in our model. For example, we may
want to know the unique effect size for stress predicting
mood. The overall model effect sizes are not the right answer to this
question. We need to calculate how much the model performance
*changed* when just the parameter(s) of interest changed. That is, we
need to compare two, nested models. For these, it is best to use ML,
*not* REML. Here we will fit two models with m.a nested within m.ab.
This way, we can ensure that the only difference between the two
models is the addition of a fixed effect slope of stress.
```{r}
m.ab <- lmer(mood ~ stress + (1 | ID), data = dm, REML = FALSE)
m.a <- lmer(mood ~ 1 + (1 | ID), data = dm, REML = FALSE)
nobs(m.ab)
nobs(m.a)
```
Now we can define a specific effect size:
$$
f^2 = \frac{R^2_{AB} - R^2_{A}}{1 - R^2_{AB}}
$$
For simplicity, I am just writing the formula once, but we can
calculate a Marginal F2 and Conditional F2 by using Marginal R2 or
Conditional R2. The key here is the numerator: we calculate the
unique variance explained as the difference in variance explained
between our two models. If the two models differ
*only* by one parameter/predictor as in this case, then the effect
size is for one specific predictor. However, we can use this to
calculate an effect size for multiple parameters. Later, we will see
some cases where we would care about this.
Let's calculate this now for stress.
```{r}
R2(m.ab)
R2(m.a)
## Marginal F2 for stress
(.10330 - 0) / (1 - .10330)
## Conditional F2 for stress
(.37064 - .31954) / (1 - .37064)
```
Let's extend this. Suppose we added stress as both a fixed and random
slope and wanted to get the *overall* effect size for stress.
```{r}
m.abc <- lmer(mood ~ stress + (1 + stress | ID), data = dm, REML = FALSE)
R2(m.abc)
R2(m.a)
## Marginal F2 for stress OVERALL (fixed + random slope vs intercept only)
(0.09810 - 0) / (1 - 0.09810)
## Conditional F2 for stress OVERALL (fixed + random slope vs intercept only)
(0.40187 - .31954) / (1 - 0.40187)
```
We could also ask how much the effect size changes when we add
stress as a random slope on top of the model with stress as a fixed
effect slope only.
```{r}
R2(m.abc)
R2(m.ab)
## Marginal F2 for random effect of stress
(0.09810 - 0.10330) / (1 - 0.09810)
## Conditional F2 for random effect of stress
(0.40187 - 0.37064) / (1 - 0.40187)
```
In this case, we get the (seemingly odd) negative value for the
Marginal F2 and a small effect size for the Conditional F2.
According to Cohen's (1988) guidelines, which should be treated as
rough rules of thumb only:
* $f^2 \geq 0.02$ is small
* $f^2 \geq 0.15$ is medium
* $f^2 \geq 0.35$ is large

Whether those cut-offs translate well to LMMs and the Marginal and
Conditional F2 (which certainly was not the context Cohen would have
imagined them being applied to) is unclear.
We could also use LRTs to test the overall (fixed + random) or random
only effect of stress as these are nested models.
```{r}
## overall effect (fixed + random)
anova(m.a, m.abc, test = "LRT")
## random effect only
anova(m.ab, m.abc, test = "LRT")
```
In the case of a variable like stress, which has not been decomposed
into a purely between person and a purely within person variable, it
could explain variability from both the random intercept and the
residual variance, even as a fixed effect. Further, because the
Conditional F2 includes *both* the fixed and random variance from the
model, a fixed effect *can* influence the Conditional F2.
The random effect alone, however, would not be expected to be
associated with any improvement in the Marginal F2, so we'd expect the
Marginal F2 for the random effect alone to be about 0, with the random
effect showing up only in the Conditional F2. We can see that overall adding
a fixed and random effect of stress is statistically significant and
that overall (fixed + random slope of stress; p < .001), we get small
to moderate (by Cohen's guidelines, dubiously applied to LMMs) effect
sizes on both the Marginal F2 (.109) and the Conditional F2 (.138).
We can also see that the addition of a random stress slope beyond the
fixed effect stress slope is not statistically significant (p = .18)
and is associated with only a relatively small Conditional F2 (.052).
Note that I am not commenting on the Marginal F2 for the random effect
only, as we would not expect it to be anything other than 0 anyway, so
it is not interesting to report or interpret.
Another point that is worth noting is that to get all of this
information, we actually had to fit three different models and make
various comparisons between all three of them. That is quite a bit of
effort. If we had a model with multiple predictors, fitting three
models to test each predictor would be a lot of effort.
The `modelTest()` function does essentially all of this for
us. Let's run it and then use `APAStyler()` to get some nicely
formatted output.
```{r}
APAStyler(modelTest(m.abc))
```
Previously, we saw this output and focused only on the fixed and
random effects. Now let's look at some of the other output. We have
the model AIC and BIC provided, which can be helpful for comparing
different models. We have the overall model Marginal and
Conditional R2 and F2. We also have effect sizes and some p-values
labeled "Effect Sizes". Specifically:
* stress (Fixed + Random) - this contrasts the model with stress as
  both a fixed and random effect to a model without stress as either a
fixed or random effect (m.a vs m.abc in our previous work). The
`Est` column lists the Marginal/Conditional F2, here Marginal = 0.11
and Conditional = 0.14. The p-value comes from a LRT comparing the
two models. This p-value, p < .001, is the overall test of whether
adding both the fixed and random slope of stress improves the LL of
the model.
* stress (Random) - this contrasts the model with stress as both a
fixed and random effect to a model without stress as a random slope
but with stress as a fixed effect (m.ab vs m.abc in our previous
  work). The `Est` column lists the Marginal/Conditional F2; here, for
  a random effect only, the Marginal F2 is not interesting but the
  Conditional F2 is useful. The p-value again comes from a LRT testing
  whether the addition of stress as a random slope significantly
  improves the LL. Although caution is always warranted on p-values
for variances of random effects, it can be a helpful guide and the
LRT is about the best possible p-value that can be computed (short
of a bootstrapped one).
Thus, using `modelTest()` we can actually get by default a lot of the
comparisons and effect sizes that we may be interested in and we can
get effect size estimates *for specific predictors*. A test of stress
only as a fixed effect is not included because it does not make sense
to test the addition of a fixed effect onto a model with a random
effect but no fixed effect.
`APAStyler()` can take a list of models as well, so it is possible to
make even more comparisons. For example, we might wonder about the
impact of a covariate, say age.
```{r}
m.abcov <- update(m.abc, . ~ . + age)
APAStyler(list(
Unadjusted = modelTest(m.abc),
Adjusted = modelTest(m.abcov)))
```
Now we get all the estimates for stress as before, repeated for the
model that also adjusts for age. We can see that age has an effect
size of basically 0, whether marginal or conditional. We can also see
that both the AIC and BIC favor the unadjusted model and that the
fixed effect coefficient and the standard deviation for the random
stress slope are basically unchanged with or without age in the model.