---
title: "Multilevel Data Modelling"
---
*Recommended literature:*
Snijders, T. A., & Bosker, R. J. (2011). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Sage.
## Load libraries {#sec-load.libraries}
```{r load packages, message=FALSE, results='hide'}
library(lme4)
library(lmerTest)
library(mitml)
library(sjlabelled)
library(tidyverse)
library(stargazer) ## comparing models
library(sjmisc)
library(broom.mixed)
library(sjPlot)
library(glmmTMB)
library(modelsummary)
```
## The multilevel issue
- Micro units are nested within macro units
- Micro units affect macro units and vice versa
- Micro units within a macro unit tend to be more similar to each other than to micro units in other macro units
- Observations are therefore no longer independent of each other
- In statistical analysis this affects their variances
- Neglecting this issue likely leads at least to wrong variance estimates, and in the worst case to biased estimates
→ Multilevel analysis is a suitable approach to take into account the social contexts as well as the individual respondents or subjects.
#### Sampling
- Cluster sampling and multi-stage sampling are standard
- Secondary units are not selected independently of each other
- Having selected the primary units increases the chance of sampling the secondary units → ignoring this may heavily affect the variance estimates and may thus produce type I errors (rejecting a true null hypothesis)
{width="200"}
Two arguments for multilevel instead of single-level regression of disaggregated data:
1. Dependence as a nuisance:
- Standard errors and tests based on single-level regression are suspect because the assumption of independent residuals is invalid.
2. Dependence as an interesting phenomenon:
- It is interesting in itself to disentangle variability at the various levels; moreover, this can give insight into the directions where further explanation may fruitfully be sought.
{width="200"}
#### Modelling
The **hierarchical linear model** is a type of regression analysis for multilevel data where the dependent variable is at the lowest level.
- Explanatory variables can be defined at any level (including aggregates of level-one variables).
- Longitudinal data can also be regarded as a nested structure; for such data the hierarchical linear model is convenient as well (with some particularities).
{width="200"}
### Exercise {#sec-ml.ex}
1. Load library `library(lme4)`
Already done, see @sec-load.libraries.
2. Load the data:
```{r}
DAT <- read.csv2("data/1.Example.txt")
```
3. Have a look at the data
How many observations and variables?
```{r}
str(DAT)
```
In the data set, the following is measured for each pupil:
- which school he/she belongs to (schoolnr)
- his/her unique pupil number (pupilNR_new)
- the score the pupil reached in the language test, our dependent variable (langPOST)
- his/her socio-economic background (ses)
- his/her sex (a binary variable)
- whether the pupil belongs to a minority (Minority)
- his/her verbal IQ, centered at the mean (IQ_verb)
And the data set contains measures for each school:
- the school's average socio-economic background (sch_ses)
- the average verbal IQ of all pupils in a school (sch_iqv)
- an index of how many pupils in a school belong to a minority (sch_min)
```{r}
length(unique(DAT$pupilNR_new))
```
3454 observations (pupils) and 11 variables overall. Each pupil has a unique pupil number and is assigned to one of 211 schools.
For an overview, let's get some insight into the descriptives of each variable:
```{r}
summary(DAT)
```
The schoolnr for the 211 schools ranges from 1 to 259.
```{r}
table(is.na.data.frame(DAT))
```
No missing values in the data frame.
Get insight in the distribution of each variable:
```{r}
hist(DAT$ses)
hist(DAT$sch_ses)
hist(DAT$IQ_verb)
hist(DAT$sch_iqv)
hist(DAT$Minority)
hist(DAT$sch_min)
```
- ses is left-skewed: more pupils have a lower ses than the mean; this is not true for schools
- IQ_verb is slightly right-skewed; the range is wider on the negative side than on the positive side, and this is also true for the distribution of school means
- the distribution of the mean IQ has more outliers on the negative side than on the positive side
- fewer than 10 percent of the pupils belong to a minority; there are many schools where none of the pupils belongs to a minority group, while there are some schools where all pupils belong to a minority.
## Random Intercept Model (RIM) or Random Effects Model
- $i$ indicates the individual, the level-one unit
- $j$ indicates the group, the level-two unit
- variables for individual $i$ in group $j$:
  - $Y_{ij}$ dependent variable
  - $x_{ij}$ explanatory variable at level one
- for group $j$:
  - $z_j$ explanatory variable at level two
  - $n_j$ group size\
**OLS regression model** of $Y$ on $x$ ignoring groups
$$
Y_{ij} = \beta_0 + \beta_1x_{ij}+ R_{ij}
$$
$R_{ij}$ is not the error term, it is the residual term: an error term is generally unobservable, while a residual is observable and calculable, making it much easier to quantify and visualize. In effect, an error term represents how the observed data differ from the true population relationship, while a residual represents how the observed data differ from the model fitted to the sample.
The **group-dependent regression model** depends on groups; the following notation means that each level-two unit (group) $j$ has its own intercept and slope:
$$
Y_{ij} = \beta_{0j} + \beta_{1j}x_{ij} + R_{ij}
$$
In the random intercept model, the intercepts $\beta_{0j}$ are random variables representing random differences between groups, where $\beta_{0j}$ equals the average intercept $\gamma_{00}$ (over all level-two units) plus a group-dependent deviation $U_{0j}$:
$$
\beta_{0j} = \gamma_{00} + U_{0j}
$$
In this model, the regression coefficient $\beta_1$ is common to all the groups (otherwise it would be a random slope model).
In a fixed effects model, each group has its own intercept $\beta_{0j}$ to be estimated, because $\beta_{0j}$ is not a random variable.
In the random intercept model, the constant regression coefficient $\beta_1$ is sometimes denoted $\gamma_{10}$ →
$$
Y_{ij} = \gamma_{00} + \gamma_{10}x_{ij} + U_{0j} + R_{ij}
$$\
In the hierarchical linear model, the $U_{0j}$ are random variables. The model assumption is that they are independent, normally distributed with expected value $0$ and variance
$$
\tau^2 = var(U_{0j})
$$
The statistical parameter in the model is not their individual values, but their variance $\tau^2$. The $U_{0j}$ represent the random intercept for each cluster. It is really a residual term that measures the distance of each group's intercept from the overall intercept $\gamma_{00}$. Rather than calculating an estimate for every one of those distances, the model estimates a single variance. This is the **between-group variance**.
The **within-group variance** is measured by the variance of residuals. Their sum is the total variance in $Y$ that is not explained by $X$.
{width="400"}
### Fixed or Random Effects for Groups?
Arguments for choosing between fixed (F) and random (R) coefficient models for the group dummies:
1. If groups are unique entities and inference should focus on these groups: F. This is often the case with a small number of groups. → well suited for countries
2. If groups are regarded as a sample from a (perhaps hypothetical) population and inference should focus on this population: R. This is often the case with a large number of groups.
3. If group sizes are small and there are many groups, and it is reasonable to assume exchangeability of group-level residuals, then R makes better use of the data.
4. If the researcher is interested only in within-group effects, and is suspicious about the model for between-group differences, then F is more robust.
- Within groups we need enough variance and a reasonable distribution
5. If group effects $U_{0j}$ (etc.) are not nearly normally distributed, R is risky (or use more complicated multilevel models).
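As a small side-by-side sketch of the two approaches (using the pupil data loaded above; the object names `fe` and `re` are illustrative): F estimates one intercept per school, while R replaces them with a single variance parameter.

```{r}
fe <- lm(langPOST ~ IQ_verb + factor(schoolnr), data = DAT) # fixed effects: one dummy intercept per school
re <- lmer(langPOST ~ IQ_verb + (1|schoolnr), data = DAT)   # random effects: a single variance parameter
length(coef(fe)) # many estimated coefficients
VarCorr(re)      # just the intercept variance and the residual variance
```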
A fixed effect and a random effect are two different statistical models used in data analysis, including in the context of meta-analysis and regression analysis. Here is a brief explanation of each:
#### Fixed Effect:
In a fixed effect model, the assumption is that the effects being studied are constant across all levels of the categorical variable(s) being considered. It assumes that the observed differences are due to the characteristics of the entities being studied, rather than being random or varying across different groups. In a fixed effect model, the parameters being estimated are specific to the entities or groups being analyzed. The fixed effect model is appropriate when the goal is to estimate the specific effects of each entity or group and to make inferences about those specific entities or groups.
#### Random Effect:
In a random effect model, the assumption is that the effects being studied are not constant across all levels of the categorical variable(s) being considered. The observed differences are due to a combination of the characteristics of the entities being studied and random variation. The parameters being estimated represent the average effects across all entities or groups being analyzed, as well as the variability between entities or groups. The random effect model is appropriate when the goal is to estimate the average effects across all entities or groups and to make inferences about the population of entities or groups. The choice between fixed effect and random effect models depends on the underlying assumptions, the research question, and the characteristics of the data.
Additional reading: "A basic introduction to fixed-effect and random-effects models for meta-analysis" (PubMed, PMID: 26061376, DOI: 10.1002/jrsm.12).
### Intraclass Correlation Coefficient
To decide whether a multilevel model is necessary and appropriate, use the intraclass correlation coefficient.
If there is no correlation among the observations within a cluster, the cluster means won't differ (no between-group variance).
Start with the empty model (random effects ANOVA) is a model without explanatory variables:
$$
Y_{ij} = \gamma_{00} + U_{0j} + R_{ij}
$$
Variance decomposition:
$$
var (Y_{ij}) = var (U_{0j}) + var (R_{ij}) =\tau^2 + \sigma^2
$$
Can the covariates on the individual level be explained by the group $j$? Check the **intraclass correlation coefficient (ICC)**: the ratio of the between-cluster variance to the total variance. If the ICC is zero, the grouping contributes nothing to the variance.
- Covariance between two individuals ($i \neq k$) in the same group $j$:
$$
cov(Y_{ij} ,Y_{kj}) = var(U_{0j}) = \tau^2
$$
And their correlation is the intraclass correlation coefficient (ICC):
$$
\rho(Y_{ij}, Y_{kj}) = \rho(Y) = \frac{\tau^2}{\tau^2 + \sigma^2}
$$
It is the proportion of variance that is left unexplained when looking only at the single level. The higher the covariance, the more appropriate a multilevel analysis is and the more explanation it can give. The larger the differences between the classes, the higher the intraclass correlation coefficient. → What are the threshold values indicating an RIM or random coefficient model?
- Often between 0.05 and 0.25 in social science research, where the groups represent some kind of social grouping.
### Exercise
- micro units: 3454 pupils
- macro units: 211 schools
- Y is a language test *langPost*
1. Estimate empty model:
```{r}
mod1 <- lmerTest::lmer(langPOST # dependent variable
                       ~ (1|schoolnr), # random intercept: the grouping factor is the school number
                       data = DAT, # data
                       REML = TRUE) # restricted maximum likelihood estimation
summary(mod1) # get summary
```
```{r}
fixef(mod1) # only fixed effects
confint(mod1) # confidence intervals
```
`.sig01` corresponds to $\tau$, the between-school standard deviation; as expected, its confidence interval does not include zero. `.sigma` corresponds to $\sigma$, the residual standard deviation.
REML stands for the [restricted maximum likelihood method](https://en.wikipedia.org/wiki/Restricted_maximum_likelihood). `lmer` uses it by default because numerically fitting a mixed-effects model can be difficult, and REML gives less biased variance estimates than plain maximum likelihood.
The empty model shows how the language test results differ between the schools. The fixed intercept is the estimated true overall mean: 40.9224. The residual variance on the school level is 18.78, and the one on the individual level is 63.20.
2. Derive the ICC from the output. Assess whether it is really necessary to use a multi-level model for analyzing this data set.
- $\tau^2$ is the variance of $U_{0j}$: 18.78
- $\sigma^2$ is the variance of the residuals $R_{ij}$: 63.2
```{r}
18.78 / (18.78 + 63.2)
```
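Instead of typing the numbers by hand, the ICC can also be computed directly from the fitted model object. A minimal sketch (assuming `mod1` from above; `VarCorr()` is the lme4 accessor for the variance components):

```{r}
vc <- as.data.frame(VarCorr(mod1))      # variance components of mod1
tau2 <- vc$vcov[vc$grp == "schoolnr"]   # between-school variance
sigma2 <- vc$vcov[vc$grp == "Residual"] # residual (within-school) variance
tau2 / (tau2 + sigma2)                  # intraclass correlation coefficient
```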
Because the ICC (about 0.23) lies between 0.05 and 0.25, an RIM is appropriate. It indicates that there is variance that can be explained by schools, i.e. a between-group variance.
3. What is the estimated mean of the total population of individual values $Y_{ij}$ and the corresponding standard deviation?
The estimated mean of the total population is 40.9224 (see the summary of mod1: fixed effects, the intercept) with a standard error of 0.3325.
The standard deviation for this total population is
```{r }
# Multiply the SE of the intercept (from the fixed effects) by the square root of the sample size.
0.3325 * (sqrt(3454))
```
4. What is the estimated mean of the population of class means $\beta_{0j}$ and the corresponding standard deviation?
The estimated mean of the class means is the same, $\gamma_{00} = 40.92$, because $E(U_{0j}) = 0$. The corresponding standard deviation is $\tau$, the between-school standard deviation.
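A quick computation of that standard deviation from the school-level variance reported in the summary of mod1:

```{r}
sqrt(18.78) # standard deviation of the population of class means
```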
### Adding Covariates
The model becomes more interesting, when also fixed effects of explanatory variables are included:
$$
Y_{ij} = \gamma_{00} + \gamma_{10}x_{ij} + U_{0j} + R_{ij}
$$
Note the difference between fixed effects of explanatory variables and fixed effects of group dummies.
There are two kinds of parameters:
1. fixed effects: regression coefficients $\gamma$ (just like in OLS regression)
2. random effects: $U_{0j}$, determined by the variance component $\tau_0^2$
### Exercise
1. Add the covariate for IQ into the model on micro-level, **linear model** without multi-level
```{r}
mod2 <- lm(langPOST ~ IQ_verb, data=DAT) #independent micro variable
summary(mod2)
```
2. Estimate also a **multi-level model (RIM)**, and compare both. Discuss!
```{r}
mod3 <- lmer(langPOST ~ IQ_verb + (1|schoolnr), data=DAT)
summary(mod3)
```
We can see that the model gives more explanation if we use the linear mixed-effects model (model 3). The between-group and within-group variances in model 3 are much smaller than in model 1. The scaled residuals in model 3 span a narrower range than the residuals in model 2, which indicates a better model fit.
```{r}
sigma(mod2) # residual standard deviation of the linear model
sigma(mod3) # residual standard deviation of the mixed model
```
In addition, the residual standard deviation decreases, too.
#### More Covariates
The model becomes more interesting, when also fixed effects of explanatory variables are included:
$$
Y_{ij} = \gamma_{00} + \gamma_{10}x_{1ij} + \ldots + \gamma_{p0}x_{pij} + \gamma_{01}z_{1j} + \ldots + \gamma_{0q}z_{qj} + U_{0j} + R_{ij}
$$
- $\gamma_{00}$ is the overall intercept
- $\gamma_{10}x_{1ij} + \ldots + \gamma_{p0}x_{pij}$ are the within-group (micro unit) effects: each individual-level variable $1, \ldots, p$ has its own coefficient
- $\gamma_{01}z_{1j} + \ldots + \gamma_{0q}z_{qj}$ are the between-group (macro unit) effects: each group-level variable $1, \ldots, q$ has its own coefficient
### Between- / Within-Groups
**Which difference are you interested in?**
Difference between within-group and between-group regressions:
- The within-group regression coefficients ($\gamma_{10}, \ldots, \gamma_{p0}$) are the regression coefficients within each group, assumed to be the same across the groups → related to $\sigma^2$
- The between-group regression coefficients ($\gamma_{01}, \ldots, \gamma_{0q}$) are defined as the regression coefficients for the regression of the group means of $Y$ on the group means of $X$ → related to $\tau^2$
- This distinction is essential to avoid ecological fallacies → if you ignore what happens on the lowest level, you will commit ecological fallacies
### Ecological fallacy
(also ecological inference fallacy or population fallacy) is a formal fallacy in the interpretation of statistical data that occurs when inferences about the nature of individuals are deduced from inferences about the group to which those individuals belong.
- Example of this inference and generalization mistake: "All the French eat snails", which is obviously not true.
4 types:
1. confusion between ecological (population) correlations and individual correlations
2. confusion between group average and total average
3. [Simpson's paradox](https://de.wikipedia.org/wiki/Simpson-Paradoxon): Simpson's paradox (named after Edward Hugh Simpson) is a paradox from statistics. The assessment of different groups appears to differ depending on whether the results of the groups are combined or not. This phenomenon occurs when confounding variables are not taken into account in the statistical analysis; ignoring the grouping produces a spurious correlation.
4. confusion between higher average and higher likelihood
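Type 1 (and Simpson's paradox) can be made concrete with a toy simulation, not from the course data: within each group the slope of y on x is positive, while the pooled (ecological) slope is negative. A minimal sketch:

```{r}
set.seed(1)
g <- rep(1:5, each = 50)                 # 5 groups
mu <- c(10, 8, 6, 4, 2)[g]               # group means of x decrease ...
x <- mu + rnorm(250)
y <- 20 - 2 * mu + (x - mu) + rnorm(250) # ... while the within-group slope is +1
coef(lm(y ~ x))                          # pooled slope: negative (ecological view)
coef(lm(y ~ x + factor(g)))              # within-group slope: positive (about +1)
```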
{width="300"}
{width="300"}
In the model with separate effects for the original variable $x_{ij}$ and the group mean $\bar{x}_{.j}$,
$$
Y_{ij} = \gamma_{00} + \gamma_{10}x_{ij} + \gamma_{01}\bar{x}_{.j} + U_{0j} + R_{ij}
$$
- the within-group regression coefficient is $\gamma_{10}$
- the between-group regression coefficient is $\gamma_{10} + \gamma_{01}$
To test the difference between within-group and between-group coefficients, consider the equivalent model with the within-group deviation $(x_{ij} - \bar{x}_{.j})$ instead of the raw variable. Both models are equivalent and have the same fit, with
$$
\tilde{\gamma}_{10} = \gamma_{10}, \qquad \tilde{\gamma}_{01} = \gamma_{10} + \gamma_{01}
$$
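As a sketch of this decomposition in R (the helper columns `IQ_sm` and `IQ_dev` are illustrative names; assumes `DAT` as loaded above):

```{r}
DAT$IQ_sm <- ave(DAT$IQ_verb, DAT$schoolnr) # school mean of IQ_verb
DAT$IQ_dev <- DAT$IQ_verb - DAT$IQ_sm       # within-school deviation
m_raw <- lmer(langPOST ~ IQ_verb + IQ_sm + (1|schoolnr), data = DAT)
m_dev <- lmer(langPOST ~ IQ_dev + IQ_sm + (1|schoolnr), data = DAT)
# Same fit, reparameterized: the IQ_sm coefficient in m_dev equals
# the sum of the IQ_verb and IQ_sm coefficients in m_raw
fixef(m_raw)
fixef(m_dev)
```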
### Model fit with R squares
$R^2$ = 1- (unexplained variance / total variance)
```{r}
multilevelR2(mod3, print = c("RB1", "RB2", "SB", "MVP"))
```
- RB1 and RB2 returns the explained variance at level 1 and level 2, respectively, according to Raudenbush and Bryk (2002, pp. 74 and 79)
- Specifying SB returns the total variance explained according to Snijders and Bosker (2012, p. 112).
- Specifying MVP returns the total variance explained based on "multilevel variance partitioning" as proposed by LaHuis, Hartman, Hakoyama, and Clark (2014).
But note: don't use $R^2$ to decide whether a multilevel model should be used; use the ICC for that.
### Exercise
1. Add a group level effect for IQ using the R commands
```{r create a group mean of IQ for each level}
groupM <- aggregate(DAT$IQ_verb, by=list(DAT$schoolnr), FUN=mean)
colnames(groupM) <- c("schoolnr", "IQ_mean")
DAT <- merge(DAT, groupM, by="schoolnr", all.x = TRUE)
```
With this command we add a group mean for the verbal IQ per school. With this group mean we separate the effects from each other: the intercept for the school and the school mean of verbal IQ explain the between-group variance, while the individual verbal IQ explains the within-group variance.
2. Compare the RIM model with and without this group level variable, with a special focus on the level-2 variance. Explain!
```{r model with group variable}
mod4 <- lmer(langPOST ~ IQ_verb + IQ_mean + (1|schoolnr), data=DAT)
summary(mod3)
summary(mod4)
```
Model 1 (empty model, mod1), model 2 (+ individual intelligence score, mod3), model 3 (+ group mean of the intelligence score, mod4) → more explanation, and the group mean is significant!
```{r}
sigma(mod3)
sigma(mod4)
```
The variance of the random effects decreases from model 3 to model 4, and more is explained. In mod4 the residual variance and the residual standard deviation decrease. The standard errors also decrease slightly.
## Prediction on Random Effects
*only possible after model computation, because only then we have the estimates, we can predict from!*
The random effects $U_{0j}$ are not statistical parameters and therefore they are not estimated as part of the estimation routine. However, sometimes it might be necessary to predict them.
Note: it is often not a good idea, because they are predictions, not estimated parameters!
There are two ways to get them:
1. The ML estimator of $U_{0j}$ is the group mean of the residuals $Y_{ij} - (\hat{\gamma}_{00} + x'_{ij}\hat{\gamma})$ → we have estimated the parameters from the available data. Taking group means of the residuals works because their expected value is zero. (See @sec-ml.ex.)
2. By the empirical Bayes method; the resulting predictions are the posterior means. The posterior mean for group $j$ is based on two kinds of information:
- sample information: the data in group $j$
- population information: the value $U_{0j}$ was drawn from a normal distribution with mean $0$ and variance $\tau^2$
Suppose we wish to predict the 'true group mean' $\gamma_{00} + U_{0j}$. For simplicity, consider the empty model (without coefficients): $Y_{ij} = \beta_{0j} + R_{ij} = \gamma_{00} + U_{0j} + R_{ij}$ → estimating $\beta_{0j}$ and $U_{0j}$ are equivalent problems, given that $\hat{\gamma}_{00}$ is available.
The empirical Bayes estimate (combined estimate) is a weighted average of the group mean and the overall mean:
$$
\hat{\beta}^{EB}_{0j} = \lambda_j \overline{Y}_{.j} + (1 - \lambda_j)\hat{\gamma}_{00}
$$
where the weight $\lambda_j$ is the 'reliability' of the mean of group $j$ (with $n_j$ the related group size):
$$
\lambda_j = \frac{\tau^2_0}{\tau^2_0 + \sigma^2/n_j} = \frac{n_j\rho_I}{1 + (n_j - 1)\rho_I}
$$
The reliability coefficient indicates how well the true mean $\gamma_{00} + U_{0j}$ is measured by the observed mean $\overline{Y}_{.j}$.
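A minimal sketch of these reliabilities in R, reusing the variance components of the empty model `mod1` (the object names `tau2`, `sigma2`, and `lambda` are illustrative):

```{r}
vc <- as.data.frame(VarCorr(mod1))
tau2 <- vc$vcov[vc$grp == "schoolnr"]   # tau_0^2
sigma2 <- vc$vcov[vc$grp == "Residual"] # sigma^2
n_j <- table(DAT$schoolnr)              # group sizes
lambda <- tau2 / (tau2 + sigma2 / n_j)  # reliability of each observed school mean
summary(as.numeric(lambda))
```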
{width="300"}
Empirical Bayes 'estimates' (EB) are biased, but the bias gets smaller with increasing group sizes.
We see that the EB are "shrunk towards the mean". (EB are also called shrinkage estimators.)
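The shrinkage can be visualized by plotting the raw group-mean deviations against the EB predictions; a sketch using the empty model `mod1`:

```{r}
raw <- tapply(DAT$langPOST, DAT$schoolnr, mean) - fixef(mod1)[1] # raw group-mean deviations
eb <- ranef(mod1)$schoolnr[[1]]                                  # EB predictions
plot(raw, eb, xlab = "raw group-mean deviation", ylab = "EB prediction")
abline(0, 1, lty = 2) # EB values lie inside the 45-degree line: shrunk towards 0
```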
{width="300"}
→ checking for standard errors is very important! Therefore, compute the comparative and diagnostic standard errors.
**Comparative** (use: comparison of random effects of different level-2 units):
- deviation from the unobserved random variable $U_{0j}$
$$
S.E_{comp}(\hat{U}^{EB}_{0j}) = S.E. (\hat{U}^{EB}_{0j} - U_{0j})
$$
how well the unobserved level-2 contribution $U_{0j}$ can be 'estimated' from the data
**Diagnostic** (use: model checking, i.e. checking normality of the level-two residuals)
- Expresses the deviation from $0$, which is the overall mean of the $U_{0j}$
$$
S.E._{diag}(\hat{U}^{EB}_{0j}) = S.E. (\hat{U}^{EB}_{0j})
$$
- can be seen as standardized residuals having a standard normal distribution if the model is correct
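In lme4 the conditional (posterior) variances behind such standard errors can be extracted with `condVar = TRUE`; a minimal sketch using `mod4` fitted above:

```{r}
re <- ranef(mod4, condVar = TRUE)
pv <- attr(re$schoolnr, "postVar") # conditional variances, one per school
se_comp <- sqrt(pv[1, 1, ])        # comparative standard errors of the intercepts
head(se_comp)
```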
### Exercise
Estimates of random effects are obtained with `ranef`. This function gives the conditional modes of the random effects of a fitted model object → it is the difference between the population-level average predicted response for a given set of fixed-effects values and the response predicted for a particular individual.
- You can think of these as individual-level effects, i.e. how much does any individual differ from the population?
- They are also, roughly, equivalent to the modes of the Bayesian posterior densities for the deviations of the individual group effects from the population means
→ important to keep in mind: not parameters, so no standard errors are defined.
1. Predict the school-level random effects (EB, for all N = 211 schools), using the R command `ranef(model)` ('model' is name of estimated model in R)
```{r}
be_u <- ranef(mod4)$schoolnr[[1]] # EB predictions of the random intercepts, one per school, from model 4
plot(1:length(be_u), sort(be_u), pch=20, xlab="schoolNr", ylab=expression('U'['0j'])) # plot the EB predictions for the schools, sorted from smallest to largest
mlm <- lm(sort(be_u) ~ c(1:length(be_u))) ## to add a line: estimate a linear model first
abline(a=coef(summary(mlm))[1,1], b=coef(summary(mlm))[2,1]) ## then use intercept and slope of that model to plot the line
```
On the vertical axis $U_{0j}$ is displayed, the group-dependent deviation from the average, for each school on the x-axis. If a linear regression fitted the estimates perfectly, we would have a perfect normal distribution, because the group-dependent variance is centered around zero. Looking at schools around the median (nearly 50% of the ranked schools), they are centered around $U_{0j} = 0$ → a good sign for a random effect.
2. Plot the confidence intervals to show the differences within the school:
```{r}
plot_model(mod4, type = "re", col="blue",
title="Random Effects on School Level")
```
This plot shows the EB estimates for each of the 211 schools in the data set and their confidence intervals. We can see that there are varying means for each school and varying confidence intervals. This indicates variance on the school level and supports a multilevel analysis.
Important: this is not a robustness check, but rather a diagnostic of whether the method is appropriate! It belongs in the appendix, not in the paper itself!
## Random Slope Model
It is possible that not only the group average of Y, but also the effect of X on Y is randomly dependent on the group. That is, in
$$
Y_{ij} = β_{0j} + β_{1j}x_{ij} + R_{ij}
$$
it is
$$
\beta_{0j} = \gamma_{00} + U_{0j}\\
\beta_{1j} = \gamma_{10} + U_{1j}
$$
Example: gender has a different effect on the grade in different classes.
This gives (X has a random slope):
$$
Y_{ij} = \gamma_{00} + \gamma_{10}x_{ij} + U_{0j} + U_{1j}x_{ij} + R_{ij}
$$
- $\gamma_{00}$: intercept for the overall population; $\gamma_{10}x_{ij}$: fixed-effects part (here, the fixed effect of gender)
- $U_{0j}$: random effect, the group-dependent deviation of the class from the intercept
- $U_{1j}x_{ij}$: random effect of the class on the fixed effect of gender
Because we would otherwise get too many parameters, we need **assumptions**:
- The group-dependent coefficients ($U_{0j}$, $U_{1j}$) are independent across $j$, with a bivariate normal distribution with expected values $(0, 0)$ and a covariance matrix defined by the following. Variance of the random intercepts of the level-2 units (schools):
$$
var(U_{0j}) = \tau_{00} = \tau^2_0
$$
Variance of the random slopes, i.e. of the group effect on the individual $x$ effect (IQ):
$$
var(U_{1j}) = \tau_{11} = \tau^2_1
$$
Covariance of the random intercepts and the random slopes:
$$
cov(U_{0j},U_{1j}) = \tau_{01}
$$
This yields: A linear model for the mean structure, and a parameterized covariance matrix within groups with independence between groups.
#### Exercise {#sec-slope.ex}
1. Add a random slope for IQ to your last model. For this, have a look at the help of the lme4 package to see how this works.
```{r}
mod5 <- lmer(langPOST ~ IQ_mean + IQ_verb + (IQ_verb|schoolnr), # (IQ_verb|schoolnr) adds a random slope for IQ_verb in addition to the random intercept
data=DAT)
summary(mod5)
```
Model 4 and model 5 have very similar estimates with very similar standard errors and p-values, BUT:
- The random intercept variance of the groups is slightly smaller in model 4 (9.119, Std. Dev. 3.02) than in model 5 (9.3268, Std. Dev. 3.054)
- The standard error for IQ_verb in model 4 (0.058) is smaller than in model 5 (0.06), and so is the standard error for the intercept
- Model 5 does not decrease the variance, so it is not an improvement in model fit, although it is more complex.
2. Write the regression equation for the estimated model.
$$
Y_{ij} = \gamma_{00} + \gamma_{10}x_{ij} + \gamma_{01}z_j + U_{0j} + U_{1j}x_{ij} + R_{ij}\\
langPOST_{ij} = \gamma_{00} + \gamma_{10}\,IQverb_{ij} + \gamma_{01}\,IQmean_{j} + U_{0j} + U_{1j}\,IQverb_{ij} + R_{ij}\\
\widehat{langPOST}_{ij} = 41.03 + 2.4864\,IQverb_{ij} + 1.030\,IQmean_{j} + U_{0j} + U_{1j}\,IQverb_{ij} + R_{ij}
$$
3. What is the average of the slope for IQ, and what is the related standard deviation?
In model 5 the average of the slope for IQ is 2.46; this is the estimate for IQ_verb. The standard deviation is:
```{r }
# Multiply the SE of the IQ_verb slope (from the fixed effects) by the square root of the sample size.
0.06687 * (sqrt(3454))
```
4. Plot the regression lines for all N=211 schools. What do you find concerning the variance of Y (language test score) related to X (IQ)? (One option for plotting: use the 'predict' function for each school separately and plot the predicted Y s along X.)
```{r}
schNr <- unique(DAT$schoolnr)
plot(0, xlim= range(DAT$IQ_verb), ylim=range(DAT$langPOST), col="white", xlab="IQ", ylab="langScore")
for(i in 1: length(schNr)) {
pr <- predict(mod5, newdata=DAT[DAT$schoolnr %in% schNr[i],], type="response")
iq <- DAT[DAT$schoolnr %in% schNr[i],"IQ_verb"]
res <- cbind(iq,pr)
lines(res[,1], res[,2])
}
abline(v=0)
```
In the plot you can see a regression line per school for the relation between verbal IQ and the score in the language test. While for lower verbal IQs there is more variance in the language test score, there is less variance for pupils with a higher IQ. This could be an indicator that pupils with a higher IQ have a higher probability of getting higher language test scores, while the scores for pupils with lower IQs depend more on which school they attend.
This is a sign of heteroscedasticity.
## Cross-Level Interaction
What can we find beyond the random slope?
We add level-2 covariates:
Intercept in the specific group: described by the grand mean $\gamma_{00}$, the random term $U_{0j}$, and a term from the other level, $\gamma_{01}z_{j}$:
$$
β_{0j} = γ_{00} + γ_{01}z_{j} + U_{0j}
$$ Coefficient of the slope: described by the grand mean $\gamma_{10}$, the random term $U_{1j}$, and a term from the other level, $\gamma_{11}z_{j}$:
$$
β_{1j} = γ_{10} + γ_{11}z_{j} + U_{1j}
$$
Substitution yields
$Y_{ij} = (\gamma_{00} + \gamma_{01}z_j + U_{0j})$ (intercept part) $+ (\gamma_{10} + \gamma_{11}z_j + U_{1j})x_{ij}$ (slope part) $+ R_{ij}$ (residual part)
$= \gamma_{00}$ (grand mean) $+ \gamma_{01}z_j$ (fixed effect of, e.g., the teacher's gender on the score) $+ \gamma_{10}x_{ij}$ (fixed effect of, e.g., the pupil's gender, the slope part) $+ \gamma_{11}z_{j}x_{ij}$ (cross-level interaction effect: fixed effect of the teacher's gender combined with the pupil's gender on the score) $+ U_{0j}$ (random intercept part) $+ U_{1j}x_{ij}$ (random slope part of gender) $+ R_{ij}$ (residual part)
Note:
- Random effects are still in the model; they don't have to be included, but they can be!
- The two-level equation is brought into one long equation.
### Exercise {#sec-cross.ex}
1. Add a cross-level interaction effect for $IQ$ and $\overline{IQ}$ to your last model. (Hint: mind that this is still an interaction effect, commonly quantified by the product of two variables.)
```{r}
mod6 <- lmer(langPOST ~ IQ_mean #group level effect
+ IQ_verb #individual effect and main predictor
+ IQ_mean*IQ_verb #interaction effect
+ (IQ_verb|schoolnr), #group effect on the individual effect
data=DAT)
summary(mod6)
```
2. What does this effect tell you in the considered case?
Including the cross-level interaction effect decreases the variance of the random effects in comparison to models 4 and 5. Especially the variance for IQ_verb is very small now. It's a slightly better model than model 4.
What does the model tell us? The higher the mean IQ in a school, the less the individual IQ counts for the score, and vice versa.
### Exercise: Model comparison
1. Global intercept model / empty model
```{r}
m1 <- lm(langPOST ~ 1, data=DAT)
summary(m1)
```
The intercept in the global intercept model is 41.32. This is the mean score for the sample.
See:
```{r}
DAT %>% summarize(avg=mean(langPOST), sd=sd(langPOST), se= sd/sqrt(3454))
```
2. Two Explanatory variables, IQ and ses on level 1
```{r}
m2 <- lm(langPOST ~ IQ_verb + ses, data =DAT)
summary(m2)
```
Both explanatory variables and the intercept are significant. The intercept is close to the overall intercept.
- indicates a correlation between verbal IQ and language test score: with each additional verbal IQ point, the score in the language test increases by 2.36.
- indicates a correlation between socio-economic background and language test score: with each additional point on the ses scale, the score increases by 0.15.
3. Two explanatory variables, IQ and SES, on level 2 (where $\overline{\cdot}$ denotes the group mean)
```{r}
#add group mean for ses to the data set
groupM <- aggregate(DAT$ses, by=list(DAT$schoolnr), FUN=mean)
colnames(groupM) <- c("schoolnr", "SES_mean")
DAT <- merge(DAT, groupM, by="schoolnr", all.x = TRUE)
m3 <- lm(langPOST~ IQ_mean + SES_mean, data=DAT)
summary(m3)
```
On level 2 we have significant effects, too. The mean verbal IQ of a school and the mean ses of a school also have significant effects on the language test score.
→ This indicates that the higher the mean verbal IQ or the mean ses of a school, the higher the language test scores, and vice versa.
4. An interaction effect for IQ and SES on level-1
```{r}
m4 <- lm(langPOST ~ IQ_verb + ses + IQ_verb*ses, data = DAT)
stargazer(m2,m4, type="text", out = "ml.html")
```
Overall, looking at $R^2$ and the adjusted $R^2$, model 4 has a slightly better fit for explaining the variance on level one. All explanatory variables are still significant.
It is very interesting: the positive effect of IQ_verb decreases slightly, while the positive effect of ses increases slightly when the interaction term is included. This means that an increase in ses decreases the significance and the effect of IQ, and vice versa.
→ It could be a hint that with a high-level socio-economic background a low IQ has less effect, or that with a low-level ses a high verbal IQ has less effect on the test score.
5. An interaction effect for IQ and SES on level-2
```{r}
m5 <- lm(langPOST~ IQ_mean + SES_mean + IQ_mean*SES_mean, data=DAT)
stargazer(m3,m5, type="text", out = "ml.html")
```
Both models have a small $R^2$ and consequently little explanatory power on their own. However, the model with the interaction term has a higher $R^2$, which is an indicator of better model fit.
Adding the interaction term decreases the effects of the mean IQ and the mean SES of a school. As on level 1, there is a negative and significant effect of the interaction term.
→ This one is a bit tough to interpret: if the mean IQ in one school is higher than in another, a higher mean ses has a lower effect on the language test score than in the other school. If the mean ses in a school is very high, the mean IQ of this school will have a lower effect than in a school with a lower mean ses.
6. a cross-level interaction effect for $IQ$ and $\overline{IQ}$
See @sec-cross.ex Cross-level interaction:
```{r}
m6 <- lmer(langPOST ~ IQ_mean #group level effect
+ IQ_verb # individual effect of IQ
+ IQ_mean*IQ_verb #interaction effect
+ (IQ_verb|schoolnr), #group effect on the individual effect
data=DAT)
summary(m6)
```
We can see that the mean IQ of a school and the IQ of an individual both have positive effects on the language test score.
The following hypotheses can be deduced from the results:
- The higher the verbal IQ, the higher the probability of better language score results
- The higher the average IQ in a school, the higher the probability for pupils to get better language score results. This could be explained by the mechanism that in schools where the average IQ is high, the individual IQ of a majority of pupils is also very high.
- The cross-level interaction effect is negative: if the mean IQ is higher in one school, there is a lower effect of the individual IQ, and vice versa.
- That makes sense, because grades and test results are often relational: if there are pupils with a very high IQ in one class, a high IQ is not that outstanding and getting good scores is more difficult.
7. A cross-level interaction effect for $ses$ and $\overline{SES}$
```{r}
m7 <- lmer(langPOST ~ SES_mean #group level effect
+ ses #individual effect of ses
+ SES_mean*ses #interaction effect
+ (ses|schoolnr), #group effect on the individual effect
data=DAT)
summary(m7)
```
We can see that the mean SES of a school and the ses of an individual both have positive effects on the language test score, but only the individual ses is significant and moderately high. The cross-level interaction effect is negative and very small.
- The higher the ses of a pupil, the higher the probability of good test scores.
- If the school has a high mean ses, the effect of the individual ses on test scores decreases. This makes sense because, like intelligence, the socio-economic background is perceived relationally within a school.
8. a cross-level interaction effect for $IQ$ and $\overline{SES}$
```{r}
m8 <- lmer(langPOST ~ SES_mean #group level effect
+ IQ_verb #individual effect of IQ
+ SES_mean*IQ_verb #interaction effect of the SES in a school on the individual IQ
+ (IQ_verb|schoolnr), #group effect on the individual effect
data=DAT)
summary(m8)
```
Without controlling for individual ses, we have a positive effect of ses on the school level and a positive effect of the individual IQ, without controlling for school-level IQ. Both effects and the cross-level interaction effect are significant. The cross-level interaction effect is negative.
- The higher the average ses of pupils in a school, the higher the probability of high test results (not controlled for individual values of ses)
- The higher the IQ, the higher the probability of high test results (not controlled for group-level values of IQ)
- If the average ses in a school is high, the effect of the individual IQ on the probability of high test results decreases, and vice versa
9. A cross-level interaction effect for $ses$ and $\overline{IQ}$
```{r}
m9 <- lmer(langPOST ~ IQ_mean #group level effect
+ ses #individual effect of ses
+ ses*IQ_mean #interaction effect of the IQ n a school on the individual ses
+ (ses|schoolnr), #group effect on the individual effect
data=DAT)
summary(m9)
```
There are positive, significant effects of the mean IQ of a school and the individual ses on the language test score, and a negative, significant cross-level interaction effect.
- The higher the average IQ of pupils in a school, the higher the probability of high test results (not controlled for individual values of IQ)
- The higher the individual ses, the higher the probability of high test results (not controlled for group-level values of ses)
- If the average IQ in a school is high, the effect of the individual ses on the probability of high test results decreases, and vice versa
10. a random intercept on school level and a random slope for IQ
See @sec-slope.ex Random Slope Model:
```{r}
m10 <- lmer(langPOST ~ IQ_mean + IQ_verb + (1+IQ_verb|schoolnr), # random intercept plus a random slope for IQ_verb
data=DAT)
summary(m10)
```
In this model not only the intercept is random, but also the slope; the slope is determined by the individual IQ, too.
This means that not only the individual IQ and the school-level IQ have an effect on the test results, but also that the effect of IQ differs between classes.
If I had to choose one of these models, I would go with the last one. To me, it seems a bit neglectful to measure cross-level interactions without taking the individual effects of the explanatory variables and the group-level effects into account. This is especially true for model 9. Model 6 gives hints that the individual effect of IQ is critical for better scores and is weakened by the mean IQ. Consequently, testing whether the mean IQ explains better language test results without regarding the individual level appears a bit strange and contradictory. What assumptions should be derived from that? That an overall higher IQ level in a school causes better language tests for the individuals? It is more far-fetched to assume that pupils with higher verbal IQs help the others out, and that the language test results in those schools are therefore better, than to assume that a school with many high-IQ pupils will have better test results than a school with fewer high-IQ pupils because of the individual IQ levels.
The last model also has the lowest variance in the random-effect residuals.
Another option is to compute random intercept models and random slope models combining SES and IQ as explanatory variables. As always, it depends on theory which model makes sense, and then one tests for statistical inference. For example, behind model 4 lies the assumption that if you have a higher ses, your IQ is less important for test results.
## Statistical Testing
For the fixed parts, the $\gamma$'s, use the t-test with the following test statistic (comparing the mean between groups):
The estimates come from the sample data and are consequently not the true scores → estimates depend on the data and are not taken as true.
$$
T(\gamma_h) = \frac{\hat{\gamma}_h}{S.E.(\hat{\gamma}_h)}
$$
(Use the F-test (a test for variance between samples or groups) or the Wald test for testing several parameters simultaneously.) The standard errors should be based on REML estimation → less biased and really recommended.
Degrees of freedom for the t-test / the denominator of the F-test
- For a level-1 variable: $N − r − 1$, where N = total number of level-1 units, r = number of tested level-1 variables
- For a level-2 variable: $M − q − 1$, where M = number of level-2 units, q = number of tested level-2 variables
- For a cross-level interaction: M − q − 1, where q = number of other level-2 variables interacting with the level-1 variable
- Note: If d.f. ≥ 40, the t-distribution can be replaced by a standard normal
→ the more variables I add, the less power my model is going to have
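In R, this t statistic is exactly what the summary of an lmerTest model reports for each fixed effect (with Satterthwaite degrees of freedom); a quick way to inspect just that table, using `mod4` fitted above:

```{r}
coef(summary(mod4)) # columns: Estimate, Std. Error, df, t value, Pr(>|t|)
```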
### Exercise
Extended model like in the R script:
```{r}
mod7 <- lmer(langPOST ~ IQ_verb + ses +
IQ_mean + SES_mean +
IQ_verb*ses +
IQ_mean*SES_mean +
IQ_mean*IQ_verb +
SES_mean*ses +
IQ_mean*ses +
SES_mean*IQ_verb +
(IQ_verb|schoolnr), data=DAT)
summary(mod7)
```
1. & 2. Based on a t-test: Which explanatory variables should stay in the model?, Based on a t-test: Which interaction effects should stay in the model?
Significance indicated by p can be taken out of the models → using the p-values from the t-test.
The `lmertest` test the linear models and give the p-values for the explanatory variables. It computes for fixed effects the significance with t-tests. Looking at the results displayed in the summary the following variables and interaction effects should stay in the model:
effects on the individual level:
- IQ_verb
- ses
effects on the school level:
- IQ_mean
- SES_mean
interaction effects:
- IQ_verb:ses
- IQ_mean:SES_mean
3. Reduce your model according to your findings.
```{r}
mod8 <- lmer(langPOST ~ IQ_verb + ses +
IQ_mean + SES_mean +
IQ_verb*ses +
IQ_mean*SES_mean +
(IQ_verb|schoolnr), data=DAT)
summary(mod8)
```
In the reduced model all variables are still significant. SES_mean has only a significance level of 0.055, so we could think about dropping it.
→ The individual IQ has a positive, significant effect on the score: if the IQ is one point higher, the language score will probably increase by 2.17. The ses also has a positive, significant effect on the score: if the ses increases by one point, the score increases by 0.17.
→ If the average IQ of a school is higher, the language test scores are probably higher, too. And if the average SES of a school is higher, the score will increase, too.
→ There is an interaction effect between IQ and ses on the individual and on the school level: the positive, significant effect of IQ is dampened by the ses, and vice versa. A low ses dampens the positive effect of a high IQ on the language test score. OR: a higher ses compensates for low levels of IQ.
→ There is also an interaction effect on the school level. This could be because of the strong correlation between IQ_mean and SES_mean.
Compared to all other models, model 8 has the smallest variance on the school level and in the residuals, and will consequently have a higher ICC than the other models.
## Testing random effects with the likelihood-ratio test (LRT)
The difference between linear regression and the HLM lies in the $U_{0j}$, i.e. the $\tau^2$.
- Assesses goodness of fit of two nested statistical models (model 1 and model 2) based on the ratio of their likelihoods
- Models are considered as [nested](https://joshuawiley.com/MonashHonoursStatistics/LMM_Comparison.html), when one model is a restricted or constrained version of another model
- The data are fixed; we can compute different models. The likelihood $\Lambda(\theta)$ gives the probability of the observed sample under the parameters $\theta$, allowing us to compare different models
- Model 2 has parameter space $\theta \in \Theta$, with $\theta$ unknown / to be estimated
- H0: $\theta$ lies in the subset $\Theta_0$ of $\Theta$\
- The simpler model 1 has parameter space $\Theta_0$, the more complex model 2 the parameter space $\Theta$
- Null hypothesis: model 2 reduces to the linear regression, so the $U_{0j}$ of model 2 are not necessary → the more complex model is equal to the simpler model
- The more complex model 2 can be transformed into the simpler model 1 by imposing constraints on model 2's parameters: model 1 is nested in model 2 → model 1 is the nested model and model 2 the tested model, because in model 1 some parameters are set to 0, making it simpler
- Usually: model 2 is found by maximization over the entire parameter space, and model 1 after imposing some constraints
→ key question: what is tested (model 2) and what is nested (model 1)?
Note: Model 1 doesn't have to be a linear regression, could also be a simpler HLM compared to a more complex model 2
Note: If two models are nested, then we have the most options in terms of comparing the two models. For example, we can evaluate whether Model A is a statistically significantly better fit than is Model B using a [**Likelihood Ratio Test (LRT)**](https://joshuawiley.com/MonashHonoursStatistics/LMM_Comparison.html)
- The LR test tests whether the related likelihood ratio is significantly different from one
- LR test statistic is the deviance difference:
$$
\lambda_{LR} = -2 \ln \frac{\sup_{\theta \in \Theta_0} \Lambda(\theta)}{\sup_{\theta \in \Theta} \Lambda(\theta)}
$$
$$
\lambda_{LR} = -2\left[l(\hat{\theta}_0) - l(\hat{\theta})\right]
$$
where $l(\hat{\theta})$ is the logarithm of the maximized likelihood function $\Lambda$ (under model 2), and $l(\hat{\theta}_0)$ is the maximal value if H0 is true (i.e. under model 1).
Higher likelihood means better fit of the data.
{width="300"}
The deviance difference is the difference between the highest points of the two likelihood curves; in this case it is 2. If the difference is zero, we don't need the more complex model and retain H0.
H0 being true means that model 2 (the tested model) does not fit the data significantly better than model 1 (the nested model).
Multiplication by $-2$ ensures that $\lambda_{LR}$ asymptotically converges to a $\chi^2$ distribution under H0.
A small p-value (from the $\chi^2$ distribution) indicates that H0 is wrong, i.e. evidence for model 2.
#### Deviances and $\lambda_{LR}$
We have (with $\hat{\theta}_s$ denoting the fitted parameters of the saturated model, which includes as many parameters as possible):
- Deviance of model 1: $D_1 = -2(l(\hat{\theta}_s) - l(\hat{\theta}_1))$, with $\hat{\theta}_1$ the fitted parameters of model 1 and $|\theta_1| = m_1$
- Deviance of model 2: $D_2 = -2(l(\hat{\theta}_s) - l(\hat{\theta}_2))$, with $\hat{\theta}_2$ the fitted parameters of model 2 and $|\theta_2| = m_2$, $m_1 < m_2$
Then the test statistic for testing model 2 against model 1 is:
$$
\hat{\lambda}_{LR} = -2(l(\hat{\theta}_1) - l(\hat{\theta}_2)) = D_1 - D_2
$$
When the deviance difference is zero or near zero, we don't need the multilevel model.
If the null hypothesis of no difference is true in the population, then $\lambda_{LR}$ will follow a chi-squared distribution with degrees of freedom equal to the number of parameters constrained to 0 in model 1 relative to model 2, i.e. the difference in the number of parameters used by each model:
$$
\lambda_{LR} \sim \chi^2(m_2 - m_1)
$$
Thus we often use a chi-square distribution in the LRT to look up the p-value. Finally, note that because LRTs are based on the log likelihoods (LL) from a model, we need true log likelihoods for the LRT to be valid. Therefore, we cannot use restricted maximum likelihood; we need to use maximum likelihood estimation.
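A minimal sketch of this test in R, comparing the random-intercept model `mod4` with the random-slope model `mod5` fitted earlier. `anova()` on lmer fits refits both models with ML automatically; note that because variance parameters are tested at the boundary of their parameter space, the $\chi^2$ p-value is conservative:

```{r}
anova(mod4, mod5) # LRT: random intercept (nested model 1) vs. random slope (tested model 2)

# Equivalent by hand: refit with ML and compute D1 - D2
m_ri <- update(mod4, REML = FALSE)
m_rs <- update(mod5, REML = FALSE)
lr <- -2 * (logLik(m_ri) - logLik(m_rs)) # deviance difference
pchisq(as.numeric(lr), df = 2, lower.tail = FALSE) # 2 extra parameters: slope variance + covariance
```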