lectures/hierarchical_modelling/hierarchical_modelling.qmd

---
title: "Hierarchical Bayesian modelling"
subtitle: ""
author:
 - name: "Elizaveta Semenova"
   email: "elizaveta.p.semenova@gmail.com"
institute: "Computer Science Department, University of Oxford"
date: 2024-08-21
date-format: medium
title-slide-attributes:
  data-background-color: "#f3f4f4"
  data-background-image: "../../assets/bmeh_normal.png"
  data-background-size: 80%
  data-background-position: 60% 120%
format:
  revealjs:
    slide-number: true
    incremental: true
    chalkboard:
      buttons: false
    preview-links: auto
    logo: "../../assets/bmeh_normal.png"
    theme: [default, ../../assets/style.scss]
---

# Outline

- Assumptions of linear models
- Why are linear models not enough?
- Two extremes of hierarchy: <ins>complete pooling</ins> and <ins>no pooling</ins>
- The golden middle: <ins>partial pooling</ins>
- Random effects
- Types of nestedness

## Assumptions of linear models

- Homoskedasticity

<center>
![](assets/dogs1.jpeg){width=40%}
</center>

## Assumptions of linear models

- Homoskedasticity
  - equal or similar variances in different groups
  - sometimes it is appropriate
  - ![](assets/dogs1.jpeg){width=30%}


## Assumptions of linear models

- Homoskedasticity?

<center>
![](assets/dogs1.jpeg){width=40%}
![](assets/dogs4.jpeg){width=40%}
</center>

## Assumptions of linear models
- Homoskedasticity: equal or similar variances in different groups
- No error in predictors *x*
- No missing data
- Normally distributed errors
- Observations *$y_i$* are independent

## We want
- To model variance
- To capture errors in variables
- To handle missing data
- To use generalised linear models (GLMs)
- To use spatial and/or temporal error structure

# Hierarchical models

Exist on the continuum of two extreme cases:

- <ins>complete pooling</ins>
- <ins>no pooling</ins>


## Complete pooling
<center>
![](assets/pooled_model.png){width=60%}
</center>

- A *pooled* model implies that all the data are sampled from the same model.
- This ignores all variation among the units being sampled.
- All observations share a common parameter $\theta$.

<font size="1"> Image: courtesy of Chris Fonnesbeck</font>


## Exercise

- Complete pooling?

<center>
![](assets/dogs1.jpeg){width=30%}
![](assets/dogs4.jpeg){width=30%}
</center>

## Exercise

- Complete pooling:

<center>
![](assets/dogs1.jpeg){width=30%}
![](assets/dogs4.jpeg){width=30%}
</center>

- It ignores differences between measurement blocks and does not acknowledge block-to-block variability.
- Yet observations within the same unit tend to be more similar to each other than to observations from other units.

## No pooling

<center>
![](assets/unpooled_model.png){width=60%}
</center>

- An *unpooled* model implies that the data are sampled from separate models with independent parameters.
- Parameter $\theta_i$ is different for each observation.

<font size="1"> Image: courtesy of Chris Fonnesbeck</font>

## Exercise

- No pooling?

<center>
![](assets/dogs1.jpeg){width=30%}
![](assets/dogs4.jpeg){width=30%}
</center>

- Captures variability, but not the overall pattern.
- Assumes that there is no relationship between the models.
- Does not provide ways to extrapolate.
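
## Sketch: the two extremes in code

The two extremes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the original analysis: the group names and measurements in `groups` are made up, and "no pooling" is applied at the group level (one mean per group) rather than with one parameter per observation.

```python
from statistics import mean

# Hypothetical measurements (e.g. weights in kg) for three groups of dogs.
groups = {
    "breed_a": [4.1, 3.9, 4.4, 4.0],
    "breed_b": [30.2, 28.7, 31.5],
    "breed_c": [11.8],  # a group with a single observation
}

# Complete pooling: one common parameter estimated from all observations.
all_obs = [y for ys in groups.values() for y in ys]
pooled_mean = mean(all_obs)

# No pooling: an independent parameter per group, estimated from that group alone.
unpooled_means = {g: mean(ys) for g, ys in groups.items()}

print(f"complete pooling: {pooled_mean:.2f}")
for g, m in unpooled_means.items():
    print(f"no pooling, {g}: {m:.2f}")
```

Complete pooling returns a single number that matches none of the group means, while no pooling gives the single-observation group an estimate that rests entirely on one data point.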


## Partial pooling
<center>
![](assets/partial_pooled_model.png){width=60%}
</center>

- In a *partially pooled* (also called *multilevel* or *hierarchical*) model, the parameters are viewed as a sample from a common distribution of parameters.

## Hierarchical models

- Hierarchical models fill in the continuum between the two extremes.
- They allow us to estimate a model for each measurement block, so that each dataset is fit with its own model.
- There is a higher-level model, a *hierarchical model*, describing the variability in the parameters of those sub-models.

## Exercise

- Partial pooling?

<center>
![](assets/dogs1.jpeg){width=30%}
![](assets/dogs4.jpeg){width=30%}
</center>

- Variability can be captured at different scales.
- We can make predictions.
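
## Sketch: partial pooling as shrinkage

For a normal model, partial pooling can be illustrated as a precision-weighted average of each group's own mean and the population mean. The sketch below is a simplification under stated assumptions: the variances are treated as known (`sigma2` within groups, `tau2` between groups), the grand mean is plugged in for the population mean `mu`, and the data are made up. A full hierarchical model would estimate all three quantities.

```python
from statistics import mean

# Hypothetical measurements for three groups (made-up values).
groups = {
    "breed_a": [4.1, 3.9, 4.4, 4.0],
    "breed_b": [30.2, 28.7, 31.5],
    "breed_c": [11.8],
}

sigma2 = 4.0    # assumed known within-group variance
tau2 = 100.0    # assumed known between-group variance
mu = mean(y for ys in groups.values() for y in ys)  # plug-in population mean


def partial_pooling_mean(ys, mu, sigma2, tau2):
    """Posterior mean of a group mean under a normal prior N(mu, tau2)."""
    data_precision = len(ys) / sigma2
    prior_precision = 1.0 / tau2
    weight = data_precision / (data_precision + prior_precision)
    # Weighted average: the more data a group has, the closer to its own mean.
    return weight * mean(ys) + (1.0 - weight) * mu


estimates = {g: partial_pooling_mean(ys, mu, sigma2, tau2)
             for g, ys in groups.items()}

for g, ys in groups.items():
    print(f"{g}: group mean {mean(ys):.2f} -> partially pooled {estimates[g]:.2f}")
```

Groups with fewer observations are pulled more strongly towards the population mean; groups with many observations stay close to their own mean.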


## Example
Normal model: common mean

$$y_i \sim N(\mu, \sigma^2), \quad i=1, \dots, N$$

## Example

Normal model: common variance

$$y_i \sim N(\mu_i, \sigma^2) $$
$$\mu_i \sim N(\mu, \tau^2) $$
$$\sigma^2 \sim IG(\alpha, \beta) $$

Instead of assuming that $\mu$ and $\tau$ are *fixed known values*, we assume they are *unknown parameters* that need to be estimated.

## Example

Normal model: common variance

$$y_i \sim N(\mu_i, \sigma^2) $$
$$\mu_i \sim N(\mu, \tau^2) $$
$$\sigma^2 \sim IG(\alpha, \beta) $$

Hyperpriors:

$$\mu \sim N(\mu_0, v_0) $$
$$\tau^2 \sim IG(t_1, t_2) $$
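
## Sketch: simulating from the hierarchy

Reading the model from the top down gives a recipe for simulating data: draw the hyperparameters, then the unit-level means, then the observations. The sketch below does this with Python's standard library. It rests on several assumptions that are not stated in the slides: the hyperparameter values are arbitrary placeholders, $v_0$ is taken to be a variance, and $IG(a, b)$ is read as the shape and scale parameterisation.

```python
import math
import random


def inverse_gamma(rng, shape, scale):
    """Draw from IG(shape, scale) as the reciprocal of a Gamma(shape, rate=scale) draw."""
    # random.gammavariate takes (shape, scale), so rate=scale means scale=1/scale.
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


def simulate(rng, n, mu0=0.0, v0=1.0, t1=3.0, t2=2.0, alpha=3.0, beta=2.0):
    """Draw one dataset of size n from the hierarchical normal model."""
    # Hyperpriors: population mean and between-unit variance.
    mu = rng.gauss(mu0, math.sqrt(v0))
    tau2 = inverse_gamma(rng, t1, t2)
    # Prior on the common observation variance.
    sigma2 = inverse_gamma(rng, alpha, beta)
    # Unit-level means, then one observation per unit.
    mu_i = [rng.gauss(mu, math.sqrt(tau2)) for _ in range(n)]
    y = [rng.gauss(m, math.sqrt(sigma2)) for m in mu_i]
    return {"mu": mu, "tau2": tau2, "sigma2": sigma2, "mu_i": mu_i, "y": y}


rng = random.Random(0)  # fixed seed so the draw is reproducible
draw = simulate(rng, n=5)
print(draw["y"])
```

Inference runs in the opposite direction: given the observed `y`, we want the posterior over `mu_i`, `mu`, `tau2` and `sigma2`.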

## Levels of hierarchy

- Data model
- Process model
- Parameter model
- Hyperparameters

## Key points about hierarchical models
- They allow us to write down models that explain *variability in the parameters* of a model.
- *Partition variability* more explicitly into multiple terms.
- *Borrow strength* across data sets.
- 'Hierarchy' is with respect to parameters.

## Exercise

- How would you model these observations?

<center>
![](assets/dogs1.jpeg){width=30%}
![](assets/dogs4.jpeg){width=30%}
![](assets/cats1.jpeg){width=28%}
</center>

## Model comparison
- Frequentist way:
  - optimise different models on the <ins>training set</ins>,
  - and compare them on the <ins>test set</ins>.
- Bayesian way:
  - use <ins>model evidence</ins> $p(\text{Data} \mid \text{Model}_i)$.
  - It tends to favour models that are just complex enough to explain the data but no more complex, which helps to avoid <ins>overfitting</ins>.
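
## Sketch: model evidence for a coin

As a toy illustration of how model evidence trades fit against complexity, consider a coin-flip example that is not part of the lecture material. A simple model fixes the probability of heads at 0.5; a more flexible model places a uniform prior on it. For a given sequence of flips, both evidences are available in closed form.

```python
from math import factorial


def evidence_fair(n_heads, n_tails):
    """p(Data | fair coin): every flip has probability 0.5."""
    return 0.5 ** (n_heads + n_tails)


def evidence_uniform(n_heads, n_tails):
    """p(Data | theta ~ Uniform(0, 1)), integrating theta out.

    The integral of theta^h * (1 - theta)^t over [0, 1] is the Beta function
    B(h + 1, t + 1) = h! * t! / (h + t + 1)!.
    """
    return (factorial(n_heads) * factorial(n_tails)
            / factorial(n_heads + n_tails + 1))


for heads, tails in [(5, 5), (9, 1)]:
    e_fair = evidence_fair(heads, tails)
    e_unif = evidence_uniform(heads, tails)
    print(f"{heads} heads, {tails} tails: fair={e_fair:.2e}, uniform={e_unif:.2e}")
```

With balanced data the simple model has the higher evidence, because the flexible model spreads its prior mass over many values of the parameter that the data do not support. With strongly unbalanced data the flexible model wins.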

## Information criteria
*Information criteria* are popular tools for model selection that provide a quantitative measure of the trade-off between model fit and complexity. The *lower* the value of an information criterion, the better the model.

## Examples of information criteria
- AIC (Akaike Information Criterion):
$$\text{AIC} = -2 \ln(L) + 2k,$$
where $L$ is the maximum value of the likelihood and $k$ is the number of parameters.
- BIC (Bayesian Information Criterion):
$$\text{BIC} = -2 \ln(L) + k \ln(n),$$
where $n$ is the number of observations.
- WAIC (Widely Applicable Information Criterion)
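
## Sketch: computing AIC and BIC

The AIC and BIC formulas can be applied directly once the maximised log-likelihood is known. The sketch below compares a complete-pooling normal model (one mean and one variance, so $k = 2$) with a group-level no-pooling model (one mean per group and a shared variance). The data are made up, and using a shared variance in the second model is an assumption made to keep the example short.

```python
import math
from statistics import mean

# Hypothetical measurements for three groups (made-up values).
groups = {
    "breed_a": [4.1, 3.9, 4.4, 4.0],
    "breed_b": [30.2, 28.7, 31.5],
    "breed_c": [11.8, 12.5],
}
all_obs = [y for ys in groups.values() for y in ys]
n = len(all_obs)


def normal_max_loglik(residuals):
    """Maximised normal log-likelihood, given residuals from the fitted means."""
    m = len(residuals)
    sigma2_hat = sum(r * r for r in residuals) / m  # MLE of the variance
    return -0.5 * m * (math.log(2 * math.pi * sigma2_hat) + 1)


def aic(loglik, k):
    return -2 * loglik + 2 * k


def bic(loglik, k, n):
    return -2 * loglik + k * math.log(n)


# Complete pooling: one mean, one variance.
grand_mean = mean(all_obs)
pooled_ll = normal_max_loglik([y - grand_mean for y in all_obs])
pooled_k = 2

# No pooling at the group level: one mean per group, one shared variance.
unpooled_ll = normal_max_loglik(
    [y - mean(ys) for ys in groups.values() for y in ys]
)
unpooled_k = len(groups) + 1

print(f"pooled:   AIC={aic(pooled_ll, pooled_k):.1f}, BIC={bic(pooled_ll, pooled_k, n):.1f}")
print(f"unpooled: AIC={aic(unpooled_ll, unpooled_k):.1f}, BIC={bic(unpooled_ll, unpooled_k, n):.1f}")
```

For these well-separated made-up groups, both criteria prefer the model with group-specific means despite its extra parameters.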

# Questions?