data-prep.Rmd

---
title: "Source data preparation for the England school admissions dashboard"
author: "Clare Gibson"
output:
  github_document:
    toc: true
    toc_depth: 3
---

# Setup

```{r setup}
# Load packages
library(knitr)          # for Rmd options
library(tidyverse)      # for general wrangling
library(data.table)     # for transposing data frames
library(janitor)        # for cleaning column headers
library(lubridate)      # for working with dates
library(googlesheets4)  # for reading in Google Sheets files

# Knitr Options
knitr::opts_chunk$set(
	echo = TRUE,
	fig.align = "center",
	message = FALSE,
	warning = FALSE
)
```

# Introduction

This notebook serves as the documented code to read and prepare data for the England School Admissions Dashboard project ([github](https://github.com/clarelgibson/england-school-admissions) \| [tableau](https://public.tableau.com/views/Schools_16505251102060/Home?:language=en-GB&:display_count=n&:origin=viz_share_link)).

The notebook begins by reading in the source data required for the dashboards. It then works through the data wrangling steps necessary to clean and prepare the data for transfer to Tableau.

## Source data

The full list of data sources, download links and retrieval steps can be found in the [project wiki](https://github.com/clarelgibson/england-school-admissions/wiki/Source-data).

## Custom functions

I have created several custom functions to assist with data prep. Load these functions by sourcing the `utils.R` script in the working directory.

```{r custom-functions}
# Custom functions
source("utils.R")
```

# Read data

## Info

This file contains descriptive fields relating to primary and secondary schools across all local authorities in England. It excludes nurseries, special schools, children's centers, pupil referral units and post-16 education. We can consider this largely to be the dimensional data for the dashboard, while the other datasets for Offers and Performance can be considered to be the measures.

The source data is stored in a CSV file on [Google Drive](https://drive.google.com/drive/u/1/folders/1lFZhobbGoCKKEtaZ5CDiphTbOxcbvJzW). We can read the CSV file directly from the source location using the custom function `read_csv_gdrive()`. When reading in source data, I usually assign it to a variable with the prefix `src_` to indicate that this is source data. Once I start to modify the data, I remove the prefix. That way, I always have a copy of the original source data in its unaltered form to refer back to.

```{r info-read}
# Read in the source data for info
src_info_path <- "https://drive.google.com/file/d/1kUqKvphHnh4M-NfP4gXLAr9MxdxoWgMT/view?usp=sharing"
src_info <- read_csv_gdrive(src_info_path)
```

This data frame has `r nrow(src_info)` rows and `r ncol(src_info)` columns. The columns include some identifiers for time, geography and school, some descriptive dimensions about each school and some measures relating to the number of pupils of different types. Each row represents a single school and the data represents the latest available.

## Offers

### By Local Authority

This file contains data relating to the number of applications from offers made to applicants for secondary and primary school places since 2014. The data is aggregated at the local authority level. This file will be important as our single source of truth for local authority-level measures (e.g. the number of preferences allowed on the application form and the total number of applications received).

```{r offla-read}
# Read in the source data for offla
src_offla_path <- "https://drive.google.com/file/d/12jONhUf3xSZdRbsUE_4tZRA014sIHO_D/view?usp=sharing"
src_offla <- read_csv_gdrive(src_offla_path)
```

### By School

This file contains data relating to the number of offers made to applicants for secondary and primary school places since 2014, and the proportion which received preferred offers. Data is aggregated at the school level. Further details including an explanation of the terminology used in this dataset can be found [here](https://drive.google.com/drive/u/1/folders/1bbnbd9d4HDn3yXIIhaRSigZV46KpCyKy). Once again, we can connect to it through Google Drive.

```{r offers-read}
# Read in the source data for offers
src_offers_path <- "https://drive.google.com/file/d/1bG0a0WRg-LEnASl6M3J0qyHJA_cMY1CQ/view?usp=sharing"
src_offers <- read_csv_gdrive(src_offers_path)
```

From the code outputs above, we can see that this data frame has `r nrow(src_offers)` rows and `r ncol(src_offers)` columns (a mix of character and numeric data types). The columns include some identifiers for time, geography and school, some descriptive dimensions for the number of preferences and type of school and some measures relating to the number of applications and offers made. Each row represents a single school for a single academic year.

## Performance

These files contain key stage 2 (KS2) performance data for primary schools and key stage 4 (KS4) performance data for secondary schools in England for academic years from 2014 onwards. Note that due to the COVID-19 pandemic, the UK Government cancelled all statutory national curriculum assessments due to be held in summer 2020 and 2021 at all Key Stages. Therefore, no data is available for the 2019/20 or 2020/21 academic years. Further details including an explanation of the terminology used in this dataset can be found [here](https://drive.google.com/drive/folders/1vai66CUaYhPI0RXQI-BlC5kufZZyf3JU?usp=sharing).

The source data for performance is stored in several CSV files on the Google Drive. Having read through the guidance notes for these files, for the measures I want to compare, I can find these in all years from 2015/16 onwards.

```{r performance-read, message=FALSE}
# Read in the performance data files from 2015/16 onwards
# 2015-16 KS2
src_perf_1516ks2_path <- "https://drive.google.com/file/d/1migtFWfMUr8HJDeujfw6h7p2nNDVsbp4/view?usp=sharing"
src_perf_1516ks2 <- read_csv_gdrive(src_perf_1516ks2_path)

# 2015-16 KS4
src_perf_1516ks4_path <- "https://drive.google.com/file/d/1ijl88VsaaNdEC-N2O9hcRKc2sixcMyES/view?usp=sharing"
src_perf_1516ks4 <- read_csv_gdrive(src_perf_1516ks4_path)

# 2016-17 KS2
src_perf_1617ks2_path <- "https://drive.google.com/file/d/1B1LpH8RXlBoH7C9mJPL4FribTvIUzekq/view?usp=sharing"
src_perf_1617ks2 <- read_csv_gdrive(src_perf_1617ks2_path)

# 2016-17 KS4
src_perf_1617ks4_path <- "https://drive.google.com/file/d/1jOOmuKXDDp--cbESCz-j4Vp7ufVK3tPV/view?usp=sharing"
src_perf_1617ks4 <- read_csv_gdrive(src_perf_1617ks4_path)

# 2017-18 KS2
src_perf_1718ks2_path <- "https://drive.google.com/file/d/1zJxcJrsOr7nvWwb6Yyg48QoMgSp76e8V/view?usp=sharing"
src_perf_1718ks2 <- read_csv_gdrive(src_perf_1718ks2_path)

# 2017-18 KS4
src_perf_1718ks4_path <- "https://drive.google.com/file/d/1-5WDJ5dZulWmqQa8fX5eczA0Bo2m_jEc/view?usp=sharing"
src_perf_1718ks4 <- read_csv_gdrive(src_perf_1718ks4_path)

# 2018-19 KS2
src_perf_1819ks2_path <- "https://drive.google.com/file/d/1CmXvDtDOmIlXXUzZ9SPW4Yoti-rTzi75/view?usp=sharing"
src_perf_1819ks2 <- read_csv_gdrive(src_perf_1819ks2_path)

# 2018-19 KS4
src_perf_1819ks4_path <- "https://drive.google.com/file/d/1uBhU0yeuV97d66PRR60pq8UaRWvdGIaA/view?usp=sharing"
src_perf_1819ks4 <- read_csv_gdrive(src_perf_1819ks4_path)
```

These files contain a large number of columns. We will need to pick out only the most relevant for reporting, which can be standardised across all schools.

## Phase

This dataset includes details about the educational phase. Since there are some schools in England that offer both primary and secondary education, we will need a way to distinguish the phases in the fact table.

```{r phase-read}
# Read in the source data for phase
src_phase_path <- "https://drive.google.com/file/d/1I9Ibo-hG_VzXvzNgOVMi6PPYfSCuJzgG/view?usp=sharing"
src_phase <- read_csv_gdrive(src_phase_path)
```

This dataset has `r nrow(src_phase)` rows and `r ncol(src_phase)` columns and contains largely descriptive information about the educational phases in England.

## Calendar

This dataset includes details about the academic years covered by the data.

```{r calendar-read}
# Read in the source data for calendar
src_calendar_path <- "https://drive.google.com/file/d/1el2cL02nQidzN1pMcZ7AUb3h6wXuzN_E/view?usp=sharing"
src_calendar <- read_csv_gdrive(src_calendar_path)
```

This will be our date table. Since the lowest unit of time in the data is year, this is the level of detail of the calendar table. This table has `r nrow(src_calendar)` rows and `r ncol(src_calendar)` columns.

# Clean data

## Info

One key piece of cleaning we need to do with this dataset is to provide a mechanism to group entities that describe the same establishment (i.e. where there has been a change in URN over time due to conversion to academy). We can do this using the `links` columns at the end of this data frame, along with the `urn`, to create a bridge table, which we'll call `brg_school`.

The links I am interested in are the predecessor-successor links so that I can keep schools grouped over time. I want the successor school to be retained as the master school and all predecessors to be linked to the master. If a school has no links then we simply list the master and linked URNs as the same number. Therefore, I need to keep all of the open schools as the master list and any predecessors need to be linked to those master URNs.

```{r bridge-school}
# Set up the bridge table
brg_school <- src_info %>%
  # keep only currently open schools
  filter(grepl("^Open.*", establishment_status_name)) %>%
  # select required columns
  select(master_urn = urn,
         status = establishment_status_name,
         starts_with("link_")) %>% 
  # pivot the links
  pivot_longer(!c(master_urn, status),
               values_drop_na = TRUE) %>% 
  # drop the name column
  select(-name) %>% 
  # drop the non-useful link types
  mutate(value = case_when(
    value == "Does not have links" ~ value,
    grepl(".*Predecessor.*", value) ~ value,
    TRUE ~ "No useful links"
  )) %>% 
  # split the description column into useful data
  mutate(linked_urn = as.numeric(str_extract(value,"\\d+")),
         link_type = str_trim(str_extract(value,"\\D+"))) %>% 
  # drop the value column
  select(-value) %>% 
  # add self-urn
  mutate(self_urn = master_urn) %>% 
  pivot_longer(cols = c(linked_urn, self_urn),
               names_to = "urn_type",
               values_to = "linked_urn",
               values_drop_na = TRUE) %>% 
  # select required columns
  select(master_urn,
         linked_urn) %>% 
  distinct()
```

Now that the bridge table is set up I can filter the `src_info` table to include only the currently open schools. The number of distinct URNs in this table should then match the number of distinct master URNs in `brg_school`.

```{r info-filter}
# Keep only the currently open schools in info
info <- src_info %>% 
  filter(grepl("^Open.*",
               establishment_status_name))
```

```{r info-urn-count-check}
# Check if the number of distinct URNs in info matches the number of
# distinct master URNs in brg_school
n_distinct(info$urn) == n_distinct(brg_school$master_urn)
```

The numbers match. Success!

Next, let's keep only the columns we need from `src_info`.

```{r info-select-cols}
# Keep only the required columns from src_info
info <- info %>% 
  select(urn:establishment_status_name,
         open_date,
         close_date,
         phase_of_education_name:statutory_high_age,
         gender_name,
         religious_character_name,
         admissions_policy_name,
         school_capacity,
         number_of_pupils:percentage_fsm,
         ofsted_last_insp:head_preferred_job_title,
         gor_name:lsoa_name,
         ofsted_rating_name,
         msoa_code:fsm)
```

Now we can rename the columns more appropriately.

```{r info-rename}
# Rename columns in info
info <- info %>% 
  rename(establishment_type = type_of_establishment_name,
         establishment_type_group = establishment_type_group_name,
         establishment_status = establishment_status_name,
         education_phase = phase_of_education_name,
         gender = gender_name,
         religious_character = religious_character_name,
         admissions_policy = admissions_policy_name,
         county = county_name,
         head_title = head_title_name,
         gor = gor_name,
         administrative_district = district_administrative_name,
         administrative_ward = administrative_ward_name,
         parliamentary_constituency = parliamentary_constituency_name,
         urban_rural_indicator = urban_rural_name,
         gss_la_code = gssla_code_name,
         msoa = msoa_name,
         lsoa = lsoa_name,
         ofsted_rating = ofsted_rating_name,
         number_of_fsm = fsm)
```

And convert the columns to the correct data types.

```{r info-type}
# Define the columns by type
info_date <- c("open_date",
               "close_date",
               "ofsted_last_insp",
               "last_changed_date")

# Correct the data types in info
info <- info %>% 
  mutate(across(all_of(info_date), dmy)) %>% 
  mutate(telephone_num = as.character(telephone_num),
         telephone_num = paste0("0", telephone_num))
```

Let's now review the selected column headings in this dataset and the number of unique values in each column

```{r info-unique-values}
# Print a list of the column headings in the info df
info %>% 
  summarise_all(n_distinct) %>% 
  transpose(keep.names = "field") %>% 
  rename(unique_values = V1)
```

The `la_name` and `la_code` fields need to be checked to ensure they are consistent. I refer to the [Office for National Statistics](https://www.ons.gov.uk) [Local Authority Districts (December 2021) Names and Codes in the United Kingdom](https://geoportal.statistics.gov.uk/documents/c4f647d8a4a648d7b4a1ebf057f8aaa3/about) document for the latest reference data for these fields. This document is included in the Google Drive [ref folder](https://drive.google.com/drive/folders/1lsRDMfjSzabSF8HVJdF8ex5JM6rDRlSo?usp=sharing).

Let's review the unique values of LA code and name in the `info` df.

```{r info-review-la-names}
# Review the unique values in la_name
info %>% 
  select(la_code, gss_la_code, la_name) %>% 
  distinct() %>% 
  arrange(la_name, la_code, gss_la_code)
```

We can see that some of the `gss-la-code` values here are wrong. It looks like a value of `X999999` has been used where the true value could not be determined when the source data was put together. We need to replace this value with the correct value according to the reference table.

```{r info-clean-gss-code}
# Correct the erroneous gss_la_code fields
info <- info %>% 
  group_by(la_code, la_name) %>% 
  mutate(gss_la_code = min(gss_la_code)) %>% 
  ungroup()

# Check the results
info_la <- info %>% 
  select(la_code, gss_la_code, la_name) %>% 
  distinct() %>% 
  arrange(la_name, la_code, gss_la_code)
```

Some columns have relatively few unique values, making these columns clearly categorical. I'd like to review the values to ensure they are consistent and well labelled.

```{r info-review-establishment-type}
# Review the unique values in establishment type
unique(info$establishment_type)
```

These values all look ok. [This page](https://www.leicester.gov.uk/schools-and-learning/school-and-colleges/school-admissions/understanding-the-different-types-of-school/) has some useful definitions of the different types of school.

```{r info-review-establishment-group}
# Review the unique values in establishment group
unique(info$establishment_type_group)
```

All ok.

```{r info-review-status}
# Review the unique values in establishment status
unique(info$establishment_status)
```

All ok.

```{r info-review-phase}
# Review the unique values in phase of education
unique(info$education_phase)
```

All ok.

```{r info-review-gender}
# Review the unique values in gender
unique(info$gender)
```

Some missing values here. Let's replace those with 'Not reported'.

```{r info-recode-gender}
# Convert missing values in gender
info$gender[is.na(info$gender)] <- "Not reported"

# Check the results
unique(info$gender)
```

All ok.

```{r info-review-religious-character}
# Review the unique values in religious character
unique(info$religious_character)
```

Again, let's fix the missing values.

```{r info-recode-religious-character}
# Convert missing values in religious character
info$religious_character[is.na(info$religious_character)] <- 
  "Not reported"

# Check the results
unique(info$religious_character)
```

All ok.

```{r info-review-admissions-policy}
# Review the unique values in admissions policy
unique(info$admissions_policy)
```

Again let's fix the missing values.

```{r info-recode-admissions-policy}
# Convert missing values in admissions policy
info$admissions_policy[is.na(info$admissions_policy)] <- 
  "Not reported"

# Check the results
unique(info$admissions_policy)
```

All ok.

```{r info-review-ofsted}
# Review the unique values in ofsted rating
unique(info$ofsted_rating)
```

Some recoding is necessary here. According to [Ofsted](https://www.gov.uk/government/publications/education-inspection-framework/education-inspection-framework), school inspections use a 4-point grading scale:

-   grade 1 - outstanding
-   grade 2 - good
-   grade 3 - requires improvement
-   grade 4 - inadequate

In our dataset, I see some other values of `Serious Weaknesses` and `Special Measures`, which are not on the ratings list provided by Ofsted. In fact, both of these ratings are a [subset](https://www.gov.uk/government/publications/school-inspections-a-guide-for-parents/school-inspections-a-guide-for-parents) of the "inadequate" (grade 4) rating. Therefore, I can recode both of these values as `inadequate`. I can also add in the grade points to the ratings dimension table to provide a quantitative and aggregatable measure.

```{r info-recode-ofsted}
# Fix values for ofsted rating
info <- info %>% 
    mutate(ofsted_rating = case_when(
      grepl("Weakness", ofsted_rating) ~ "Inadequate",
      grepl("Measures", ofsted_rating) ~ "Inadequate",
      is.na(ofsted_rating) ~ "Not reported",
      TRUE ~ ofsted_rating
    ),
    ofsted_rating_score = case_when(
    grepl("Outstanding", ofsted_rating) ~ 1,
    grepl("Good", ofsted_rating) ~ 2,
    grepl("improvement", ofsted_rating) ~ 3,
    grepl("Inadequate", ofsted_rating) ~ 4
  ))

# Check results
unique(info$ofsted_rating)
unique(info$ofsted_rating_score)
```

## Offers

### By Local Authority

From the `src_offla` dataset we need to keep the identifiers for local authority, year and phase as well as the numerical data. We should filter the source data to keep only data at the `geographic_level` of "Local authority".

```{r offla-select-cols}
# Select the required columns from offla and correct data types
offla <- src_offla %>% 
  mutate(across(everything(), as.character)) %>% 
  filter(geographic_level == "Local authority") %>% 
  select(-c(nc_year_admission,
            time_identifier,
            geographic_level,
            country_code,
            country_name,
            one_of_the_three_preference_offers,
            ends_with("_percent"))) %>% 
  type_convert(na = "c")
```

Now we can rename the columns more appropriately.

```{r offla-rename}
# Rename the columns
offla <- offla %>% 
  rename(year = time_period,
         gss_la_code = new_la_code,
         la_code = old_la_code,
         phase = school_phase,
         number_of_preferences = no_of_preferences)
```

Now let's review the entries for the local authority identifiers and check which rows in `offla` are not present in `info_la` and therefore need to be changed.

```{r offla-la-names}
# Review the unique values of la_name
offla %>% 
  select(la_code, gss_la_code, la_name) %>% 
  distinct() %>% 
  arrange(la_name, la_code) %>% 
  anti_join(info_la)
```

Let's make the necessary replacements.

```{r offla-recode-la}
# Recode the LA identifier fields in offla
offla <- offla %>% 
  mutate(
    la_code = case_when(
      la_code == 837 ~ 839,
      la_code == 835 ~ 838,
      la_code == 928 ~ 940,
      la_code == 836 ~ 839,
      TRUE ~ la_code
    ),
    gss_la_code = case_when(
      gss_la_code == "E06000028" ~ "E06000058",
      gss_la_code == "E10000002" ~ "E06000060",
      gss_la_code == "E10000009" ~ "E06000059",
      gss_la_code == "E08000020" ~ "E08000037",
      gss_la_code == "E10000021" ~ "E06000061",
      gss_la_code == "E06000048" ~ "E06000057",
      gss_la_code == "E06000029" ~ "E06000058",
      TRUE ~ gss_la_code
    ),
    la_name = case_when(
      la_name == "Bournemouth" ~ "Bournemouth, Christchurch and Poole",
      la_name == "Northamptonshire" ~ "North Northamptonshire",
      la_name == "Poole" ~ "Bournemouth, Christchurch and Poole",
      TRUE ~ la_name
    )
  )

# Extra rows because Northamptonshire is now two distinct regions
offla_extra <- tibble(
  year = c(202122, 202122, 202021, 202021, 201920, 201920, 201819, 201819,
           201718, 201718, 201617, 201617, 201516, 201516, 201415, 201415),
  region_code = rep("E12000004", 16),
  region_name = rep("East Midlands", 16),
  gss_la_code = rep("E06000062", 16),
  la_code = rep(941, 16),
  la_name = rep("West Northamptonshire", 16),
  phase = rep(c("Primary", "Secondary"), 8),
  number_of_preferences = rep(3,16)
)

# Bind extra rows on to prefs
offla <- offla %>% 
  bind_rows(offla_extra) %>% 
  group_by_at(vars(-number_of_preferences)) %>% 
  slice_max(number_of_preferences) %>% 
  ungroup() %>% 
  distinct()

# Check the results
offla_la <- offla %>% 
  select(la_code, gss_la_code, la_name) %>% 
  distinct() %>% 
  arrange(la_name, la_code) 

offla_la %>% 
  anti_join(info_la)
```

```{r offla-review-year}
# Review the unique values of year
unique(offla$year)
```

These values should be recoded to match the `year_key` in the calendar table. We need to extract the academic start year, which is the first 4 digits of the value.

```{r offla-recode-year}
# Recode the values in year
offla$year <- as.numeric(str_extract(offla$year, "^\\d{4}"))

# Check the results
unique(offla$year)
```

### By School

Let's keep only the columns we need from `src_offers`. Some of the columns are duplicates of columns already in the `info` table, and some are just not required.

```{r offers-select-cols}
# Select required columns from offers
offers <- src_offers %>% 
  select(time_period,
         region_code,
         region_name,
         old_la_code,
         new_la_code,
         la_name,
         total_number_places_offered:offers_to_applicants_from_another_la,
         denomination,
         school_urn,
         entry_year)
```

Next we can rename the columns more appropriately.

```{r offers-rename}
# Rename the columns
offers <- offers %>% 
  rename(year = time_period,
         la_code = old_la_code,
         gss_la_code = new_la_code,
         religious_denomination = denomination,
         urn = school_urn,
         phase = entry_year)
```

Next we can filter out any records where the `school_urn` is `n/a` and convert the remaining values to numeric.

```{r offers-filter}
# Remove entries with null URN
offers <- offers %>% 
  filter(urn != "n/a") %>% 
  mutate(urn = as.numeric(urn))
```

Now we can check and correct the data types.

```{r offers-type}
# Define cols to be converted
offers_num <- c("proportion_1stprefs_v_1stprefoffers",
                "proportion_1stprefs_v_totaloffers",
                "offers_to_applicants_from_another_la")

# Correct data types in offers
offers <- offers %>% 
  mutate(across(all_of(offers_num), as.numeric))
```

Let's now review the selected column headings in this dataset and the number of unique values in each column.

```{r offers-unique-values}
# Print a list of the column headings in the offers df
offers %>% 
  summarise_all(n_distinct) %>% 
  transpose(keep.names = "field") %>% 
  rename(unique_values = V1)
```

Let's review the categorical values to ensure they are consistent and well labelled.

```{r offers-review-la-name}
# Review the unique values of la_name
offers %>% 
  select(la_code, gss_la_code, la_name) %>% 
  distinct() %>% 
  arrange(la_name, la_code)
```

Some of these entries need to be corrected to match the values in `info`.

```{r offers-clean-la-name}
offers_la <- offers %>% 
  select(urn,
         year,
         la_code,
         gss_la_code,
         la_name,
         region_code,
         region_name) %>% 
  distinct() %>% 
  arrange(urn,
          -year,
          la_code,
          la_name) %>% 
  group_by(urn) %>% 
  slice_head(n = 1) %>% 
  ungroup() %>%
  inner_join(brg_school,
            by = c("urn" = "linked_urn")) %>% 
  left_join(select(info,
                   master_urn = urn,
                   master_la_code = la_code,
                   master_gss_la_code = gss_la_code,
                   master_la_name = la_name)) %>% 
  mutate(la_code = coalesce(master_la_code, la_code),
         gss_la_code = coalesce(master_gss_la_code, gss_la_code),
         la_name = coalesce(master_la_name, la_name)) %>% 
  select(urn,
         master_urn,
         la_code,
         gss_la_code,
         la_name,
         region_code,
         region_name) %>% 
  distinct()
```

```{r offers-update-la-name}
offers <- offers %>% 
  select(-c(region_code,
            region_name,
            la_code,
            gss_la_code,
            la_name)) %>% 
  inner_join(offers_la)

# Check the results
offers_la <- offers %>% 
  select(la_code, gss_la_code, la_name) %>% 
  distinct() %>% 
  arrange(la_name, la_code)
```

```{r offers-review-year}
# Review the unique values of year
unique(offers$year)
```

These values should be recoded to match the `year_key` in the calendar table. We need to extract the academic start year, which is the first 4 digits of the value.

```{r offers-recode-year}
# Recode the values in year
offers$year <- as.numeric(str_extract(offers$year, "^\\d{4}"))

# Check the results
unique(offers$year)
```

All ok.

```{r offers-review-denomination}
# Review the unique values in denomination
unique(offers$religious_denomination)
```

There are some values of `n/a` here. Let's see what those are.

```{r offers-denomination-na}
# Review the n/a values in denomination
offers %>% 
  filter(religious_denomination == "n/a") %>% 
  head()
```

I guess this must be for schools who did not report their denomination. Let's replace `n/a` with `Not reported`.

```{r offers-denomination-replace-na}
# Replace n/a
offers$religious_denomination[offers$religious_denomination == "n/a"] <- "Not reported"

# Check the results
unique(offers$religious_denomination)
```

All ok.

```{r offers-review-phase}
# Review the unique values in phase
unique(offers$phase)
```

Here we need to make some changes so that this column can be used to refer to the educational phase of the data. We need to recode these values to match the phases in our phase table (primary or secondary). The value `R` corresponds to primary and `7` and `9` both correspond to secondary.

```{r offers-record-phase}
# Recode the phase values
offers <- offers %>% 
  mutate(phase = case_when(
    phase == "R" ~ "Primary",
    phase %in% c("7", "9") ~ "Secondary",
    TRUE ~ phase
  ))

# Check the results
unique(offers$phase)
```

## Performance

From this dataset we really need to select one or two good measures of performance for KS2 and KS4, along with the URN so that we can join back to school.

-   For KS2 we'll used the progress measures `readprog`, `writprog` and `matprog`. [This article](https://www.theschoolrun.com/understanding-primary-school-league-tables) has a good explanation of the KS2 progress scores, and [this article](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/851788/KS2_progress_banding_calculations_bandings_2019.pdf) explains how the bandings can be calculated.
-   For KS4 we'll use the progress 8 and attainment 8 measures `p8mea` and `att8scr`. [This article](https://www.goodschoolsguide.co.uk/curricula-and-exams/progress-8-attainment-8) has a good explanation of these measures.

We will further need to filter both the KS2 and KS4 data to include only rows that are aggregated at the mainstream school level (`RECTYPE = 1`).

```{r perf-ks2-clean}
# Clean the KS2 data
# Bind rows from all years
perf_ks2 <- bind_rows(
  "2015" = src_perf_1516ks2,
  "2016" = src_perf_1617ks2,
  "2017" = src_perf_1718ks2,
  "2018" = src_perf_1819ks2,
  .id = "year"
) %>% 
  # Filter to rectype == 1
  filter(rectype == 1) %>% 
  # Select the required columns
  select(year,
         urn,
         readprog,
         writprog,
         matprog) %>% 
  # convert to numeric (any strings can be converted to NA)
  mutate(across(everything(), as.numeric)) %>% 
  # add a column for phase
  mutate(phase = "Primary")

# View the head
head(perf_ks2)
```

```{r perf-ks4-clean}
# Bind rows from all years
perf_ks4 <- bind_rows(
  "2015" = select(src_perf_1516ks4, rectype, urn, p8mea, att8scr),
  "2016" = select(src_perf_1617ks4, rectype, urn, p8mea, att8scr),
  "2017" = select(src_perf_1718ks4, rectype, urn, p8mea, att8scr),
  "2018" = select(src_perf_1819ks4, rectype, urn, p8mea, att8scr),
  .id = "year"
) %>% 
  # Filter to rectype == 1
  filter(rectype == 1) %>% 
  # Select the required columns
  select(year,
         urn,
         p8mea,
         att8scr) %>% 
  # convert to numeric (any strings can be converted to NA)
  mutate(across(everything(), as.numeric)) %>% 
  # add the phase
  mutate(phase = "Secondary")

# View the head
head(perf_ks4)
```

Now we can join the KS2 and KS4 data together and recode the progress measure names.

```{r performance-combined}
# Combine two data frames
performance <- perf_ks2 %>% 
  full_join(perf_ks4) %>% 
  rename(reading_progress = readprog,
         writing_progress = writprog,
         maths_progress = matprog,
         progress_8 = p8mea,
         attainment_8 = att8scr) %>% 
  left_join(select(info,
                   urn,
                   la_code,
                   la_name))

# View the head
head(performance)
```

## Phase

No cleaning required. We can simply rename without the `src_` prefix.

```{r phase-clean}
# Rename the variable
phase <- src_phase
```

## Calendar

No cleaning required. We can simply rename without the `src_` prefix.

```{r calendar-clean}
# Rename the variable
calendar <- src_calendar
```

# Model data

The matrix below shows the fact and dimension tables that I intend to create for this model and the relationships between them.

| Fact/Dimension   | dim_la | dim_school | dim_calendar | dim_phase |
|------------------|:------:|:----------:|:------------:|:---------:|
| fact_la_year     |   x    |            |      x       |     x     |
| fact_school_year |   x    |     x      |      x       |     x     |
| fact_school      |   x    |     x      |              |           |

: Fact and dimension tables

To build out the dimensional model for this data, we need to review all of the columns we intend to use, so that we can determine where they fit in the model. The custom function `describe_df()` contained in the `utils.R` file can help us with this.

```{r describe-df}
# Create a vector of dfs to describe
data <- list("info" = info,
             "offers_by_la" = offla,
             "offers" = offers,
             "performance" = performance,
             "phase" = phase,
             "calendar" = calendar)

# Run describe_df() over each df
t_data <- lapply(data, describe_df)

# Bind each df into a single df
t_data <- bind_rows(t_data, .id = "source_table")
```

After exporting the new `t_data` dataframe to a spreadsheet, I created a file named [star-schema](https://docs.google.com/spreadsheets/d/1rrlhlvFrnPpTQtn2YzWyJZzwlm-2wWAzDiEv6fKBoHU/edit?usp=sharing) which maps the raw source data to the corresponding dimension and fact tables for the star schema model.

```{r read-star-schema-planning}
# Read in star schema planning document
star_schema_path <- "https://docs.google.com/spreadsheets/d/1rrlhlvFrnPpTQtn2YzWyJZzwlm-2wWAzDiEv6fKBoHU/edit?usp=sharing"
star_schema <- read_sheet(star_schema_path,
                          col_types = "c")
```

## Dimensions

### Local Authority

The following columns are required for the local authority dimension table:

```{r dim-la-cols}
# Which columns are needed for the local authority dimension?
star_schema %>% 
  filter(model_table == "dim_la") %>% 
  kable()
```

These columns can all come from the `offers` and `info` data frames.

```{r dim-la-build}
# Build out the local authority dimension table
dim_la <- offers %>% 
  select(la_code,
         gss_la_code,
         la_name, 
         region_code,
         region_name) %>% 
  distinct() %>% 
  full_join(info_la) %>% 
  mutate(region_code = case_when(la_code == 420 ~ "E12000009",
                                 TRUE ~ region_code),
         region_name = case_when(la_code == 420 ~ "South West",
                                 TRUE ~ region_name))

# Check the result
head(dim_la)
```

Let's assign a unique key to each record, and add a record for the `null` case, which we'll call `Not reported`.

```{r dim-la-key}
# Specify the null value
dim_la_null <- tibble(la_key = 0,
                      la_code = 0,
                      la_name = "Not reported",
                      gss_la_code = "Not reported",
                      region_code = "Not reported",
                      region_name = "Not reported")

# Assign a key and row for null values
dim_la <- dim_la %>% 
  rowid_to_column(var = "la_key") %>% 
  bind_rows(dim_la_null) %>% 
  arrange(la_key)
```

### School

The following fields are required for the school dimension.

```{r dim-school-cols}
# Which columns are needed for the school dimension?
star_schema %>% 
  filter(model_table == "dim_school") %>% 
  kable()
```

These columns can all be found in the `info` and `offers` tables. We need to ensure that we include only the URNs in the `master_urn` column of our bridge table.

```{r dim-school-build}
# Define columns to include
dim_school_cols_info <- star_schema %>% 
  filter(model_table == "dim_school",
         source_table == "info") %>% 
  pull(source_field)

dim_school_cols_offers <- star_schema %>% 
  filter(model_table == "dim_school",
         source_table == "offers") %>% 
  pull(source_field)

# Define which rows from offers should be joined (as there can be more than
# one record per URN)
dim_school_offers <- offers %>% 
  select(all_of(dim_school_cols_offers),
         year,
         phase) %>% 
  distinct() %>% 
  group_by(urn) %>% 
  slice_max(year) %>%
  slice_max(phase) %>% 
  ungroup()

# Build out the school dimension table
dim_school <- brg_school %>% 
  select(urn = master_urn) %>% 
  distinct() %>% 
  left_join(select(info,
                   all_of(dim_school_cols_info))) %>% 
  left_join(select(dim_school_offers,
                   all_of(dim_school_cols_offers)))
```

Let's assign a unique key to each record, and add a record for the `null` case, which we'll call `Not reported`.

```{r dim-school-key}
# Specify the null value
dim_school_null <- tibble(school_key = 0,
                          urn = 0,
                          establishment_number = 0,
                          establishment_name = "Not reported",
                          establishment_type = "Not reported",
                          establishment_type_group = "Not reported",
                          establishment_status = "Not reported",
                          open_date = NA,
                          close_date = NA,
                          education_phase = "Not reported",
                          statutory_low_age = NA,
                          statutory_high_age = NA,
                          gender = "Not reported",
                          religious_character = "Not reported",
                          admissions_policy = "Not reported",
                          ofsted_last_insp = NA,
                          last_changed_date = NA,
                          gor = "Not reported",
                          administrative_district = "Not reported",
                          administrative_ward = "Not reported",
                          parliamentary_constituency = "Not reported",
                          urban_rural_indicator = "Not reported",
                          msoa_code = "Not reported",
                          lsoa_code = "Not reported",
                          ofsted_rating = "Not reported",
                          religious_denomination = "Not reported")

# Assign a key and row for null values
dim_school <- dim_school %>% 
  rowid_to_column(var = "school_key") %>% 
  bind_rows(dim_school_null) %>% 
  arrange(school_key)
```

### Phase

The following columns are needed for the phase dimension:

```{r dim-phase-cols}
# Which columns are needed for the phase dimension?
star_schema %>% 
  filter(model_table == "dim_phase") %>% 
  kable()
```

For this table we just need to use the `phase` data frame as it already exists.

```{r dim-phase-build}
# Build the dim_phase table
dim_phase <- phase
```

### Calendar

For this table we just need to use the `calendar` data frame as it already exists.

```{r dim-calendar-build}
# Build the calendar dimension
dim_calendar <- calendar
```

## Facts

### School Year Facts

To build the school year facts table we need the following columns:

```{r school-year-fact-cols}
# Which columns are needed for the school year fact table?
star_schema %>% 
  filter(model_table == "fact_school_year") %>% 
  kable()
```

These columns all come from `offers`, `prefs` and `performance`.

```{r build-fact-school-year}
# Define columns to include
fct_school_year_cols_offers <- star_schema %>% 
  filter(model_table == "fact_school_year",
         source_table == "offers") %>% 
  pull(source_field)

fct_school_year_cols_perf <- star_schema %>% 
  filter(model_table == "fact_school_year",
         source_table == "performance") %>% 
  pull(source_field)

# Define the performance data to include
fct_school_year_perf <- performance %>% 
  select(linked_urn = urn,
         year,
         phase_name = phase,
         all_of(fct_school_year_cols_perf)) %>% 
  inner_join(brg_school) %>% 
  left_join(select(info,
                   master_urn = urn,
                   la_code,
                   la_name)) %>% 
  select(-master_urn)

# Build out the school year fact table
fct_school_year <- offers %>% 
  select(linked_urn = urn,
         la_code,
         la_name,
         year,
         phase_name = phase,
         all_of(fct_school_year_cols_offers)) %>% 
  full_join(fct_school_year_perf) %>% 
  # join school key (via bridge)
  inner_join(brg_school) %>% 
  left_join(select(dim_school,
                   master_urn = urn,
                   school_key)) %>% 
  # join la key
  left_join(select(dim_la,
                   la_code,
                   la_name,
                   la_key)) %>% 
  # join phase key
  left_join(select(dim_phase,
                   phase_name,
                   phase_key)) %>% 
  # select final columns
  select(school_key,
         la_key,
         year_key = year,
         phase_key,
         all_of(fct_school_year_cols_offers),
         all_of(fct_school_year_cols_perf)) %>% 
  # deal with duplicates
  group_by(school_key,
           la_key,
           phase_key,
           year_key) %>% 
  mutate(fct_school_year_count = n()) %>% 
  summarise_all(mean, na.rm = TRUE) %>% 
  ungroup() %>% 
  mutate(fct_school_year_flg = if_else(fct_school_year_count == 1,
                                       "Actual",
                                       "Average"))
```

### School Facts

To build the school facts table we will need to following columns, plus keys for all dimensions:

```{r school-fact-cols}
# Which columns are needed for the school fact table?
star_schema %>% 
  filter(model_table == "fact_school") %>% 
  kable()
```

These fields can all come from the `info` table.

```{r school-fact-build}
# Define columns to include
fct_school_cols <- star_schema %>% 
  filter(model_table == "fact_school",
         source_table == "info") %>% 
  pull(source_field)

# Build out the school fact table
fct_school <- info %>% 
  # select columns
  select(linked_urn = urn,
         la_code,
         la_name,
         education_phase,
         all_of(fct_school_cols)) %>% 
  # join school key (via bridge)
  left_join(brg_school) %>% 
  left_join(select(dim_school,
                   master_urn = urn,
                   school_key)) %>% 
  # join la key
  left_join(select(dim_la,
                   la_code,
                   la_name,
                   la_key)) %>% 
  # join phase key
  left_join(select(fct_school_year,
                   school_key,
                   phase_key)) %>% 
  mutate(phase_key = if_else(is.na(phase_key),
                             2,
                             phase_key)) %>% 
  # select final columns
  select(school_key,
         la_key,
         phase_key,
         all_of(fct_school_cols)) %>% 
  # deal with duplicates
  group_by(school_key,
           la_key,
           phase_key) %>% 
  mutate(fct_school_count = n()) %>% 
  summarise_all(mean, na.rm = TRUE) %>% 
  ungroup() %>% 
  mutate(fct_school_flg = if_else(fct_school_count == 1,
                                  "Actual",
                                  "Average"))
```

### Local Authority Year Facts

To build the LA year facts table we need the following columns:

```{r la-year-fact-cols}
# Which columns are needed for the LA year fact table?
star_schema %>% 
  filter(model_table == "fact_la_year") %>% 
  kable()
```

These columns all come from `offers_by_la`.

```{r build-fact-la-year}
# Define columns to include
fct_la_year_cols <- star_schema %>% 
  filter(model_table == "fact_la_year") %>% 
  pull(source_field)

# Build out the LA year fact table
fct_la_year <- offla %>% 
  select(la_code,
         la_name,
         year,
         phase_name = phase,
         all_of(fct_la_year_cols)) %>% 
  # join la key
  left_join(select(dim_la,
                   la_code,
                   la_name,
                   la_key)) %>% 
  # join phase key
  left_join(select(dim_phase,
                   phase_name,
                   phase_key)) %>% 
  # select final columns
  select(la_key,
         year_key = year,
         phase_key,
         all_of(fct_la_year_cols)) %>% 
  # deal with duplicates
  group_by(la_key,
           phase_key,
           year_key) %>% 
  mutate(fct_la_year_count = n()) %>% 
  summarise_all(mean, na.rm = TRUE) %>% 
  ungroup() %>% 
  mutate(fct_la_year_flg = if_else(fct_la_year_count == 1,
                                   "Actual",
                                   "Average"))
```

### All Facts
Finally we can join together the three individual fact tables to create a single fact table which will be more useful for Tableau.
```{r build-fact-all}
# Build a table for all facts
fct_all <- fct_school_year %>% 
  left_join(fct_school) %>% 
  left_join(fct_la_year)
```


# Export data

The final tables for the data model need to be saved as Rdata files, so that they can be accessed by Tableau.

```{r export-data}
# Define folder location for save
export_path <- "data-out/"

# Save objects
save(dim_la, file = paste0(export_path,
                           "dim_la.Rda"))

save(dim_school, file = paste0(export_path,
                               "dim_school.Rda"))

save(dim_calendar, file = paste0(export_path,
                                 "dim_calendar.Rda"))

save(dim_phase, file = paste0(export_path,
                              "dim_phase.Rda"))

save(fct_school, file = paste0(export_path,
                               "fct_school.Rda"))

save(fct_school_year, file = paste0(export_path,
                                    "fct_school_year.Rda"))

save(fct_la_year, file = paste0(export_path,
                                "fct_la_year.Rda"))

save(fct_all, file = paste0(export_path,
                            "fct_all.Rda"))
```