Clare Gibson
# Load packages
library(knitr) # for Rmd options
library(tidyverse) # for general wrangling
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(data.table) # for transposing data frames
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(janitor) # for cleaning column headers
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate) # for working with dates
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(googlesheets4) # for reading in Google Sheets files
# Knitr Options
knitr::opts_chunk$set(
echo = TRUE,
fig.align = "center",
message = FALSE,
warning = FALSE
)
This notebook serves as the documented code to read and prepare data for the England School Admissions Dashboard project (github | tableau).
The notebook begins by reading in the source data required for the dashboards. It then works through the data wrangling steps necessary to clean and prepare the data for transfer to Tableau.
The full list of data sources, download links and retrieval steps can be found in the project wiki.
I have created several custom functions to assist with data prep. Load
these functions by sourcing the utils.R
script in the working
directory.
# Custom functions
source("utils.R")
This file contains descriptive fields relating to primary and secondary schools across all local authorities in England. It excludes nurseries, special schools, children’s centers, pupil referral units and post-16 education. We can consider this largely to be the dimensional data for the dashboard, while the other datasets for Offers and Performance can be considered to be the measures.
The source data is stored in a CSV file on Google
Drive.
We can read the CSV file directly from the source location using the
custom function read_csv_gdrive()
. When reading in source data, I
usually assign it to a variable with the prefix src_
to indicate that
this is source data. Once I start to modify the data, I remove the
prefix. That way, I always have a copy of the original source data in
its unaltered form to refer back to.
# Read in the source data for info
src_info_path <- "https://drive.google.com/file/d/1kUqKvphHnh4M-NfP4gXLAr9MxdxoWgMT/view?usp=sharing"
src_info <- read_csv_gdrive(src_info_path)
This data frame has 36704 rows and 122 columns. The columns include some identifiers for time, geography and school, some descriptive dimensions about each school and some measures relating to the number of pupils of different types. Each row represents a single school and the data represents the latest available.
This file contains data relating to the number of applications from offers made to applicants for secondary and primary school places since 2014. The data is aggregated at the local authority level. This file will be important as our single source of truth for local authority-level measures (e.g. the number of preferences allowed on the application form and the total number of applications received).
# Read in the source data for offla
src_offla_path <- "https://drive.google.com/file/d/12jONhUf3xSZdRbsUE_4tZRA014sIHO_D/view?usp=sharing"
src_offla <- read_csv_gdrive(src_offla_path)
This file contains data relating to the number of offers made to applicants for secondary and primary school places since 2014, and the proportion which received preferred offers. Data is aggregated at the school level. Further details including an explanation of the terminology used in this dataset can be found here. Once again, we can connect to it through Google Drive.
# Read in the source data for offers
src_offers_path <- "https://drive.google.com/file/d/1bG0a0WRg-LEnASl6M3J0qyHJA_cMY1CQ/view?usp=sharing"
src_offers <- read_csv_gdrive(src_offers_path)
From the code outputs above, we can see that this data frame has 171385 rows and 35 columns (a mix of character and numeric data types). The columns include some identifiers for time, geography and school, some descriptive dimensions for the number of preferences and type of school and some measures relating to the number of applications and offers made. Each row represents a single school for a single academic year.
These files contain key stage 2 (KS2) performance data for primary schools and key stage 4 (KS4) performance data for secondary schools in England for academic years from 2014 onwards. Note that due to the COVID-19 pandemic, the UK Government cancelled all statutory national curriculum assessments due to be held in summer 2020 and 2021 at all Key Stages. Therefore, no data is available for the 2019/20 or 2020/21 academic years. Further details including an explanation of the terminology used in this dataset can be found here.
The source data for performance is stored in several CSV files on the Google Drive. Having read through the guidance notes for these files, for the measures I want to compare, I can find these in all years from 2015/16 onwards.
# Read in the performance data files from 2015/16 onwards
# 2015-16 KS2
src_perf_1516ks2_path <- "https://drive.google.com/file/d/1migtFWfMUr8HJDeujfw6h7p2nNDVsbp4/view?usp=sharing"
src_perf_1516ks2 <- read_csv_gdrive(src_perf_1516ks2_path)
# 2015-16 KS4
src_perf_1516ks4_path <- "https://drive.google.com/file/d/1ijl88VsaaNdEC-N2O9hcRKc2sixcMyES/view?usp=sharing"
src_perf_1516ks4 <- read_csv_gdrive(src_perf_1516ks4_path)
# 2016-17 KS2
src_perf_1617ks2_path <- "https://drive.google.com/file/d/1B1LpH8RXlBoH7C9mJPL4FribTvIUzekq/view?usp=sharing"
src_perf_1617ks2 <- read_csv_gdrive(src_perf_1617ks2_path)
# 2016-17 KS4
src_perf_1617ks4_path <- "https://drive.google.com/file/d/1jOOmuKXDDp--cbESCz-j4Vp7ufVK3tPV/view?usp=sharing"
src_perf_1617ks4 <- read_csv_gdrive(src_perf_1617ks4_path)
# 2017-18 KS2
src_perf_1718ks2_path <- "https://drive.google.com/file/d/1zJxcJrsOr7nvWwb6Yyg48QoMgSp76e8V/view?usp=sharing"
src_perf_1718ks2 <- read_csv_gdrive(src_perf_1718ks2_path)
# 2017-18 KS4
src_perf_1718ks4_path <- "https://drive.google.com/file/d/1-5WDJ5dZulWmqQa8fX5eczA0Bo2m_jEc/view?usp=sharing"
src_perf_1718ks4 <- read_csv_gdrive(src_perf_1718ks4_path)
# 2018-19 KS2
src_perf_1819ks2_path <- "https://drive.google.com/file/d/1CmXvDtDOmIlXXUzZ9SPW4Yoti-rTzi75/view?usp=sharing"
src_perf_1819ks2 <- read_csv_gdrive(src_perf_1819ks2_path)
# 2018-19 KS4
src_perf_1819ks4_path <- "https://drive.google.com/file/d/1uBhU0yeuV97d66PRR60pq8UaRWvdGIaA/view?usp=sharing"
src_perf_1819ks4 <- read_csv_gdrive(src_perf_1819ks4_path)
These files contain a large number of columns. We will need to pick out only the most relevant for reporting, which can be standardised across all schools.
This dataset includes details about the educational phase. Since there are some schools in England that offer both primary and secondary education, we will need a way to distinguish the phases in the fact table.
# Read in the source data for phase
src_phase_path <- "https://drive.google.com/file/d/1I9Ibo-hG_VzXvzNgOVMi6PPYfSCuJzgG/view?usp=sharing"
src_phase <- read_csv_gdrive(src_phase_path)
This dataset has 2 rows and 4 columns and contains largely descriptive information about the educational phases in England.
This dataset includes details about the academic years covered by the data.
# Read in the source data for calendar
src_calendar_path <- "https://drive.google.com/file/d/1el2cL02nQidzN1pMcZ7AUb3h6wXuzN_E/view?usp=sharing"
src_calendar <- read_csv_gdrive(src_calendar_path)
This will be our date table. Since the lowest unit of time in the data is year, this is the level of detail of the calendar table. This table has 100 rows and 6 columns.
One key piece of cleaning we need to do with this dataset is to provide
a mechanism to group entities that describe the same establishment
(i.e. where there has been a change in URN over time due to conversion
to academy). We can do this using the links
columns at the end of this
data frame, along with the urn
, to create a bridge table, which we’ll
call brg_school
.
The links I am interested in are the predecessor-successor links so that I can keep schools grouped over time. I want the successor school to be retained as the master school and all predecessors to be linked to the master. If a school has no links then we simply list the master and linked URNs as the same number. Therefore, I need to keep all of the open schools as the master list and any predecessors need to be linked to those master URNs.
# Set up the bridge table
brg_school <- src_info %>%
# keep only currently open schools
filter(grepl("^Open.*", establishment_status_name)) %>%
# select required columns
select(master_urn = urn,
status = establishment_status_name,
starts_with("link_")) %>%
# pivot the links
pivot_longer(!c(master_urn, status),
values_drop_na = TRUE) %>%
# drop the name column
select(-name) %>%
# drop the non-useful link types
mutate(value = case_when(
value == "Does not have links" ~ value,
grepl(".*Predecessor.*", value) ~ value,
TRUE ~ "No useful links"
)) %>%
# split the description column into useful data
mutate(linked_urn = as.numeric(str_extract(value,"\\d+")),
link_type = str_trim(str_extract(value,"\\D+"))) %>%
# drop the value column
select(-value) %>%
# add self-urn
mutate(self_urn = master_urn) %>%
pivot_longer(cols = c(linked_urn, self_urn),
names_to = "urn_type",
values_to = "linked_urn",
values_drop_na = TRUE) %>%
# select required columns
select(master_urn,
linked_urn) %>%
distinct()
Now that the bridge table is set up I can filter the src_info
table to
include only the currently open schools. The number of distinct URNs in
this table should then match the number of distinct master URNs in
brg_school
.
# Keep only the currently open schools in info
info <- src_info %>%
filter(grepl("^Open.*",
establishment_status_name))
# Check if the number of distinct URNs in info matches the number of
# distinct master URNs in brg_school
n_distinct(info$urn) == n_distinct(brg_school$master_urn)
## [1] TRUE
The numbers match. Success!
Next, let’s keep only the columns we need from src_info
.
# Keep only the required columns from src_info
info <- info %>%
select(urn:establishment_status_name,
open_date,
close_date,
phase_of_education_name:statutory_high_age,
gender_name,
religious_character_name,
admissions_policy_name,
school_capacity,
number_of_pupils:percentage_fsm,
ofsted_last_insp:head_preferred_job_title,
gor_name:lsoa_name,
ofsted_rating_name,
msoa_code:fsm)
Now we can rename the columns more appropriately.
# Rename columns in info
info <- info %>%
rename(establishment_type = type_of_establishment_name,
establishment_type_group = establishment_type_group_name,
establishment_status = establishment_status_name,
education_phase = phase_of_education_name,
gender = gender_name,
religious_character = religious_character_name,
admissions_policy = admissions_policy_name,
county = county_name,
head_title = head_title_name,
gor = gor_name,
administrative_district = district_administrative_name,
administrative_ward = administrative_ward_name,
parliamentary_constituency = parliamentary_constituency_name,
urban_rural_indicator = urban_rural_name,
gss_la_code = gssla_code_name,
msoa = msoa_name,
lsoa = lsoa_name,
ofsted_rating = ofsted_rating_name,
number_of_fsm = fsm)
And convert the columns to the correct data types.
# Define the columns by type
info_date <- c("open_date",
"close_date",
"ofsted_last_insp",
"last_changed_date")
# Correct the data types in info
info <- info %>%
mutate(across(all_of(info_date), dmy)) %>%
mutate(telephone_num = as.character(telephone_num),
telephone_num = paste0("0", telephone_num))
Let’s now review the selected column headings in this dataset and the number of unique values in each column
# Print a list of the column headings in the info df
info %>%
summarise_all(n_distinct) %>%
transpose(keep.names = "field") %>%
rename(unique_values = V1)
## field unique_values
## 1 urn 20188
## 2 la_code 152
## 3 la_name 152
## 4 establishment_number 2531
## 5 establishment_name 18522
## 6 establishment_type 9
## 7 establishment_type_group 3
## 8 establishment_status 2
## 9 open_date 311
## 10 close_date 8
## 11 education_phase 5
## 12 statutory_low_age 14
## 13 statutory_high_age 13
## 14 gender 4
## 15 religious_character 29
## 16 admissions_policy 4
## 17 school_capacity 1488
## 18 number_of_pupils 1735
## 19 number_of_boys 1050
## 20 number_of_girls 1065
## 21 percentage_fsm 708
## 22 ofsted_last_insp 1487
## 23 last_changed_date 294
## 24 street 13495
## 25 locality 6591
## 26 address3 991
## 27 town 1427
## 28 county 58
## 29 postcode 19440
## 30 school_website 19603
## 31 telephone_num 20035
## 32 head_title 19
## 33 head_first_name 2252
## 34 head_last_name 8079
## 35 head_preferred_job_title 152
## 36 gor 9
## 37 administrative_district 309
## 38 administrative_ward 6249
## 39 parliamentary_constituency 533
## 40 urban_rural_indicator 10
## 41 gss_la_code 153
## 42 easting 19149
## 43 northing 19321
## 44 msoa 6579
## 45 lsoa 15565
## 46 ofsted_rating 6
## 47 msoa_code 6579
## 48 lsoa_code 15565
## 49 number_of_fsm 525
The la_name
and la_code
fields need to be checked to ensure they are
consistent. I refer to the Office for National
Statistics Local Authority Districts (December
2021) Names and Codes in the United
Kingdom
document for the latest reference data for these fields. This document
is included in the Google Drive ref
folder.
Let’s review the unique values of LA code and name in the info
df.
# Review the unique values in la_name
info %>%
select(la_code, gss_la_code, la_name) %>%
distinct() %>%
arrange(la_name, la_code, gss_la_code)
## # A tibble: 184 × 3
## la_code gss_la_code la_name
## <dbl> <chr> <chr>
## 1 301 E09000002 Barking and Dagenham
## 2 302 E09000003 Barnet
## 3 302 X999999 Barnet
## 4 370 E08000016 Barnsley
## 5 800 E06000022 Bath and North East Somerset
## 6 800 X999999 Bath and North East Somerset
## 7 822 E06000055 Bedford
## 8 303 E09000004 Bexley
## 9 330 E08000025 Birmingham
## 10 330 X999999 Birmingham
## # … with 174 more rows
## # ℹ Use `print(n = ...)` to see more rows
We can see that some of the gss-la-code
values here are wrong. It
looks like a value of X999999
has been used where the true value could
not be determined when the source data was put together. We need to
replace this value with the correct value according to the reference
table.
# Correct the erroneous gss_la_code fields
info <- info %>%
group_by(la_code, la_name) %>%
mutate(gss_la_code = min(gss_la_code)) %>%
ungroup()
# Check the results
info_la <- info %>%
select(la_code, gss_la_code, la_name) %>%
distinct() %>%
arrange(la_name, la_code, gss_la_code)
Some columns have relatively few unique values, making these columns clearly categorical. I’d like to review the values to ensure they are consistent and well labelled.
# Review the unique values in establishment type
unique(info$establishment_type)
## [1] "Voluntary aided school" "Community school"
## [3] "Foundation school" "Voluntary controlled school"
## [5] "Academy sponsor led" "Academy converter"
## [7] "Free schools" "Studio schools"
## [9] "University technical college"
These values all look ok. This page has some useful definitions of the different types of school.
# Review the unique values in establishment group
unique(info$establishment_type_group)
## [1] "Local authority maintained schools" "Academies"
## [3] "Free Schools"
All ok.
# Review the unique values in establishment status
unique(info$establishment_status)
## [1] "Open" "Open, but proposed to close"
All ok.
# Review the unique values in phase of education
unique(info$education_phase)
## [1] "Primary" "Secondary"
## [3] "All-through" "Middle deemed secondary"
## [5] "Middle deemed primary"
All ok.
# Review the unique values in gender
unique(info$gender)
## [1] "Mixed" "Girls" "Boys" NA
Some missing values here. Let’s replace those with ‘Not reported’.
# Convert missing values in gender
info$gender[is.na(info$gender)] <- "Not reported"
# Check the results
unique(info$gender)
## [1] "Mixed" "Girls" "Boys" "Not reported"
All ok.
# Review the unique values in religious character
unique(info$religious_character)
## [1] "Church of England"
## [2] "Does not apply"
## [3] "Roman Catholic"
## [4] "Catholic"
## [5] "None"
## [6] "Jewish"
## [7] "Muslim"
## [8] "Church of England/Christian"
## [9] "Church of England/Methodist"
## [10] "Church of England/Free Church"
## [11] "Methodist"
## [12] "Multi-faith"
## [13] "Church of England/Roman Catholic"
## [14] "Christian/non-denominational"
## [15] "Roman Catholic/Church of England"
## [16] "Christian"
## [17] NA
## [18] "United Reformed Church"
## [19] "Quaker"
## [20] "Church of England/United Reformed Church"
## [21] "Sikh"
## [22] "Seventh Day Adventist"
## [23] "Hindu"
## [24] "Greek Orthodox"
## [25] "Orthodox Jewish"
## [26] "Anglican"
## [27] "Anglican/Church of England"
## [28] "Methodist/Church of England"
## [29] "Church of England/Methodist/United Reform Church/Baptist"
Again, let’s fix the missing values.
# Convert missing values in religious character
info$religious_character[is.na(info$religious_character)] <-
"Not reported"
# Check the results
unique(info$religious_character)
## [1] "Church of England"
## [2] "Does not apply"
## [3] "Roman Catholic"
## [4] "Catholic"
## [5] "None"
## [6] "Jewish"
## [7] "Muslim"
## [8] "Church of England/Christian"
## [9] "Church of England/Methodist"
## [10] "Church of England/Free Church"
## [11] "Methodist"
## [12] "Multi-faith"
## [13] "Church of England/Roman Catholic"
## [14] "Christian/non-denominational"
## [15] "Roman Catholic/Church of England"
## [16] "Christian"
## [17] "Not reported"
## [18] "United Reformed Church"
## [19] "Quaker"
## [20] "Church of England/United Reformed Church"
## [21] "Sikh"
## [22] "Seventh Day Adventist"
## [23] "Hindu"
## [24] "Greek Orthodox"
## [25] "Orthodox Jewish"
## [26] "Anglican"
## [27] "Anglican/Church of England"
## [28] "Methodist/Church of England"
## [29] "Church of England/Methodist/United Reform Church/Baptist"
All ok.
# Review the unique values in admissions policy
unique(info$admissions_policy)
## [1] "Not applicable" "Non-selective" "Selective" NA
Again let’s fix the missing values.
# Convert missing values in admissions policy
info$admissions_policy[is.na(info$admissions_policy)] <-
"Not reported"
# Check the results
unique(info$admissions_policy)
## [1] "Not applicable" "Non-selective" "Selective" "Not reported"
All ok.
# Review the unique values in ofsted rating
unique(info$ofsted_rating)
## [1] "Outstanding" "Good" "Requires improvement"
## [4] "Special Measures" "Serious Weaknesses" NA
Some recoding is necessary here. According to Ofsted, school inspections use a 4-point grading scale:
- grade 1 - outstanding
- grade 2 - good
- grade 3 - requires improvement
- grade 4 - inadequate
In our dataset, I see some other values of Serious Weaknesses
and
Special Measures
, which are not on the ratings list provided by
Ofsted. In fact, both of these ratings are a
subset
of the “inadequate” (grade 4) rating. Therefore, I can recode both of
these values as inadequate
. I can also add in the grade points to the
ratings dimension table to provide a quantitative and aggregatable
measure.
# Fix values for ofsted rating
info <- info %>%
mutate(ofsted_rating = case_when(
grepl("Weakness", ofsted_rating) ~ "Inadequate",
grepl("Measures", ofsted_rating) ~ "Inadequate",
is.na(ofsted_rating) ~ "Not reported",
TRUE ~ ofsted_rating
),
ofsted_rating_score = case_when(
grepl("Outstanding", ofsted_rating) ~ 1,
grepl("Good", ofsted_rating) ~ 2,
grepl("improvement", ofsted_rating) ~ 3,
grepl("Inadequate", ofsted_rating) ~ 4
))
# Check results
unique(info$ofsted_rating)
## [1] "Outstanding" "Good" "Requires improvement"
## [4] "Inadequate" "Not reported"
unique(info$ofsted_rating_score)
## [1] 1 2 3 4 NA
From the src_offla
dataset we need to keep the identifiers for local
authority, year and phase as well as the numerical data. We should
filter the source data to keep only data at the geographic_level
of
“Local authority”.
# Select the required columns from offla and correct data types
offla <- src_offla %>%
mutate(across(everything(), as.character)) %>%
filter(geographic_level == "Local authority") %>%
select(-c(nc_year_admission,
time_identifier,
geographic_level,
country_code,
country_name,
one_of_the_three_preference_offers,
ends_with("_percent"))) %>%
type_convert(na = "c")
Now we can rename the columns more appropriately.
# Rename the columns
offla <- offla %>%
rename(year = time_period,
gss_la_code = new_la_code,
la_code = old_la_code,
phase = school_phase,
number_of_preferences = no_of_preferences)
Now let’s review the entries for the local authority identifiers and
check which rows in offla
are not present in info_la
and therefore
need to be changed.
# Review the unique values of la_name
offla %>%
select(la_code, gss_la_code, la_name) %>%
distinct() %>%
arrange(la_name, la_code) %>%
anti_join(info_la)
## # A tibble: 7 × 3
## la_code gss_la_code la_name
## <dbl> <chr> <chr>
## 1 837 E06000028 Bournemouth
## 2 825 E10000002 Buckinghamshire
## 3 835 E10000009 Dorset
## 4 390 E08000020 Gateshead
## 5 928 E10000021 Northamptonshire
## 6 929 E06000048 Northumberland
## 7 836 E06000029 Poole
Let’s make the necessary replacements.
# Recode the LA identifier fields in offla
offla <- offla %>%
mutate(
la_code = case_when(
la_code == 837 ~ 839,
la_code == 835 ~ 838,
la_code == 928 ~ 940,
la_code == 836 ~ 839,
TRUE ~ la_code
),
gss_la_code = case_when(
gss_la_code == "E06000028" ~ "E06000058",
gss_la_code == "E10000002" ~ "E06000060",
gss_la_code == "E10000009" ~ "E06000059",
gss_la_code == "E08000020" ~ "E08000037",
gss_la_code == "E10000021" ~ "E06000061",
gss_la_code == "E06000048" ~ "E06000057",
gss_la_code == "E06000029" ~ "E06000058",
TRUE ~ gss_la_code
),
la_name = case_when(
la_name == "Bournemouth" ~ "Bournemouth, Christchurch and Poole",
la_name == "Northamptonshire" ~ "North Northamptonshire",
la_name == "Poole" ~ "Bournemouth, Christchurch and Poole",
TRUE ~ la_name
)
)
# Extra rows because Northamptonshire is now two distinct regions
offla_extra <- tibble(
year = c(202122, 202122, 202021, 202021, 201920, 201920, 201819, 201819,
201718, 201718, 201617, 201617, 201516, 201516, 201415, 201415),
region_code = rep("E12000004", 16),
region_name = rep("East Midlands", 16),
gss_la_code = rep("E06000062", 16),
la_code = rep(941, 16),
la_name = rep("West Northamptonshire", 16),
phase = rep(c("Primary", "Secondary"), 8),
number_of_preferences = rep(3,16)
)
# Bind extra rows on to prefs
offla <- offla %>%
bind_rows(offla_extra) %>%
group_by_at(vars(-number_of_preferences)) %>%
slice_max(number_of_preferences) %>%
ungroup() %>%
distinct()
# Check the results
offla_la <- offla %>%
select(la_code, gss_la_code, la_name) %>%
distinct() %>%
arrange(la_name, la_code)
offla_la %>%
anti_join(info_la)
## # A tibble: 0 × 3
## # … with 3 variables: la_code <dbl>, gss_la_code <chr>, la_name <chr>
## # ℹ Use `colnames()` to see all variable names
# Review the unique values of year
unique(offla$year)
## [1] 201415 201516 201617 201718 201819 201920 202021 202122 202223
These values should be recoded to match the year_key
in the calendar
table. We need to extract the academic start year, which is the first 4
digits of the value.
# Recode the values in year
offla$year <- as.numeric(str_extract(offla$year, "^\\d{4}"))
# Check the results
unique(offla$year)
## [1] 2014 2015 2016 2017 2018 2019 2020 2021 2022
Let’s keep only the columns we need from src_offers
. Some of the
columns are duplicates of columns already in the info
table, and some
are just not required.
# Select required columns from offers
offers <- src_offers %>%
select(time_period,
region_code,
region_name,
old_la_code,
new_la_code,
la_name,
total_number_places_offered:offers_to_applicants_from_another_la,
denomination,
school_urn,
entry_year)
Next we can rename the columns more appropriately.
# Rename the columns
offers <- offers %>%
rename(year = time_period,
la_code = old_la_code,
gss_la_code = new_la_code,
religious_denomination = denomination,
urn = school_urn,
phase = entry_year)
Next we can filter out any records where the school_urn
is n/a
and
convert the remaining values to numeric.
# Remove entries with null URN
offers <- offers %>%
filter(urn != "n/a") %>%
mutate(urn = as.numeric(urn))
Now we can check and correct the data types.
# Define cols to be converted
offers_num <- c("proportion_1stprefs_v_1stprefoffers",
"proportion_1stprefs_v_totaloffers",
"offers_to_applicants_from_another_la")
# Correct data types in offers
offers <- offers %>%
mutate(across(all_of(offers_num), as.numeric))
Let’s now review the selected column headings in this dataset and the number of unique values in each column.
# Print a list of the column headings in the offers df
offers %>%
summarise_all(n_distinct) %>%
transpose(keep.names = "field") %>%
rename(unique_values = V1)
## field unique_values
## 1 year 9
## 2 region_code 10
## 3 region_name 10
## 4 la_code 155
## 5 gss_la_code 155
## 6 la_name 157
## 7 total_number_places_offered 425
## 8 number_preferred_offers 425
## 9 number_1st_preference_offers 403
## 10 number_2nd_preference_offers 108
## 11 number_3rd_preference_offers 61
## 12 times_put_as_any_preferred_school 1599
## 13 times_put_as_1st_preference 593
## 14 times_put_as_2nd_preference 539
## 15 times_put_as_3rd_preference 438
## 16 proportion_1stprefs_v_1stprefoffers 9826
## 17 proportion_1stprefs_v_totaloffers 15429
## 18 all_applications_from_another_la 760
## 19 offers_to_applicants_from_another_la 197
## 20 religious_denomination 3
## 21 urn 24404
## 22 phase 3
Let’s review the categorical values to ensure they are consistent and well labelled.
# Review the unique values of la_name
offers %>%
select(la_code, gss_la_code, la_name) %>%
distinct() %>%
arrange(la_name, la_code)
## # A tibble: 160 × 3
## la_code gss_la_code la_name
## <dbl> <chr> <chr>
## 1 301 E09000002 Barking and Dagenham
## 2 302 E09000003 Barnet
## 3 370 E08000016 Barnsley
## 4 800 E06000022 Bath and North East Somerset
## 5 822 E06000055 Bedford
## 6 303 E09000004 Bexley
## 7 330 E08000025 Birmingham
## 8 889 E06000008 Blackburn with Darwen
## 9 890 E06000009 Blackpool
## 10 350 E08000001 Bolton
## # … with 150 more rows
## # ℹ Use `print(n = ...)` to see more rows
Some of these entries need to be corrected to match the values in
info
.
offers_la <- offers %>%
select(urn,
year,
la_code,
gss_la_code,
la_name,
region_code,
region_name) %>%
distinct() %>%
arrange(urn,
-year,
la_code,
la_name) %>%
group_by(urn) %>%
slice_head(n = 1) %>%
ungroup() %>%
inner_join(brg_school,
by = c("urn" = "linked_urn")) %>%
left_join(select(info,
master_urn = urn,
master_la_code = la_code,
master_gss_la_code = gss_la_code,
master_la_name = la_name)) %>%
mutate(la_code = coalesce(master_la_code, la_code),
gss_la_code = coalesce(master_gss_la_code, gss_la_code),
la_name = coalesce(master_la_name, la_name)) %>%
select(urn,
master_urn,
la_code,
gss_la_code,
la_name,
region_code,
region_name) %>%
distinct()
offers <- offers %>%
select(-c(region_code,
region_name,
la_code,
gss_la_code,
la_name)) %>%
inner_join(offers_la)
# Check the results
offers_la <- offers %>%
select(la_code, gss_la_code, la_name) %>%
distinct() %>%
arrange(la_name, la_code)
# Review the unique values of year
unique(offers$year)
## [1] 202223 202122 202021 201920 201819 201718 201617 201516 201415
These values should be recoded to match the year_key
in the calendar
table. We need to extract the academic start year, which is the first 4
digits of the value.
# Recode the values in year
offers$year <- as.numeric(str_extract(offers$year, "^\\d{4}"))
# Check the results
unique(offers$year)
## [1] 2022 2021 2020 2019 2018 2017 2016 2015 2014
All ok.
# Review the unique values in denomination
unique(offers$religious_denomination)
## [1] "Faith" "No religious character" "n/a"
There are some values of n/a
here. Let’s see what those are.
# Review the n/a values in denomination
offers %>%
filter(religious_denomination == "n/a") %>%
head()
## # A tibble: 6 × 23
## year total_…¹ numbe…² numbe…³ numbe…⁴ numbe…⁵ times…⁶ times…⁷ times…⁸ times…⁹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2022 61 61 61 0 0 132 98 20 14
## 2 2022 30 30 29 1 0 101 44 47 10
## 3 2022 30 30 29 1 0 79 38 32 9
## 4 2022 35 34 32 2 0 68 32 20 16
## 5 2022 59 59 58 1 0 137 68 57 12
## 6 2022 30 30 30 0 0 112 36 49 26
## # … with 13 more variables: proportion_1stprefs_v_1stprefoffers <dbl>,
## # proportion_1stprefs_v_totaloffers <dbl>,
## # all_applications_from_another_la <dbl>,
## # offers_to_applicants_from_another_la <dbl>, religious_denomination <chr>,
## # urn <dbl>, phase <chr>, master_urn <dbl>, la_code <dbl>, gss_la_code <chr>,
## # la_name <chr>, region_code <chr>, region_name <chr>, and abbreviated
## # variable names ¹total_number_places_offered, ²number_preferred_offers, …
## # ℹ Use `colnames()` to see all variable names
I guess this must be for schools who did not report their denomination.
Let’s replace n/a
with Not reported
.
# Replace n/a
offers$religious_denomination[offers$religious_denomination == "n/a"] <- "Not reported"
# Check the results
unique(offers$religious_denomination)
## [1] "Faith" "No religious character" "Not reported"
All ok.
# Review the unique values in phase
unique(offers$phase)
## [1] "R" "7" "9"
Here we need to make some changes so that this column can be used to
refer to the educational phase of the data. We need to recode these
values to match the phases in our phase table (primary or secondary).
The value R
corresponds to primary and 7
and 9
both correspond to
secondary.
# Recode the phase values
offers <- offers %>%
mutate(phase = case_when(
phase == "R" ~ "Primary",
phase %in% c("7", "9") ~ "Secondary",
TRUE ~ phase
))
# Check the results
unique(offers$phase)
## [1] "Primary" "Secondary"
From this dataset we really need to select one or two good measures of performance for KS2 and KS4, along with the URN so that we can join back to school.
- For KS2 we’ll used the progress measures
readprog
,writprog
andmatprog
. This article has a good explanation of the KS2 progress scores, and this article explains how the bandings can be calculated. - For KS4 we’ll use the progress 8 and attainment 8 measures
p8mea
andatt8scr
. This article has a good explanation of these measures.
We will further need to filter both the KS2 and KS4 data to include only
rows that are aggregated at the mainstream school level (RECTYPE = 1
).
# Clean the KS2 data
# Bind rows from all years
perf_ks2 <- bind_rows(
"2015" = src_perf_1516ks2,
"2016" = src_perf_1617ks2,
"2017" = src_perf_1718ks2,
"2018" = src_perf_1819ks2,
.id = "year"
) %>%
# Filter to rectype == 1
filter(rectype == 1) %>%
# Select the required columns
select(year,
urn,
readprog,
writprog,
matprog) %>%
# convert to numeric (any strings can be converted to NA)
mutate(across(everything(), as.numeric)) %>%
# add a column for phase
mutate(phase = "Primary")
# View the head
head(perf_ks2)
## # A tibble: 6 × 6
## year urn readprog writprog matprog phase
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2015 100000 2.7 2.2 3 Primary
## 2 2015 100028 2.6 4 3.5 Primary
## 3 2015 100029 2.3 1.4 0.3 Primary
## 4 2015 130342 3.4 2.2 2.3 Primary
## 5 2015 100013 2 1.4 6 Primary
## 6 2015 100027 3.5 0.4 2.7 Primary
# Bind rows from all years
perf_ks4 <- bind_rows(
"2015" = select(src_perf_1516ks4, rectype, urn, p8mea, att8scr),
"2016" = select(src_perf_1617ks4, rectype, urn, p8mea, att8scr),
"2017" = select(src_perf_1718ks4, rectype, urn, p8mea, att8scr),
"2018" = select(src_perf_1819ks4, rectype, urn, p8mea, att8scr),
.id = "year"
) %>%
# Filter to rectype == 1
filter(rectype == 1) %>%
# Select the required columns
select(year,
urn,
p8mea,
att8scr) %>%
# convert to numeric (any strings can be converted to NA)
mutate(across(everything(), as.numeric)) %>%
# add the phase
mutate(phase = "Secondary")
# View the head
head(perf_ks4)
## # A tibble: 6 × 5
## year urn p8mea att8scr phase
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2015 100003 NA 42.1 Secondary
## 2 2015 100001 NA 32.2 Secondary
## 3 2015 100053 -0.26 50.1 Secondary
## 4 2015 100054 0.31 60.1 Secondary
## 5 2015 137333 NA 3.4 Secondary
## 6 2015 100084 NA 49.4 Secondary
Now we can join the KS2 and KS4 data together and recode the progress measure names.
# Combine two data frames
performance <- perf_ks2 %>%
full_join(perf_ks4) %>%
rename(reading_progress = readprog,
writing_progress = writprog,
maths_progress = matprog,
progress_8 = p8mea,
attainment_8 = att8scr)
# View the head
head(performance)
## # A tibble: 6 × 8
## year urn reading_progress writing_progress maths_p…¹ phase progr…² attai…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2015 100000 2.7 2.2 3 Prim… NA NA
## 2 2015 100028 2.6 4 3.5 Prim… NA NA
## 3 2015 100029 2.3 1.4 0.3 Prim… NA NA
## 4 2015 130342 3.4 2.2 2.3 Prim… NA NA
## 5 2015 100013 2 1.4 6 Prim… NA NA
## 6 2015 100027 3.5 0.4 2.7 Prim… NA NA
## # … with abbreviated variable names ¹maths_progress, ²progress_8, ³attainment_8
No cleaning required. We can simply rename without the src_
prefix.
# Rename the variable
phase <- src_phase
No cleaning required. We can simply rename without the src_
prefix.
# Rename the variable
calendar <- src_calendar
The matrix below shows the fact and dimension tables that I intend to create for this model and the relationships between them.
Fact/Dimension | dim_la | dim_school | dim_calendar | dim_phase |
---|---|---|---|---|
fact_la_year | x | x | x | |
fact_school_year | x | x | x | x |
fact_school | x | x |
Fact and dimension tables
To build out the dimensional model for this data, we need to review all
of the columns we intend to use, so that we can determine where they fit
in the model. The custom function describe_df()
contained in the
utils.R
file can help us with this.
# Create a vector of dfs to describe
data <- list("info" = info,
"offers_by_la" = offla,
"offers" = offers,
"performance" = performance,
"phase" = phase,
"calendar" = calendar)
# Run describe_df() over each df
t_data <- lapply(data, describe_df)
# Bind each df into a single df
t_data <- bind_rows(t_data, .id = "source_table")
After exporting the new t_data
dataframe to a spreadsheet, I created a
file named
star-schema
which maps the raw source data to the corresponding dimension and fact
tables for the star schema model.
# Read in star schema planning document
star_schema_path <- "https://docs.google.com/spreadsheets/d/1rrlhlvFrnPpTQtn2YzWyJZzwlm-2wWAzDiEv6fKBoHU/edit?usp=sharing"
star_schema <- read_sheet(star_schema_path,
col_types = "c")
The following columns are required for the local authority dimension table:
# Which columns are needed for the local authority dimension?
star_schema %>%
filter(model_table == "dim_la") %>%
kable()
source_table | source_field | source_value_example | model_table |
---|---|---|---|
info | la_code | 891 | dim_la |
info | la_name | Nottinghamshire | dim_la |
info | gss_la_code | E10000024 | dim_la |
offers_by_la | region_code | E12000001 | dim_la |
offers_by_la | region_name | North East | dim_la |
offers_by_la | gss_la_code | E06000004 | dim_la |
offers_by_la | la_code | 808 | dim_la |
offers_by_la | la_name | Stockton-on-Tees | dim_la |
offers | region_code | E13000001 | dim_la |
offers | region_name | Inner London | dim_la |
offers | la_code | 201 | dim_la |
offers | la_name | City of London | dim_la |
offers | gss_la_code | E09000001 | dim_la |
These columns can all come from the offers
and info
data frames.
# Build out the local authority dimension table
dim_la <- offers %>%
select(la_code,
gss_la_code,
la_name,
region_code,
region_name) %>%
distinct() %>%
full_join(info_la) %>%
mutate(region_code = case_when(la_code == 420 ~ "E12000009",
TRUE ~ region_code),
region_name = case_when(la_code == 420 ~ "South West",
TRUE ~ region_name))
# Check the result
head(dim_la)
## # A tibble: 6 × 5
## la_code gss_la_code la_name region_code region_name
## <dbl> <chr> <chr> <chr> <chr>
## 1 201 E09000001 City of London E13000001 Inner London
## 2 202 E09000007 Camden E13000001 Inner London
## 3 203 E09000011 Greenwich E13000002 Outer London
## 4 204 E09000012 Hackney E13000001 Inner London
## 5 205 E09000013 Hammersmith and Fulham E13000001 Inner London
## 6 206 E09000019 Islington E13000001 Inner London
Let’s assign a unique key to each record, and add a record for the
null
case, which we’ll call Not reported
.
# Specify the null value
dim_la_null <- tibble(la_key = 0,
la_code = 0,
la_name = "Not reported",
gss_la_code = "Not reported",
region_code = "Not reported",
region_name = "Not reported")
# Assign a key and row for null values
dim_la <- dim_la %>%
rowid_to_column(var = "la_key") %>%
bind_rows(dim_la_null) %>%
arrange(la_key)
The following fields are required for the school dimension.
# Which columns are needed for the school dimension?
star_schema %>%
filter(model_table == "dim_school") %>%
kable()
source_table | source_field | source_value_example | model_table |
---|---|---|---|
info | urn | 131560 | dim_school |
info | establishment_number | 3793 | dim_school |
info | establishment_name | Blidworth Oaks Primary School | dim_school |
info | establishment_type | Community school | dim_school |
info | establishment_type_group | Local authority maintained schools | dim_school |
info | establishment_status | Open, but proposed to close | dim_school |
info | open_date | 13604 | dim_school |
info | close_date | 19204 | dim_school |
info | education_phase | Primary | dim_school |
info | statutory_low_age | 2 | dim_school |
info | statutory_high_age | 11 | dim_school |
info | gender | Mixed | dim_school |
info | religious_character | Does not apply | dim_school |
info | admissions_policy | Not applicable | dim_school |
info | ofsted_last_insp | 17344 | dim_school |
info | last_changed_date | 19166 | dim_school |
info | street | Haywood Avenue | dim_school |
info | locality | Blidworth | dim_school |
info | address3 | Blidworth Oaks Primary School | dim_school |
info | town | Mansfield | dim_school |
info | county | Nottinghamshire | dim_school |
info | postcode | NG21 0RE | dim_school |
info | school_website | www.blidworthoaks.co.uk | dim_school |
info | telephone_num | 1623792348 | dim_school |
info | head_title | Mr | dim_school |
info | head_first_name | Shaun | dim_school |
info | head_last_name | Walker | dim_school |
info | head_preferred_job_title | Headteacher | dim_school |
info | gor | East Midlands | dim_school |
info | administrative_district | Newark and Sherwood | dim_school |
info | administrative_ward | Rainworth South & Blidworth | dim_school |
info | parliamentary_constituency | Sherwood | dim_school |
info | urban_rural_indicator | (England/Wales) Rural town and fringe | dim_school |
info | easting | 459277 | dim_school |
info | northing | 356444 | dim_school |
info | msoa | Newark and Sherwood 006 | dim_school |
info | lsoa | Newark and Sherwood 006B | dim_school |
info | ofsted_rating | Good | dim_school |
info | msoa_code | E02005898 | dim_school |
info | lsoa_code | E01028298 | dim_school |
offers | religious_denomination | Faith | dim_school |
offers | urn | 100000 | dim_school |
performance | urn | 137157 | dim_school |
These columns can all be found in the info
and offers
tables. We
need to ensure that we include only the URNs in the master_urn
column
of our bridge table.
# Define columns to include
dim_school_cols_info <- star_schema %>%
filter(model_table == "dim_school",
source_table == "info") %>%
pull(source_field)
dim_school_cols_offers <- star_schema %>%
filter(model_table == "dim_school",
source_table == "offers") %>%
pull(source_field)
# Define which rows from offers should be joined (as there can be more than
# one record per URN)
dim_school_offers <- offers %>%
select(all_of(dim_school_cols_offers),
year,
phase) %>%
distinct() %>%
group_by(urn) %>%
slice_max(year) %>%
slice_max(phase) %>%
ungroup()
# Build out the school dimension table
dim_school <- brg_school %>%
select(urn = master_urn) %>%
distinct() %>%
left_join(select(info,
all_of(dim_school_cols_info))) %>%
left_join(select(dim_school_offers,
all_of(dim_school_cols_offers)))
Let’s assign a unique key to each record, and add a record for the
null
case, which we’ll call Not reported
.
# Specify the null value
dim_school_null <- tibble(school_key = 0,
urn = 0,
establishment_number = 0,
establishment_name = "Not reported",
establishment_type = "Not reported",
establishment_type_group = "Not reported",
establishment_status = "Not reported",
open_date = NA,
close_date = NA,
education_phase = "Not reported",
statutory_low_age = NA,
statutory_high_age = NA,
gender = "Not reported",
religious_character = "Not reported",
admissions_policy = "Not reported",
ofsted_last_insp = NA,
last_changed_date = NA,
gor = "Not reported",
administrative_district = "Not reported",
administrative_ward = "Not reported",
parliamentary_constituency = "Not reported",
urban_rural_indicator = "Not reported",
msoa_code = "Not reported",
lsoa_code = "Not reported",
ofsted_rating = "Not reported",
religious_denomination = "Not reported")
# Assign a key and row for null values
dim_school <- dim_school %>%
rowid_to_column(var = "school_key") %>%
bind_rows(dim_school_null) %>%
arrange(school_key)
The following columns are needed for the phase dimension:
# Which columns are needed for the phase dimension?
star_schema %>%
filter(model_table == "dim_phase") %>%
kable()
source_table | source_field | source_value_example | model_table |
---|---|---|---|
offers_by_la | phase | Primary | dim_phase |
offers | phase | Primary | dim_phase |
performance | phase | Primary | dim_phase |
phase | phase_key | 1 | dim_phase |
phase | phase_name | Primary | dim_phase |
phase | phase_intake_year | Reception | dim_phase |
phase | performance_key_stage | KS2 | dim_phase |
For this table we just need to use the phase
data frame as it already
exists.
# Build the dim_phase table
dim_phase <- phase
For this table we just need to use the calendar
data frame as it
already exists.
# Build the calendar dimension
dim_calendar <- calendar
To build the school facts table we will need to following columns, plus keys for all dimensions:
# Which columns are needed for the school fact table?
star_schema %>%
filter(model_table == "fact_school") %>%
kable()
source_table | source_field | source_value_example | model_table |
---|---|---|---|
info | school_capacity | 315 | fact_school |
info | number_of_pupils | 382 | fact_school |
info | number_of_boys | 208 | fact_school |
info | number_of_girls | 174 | fact_school |
info | percentage_fsm | 30.6 | fact_school |
info | number_of_fsm | 102 | fact_school |
info | ofsted_rating_score | 2 | fact_school |
These fields can all come from the info
table.
# Define columns to include
fct_school_cols <- star_schema %>%
filter(model_table == "fact_school",
source_table == "info") %>%
pull(source_field)
# Build out the school fact table
fct_school <- info %>%
# select columns
select(linked_urn = urn,
la_code,
la_name,
all_of(fct_school_cols)) %>%
# join school key (via bridge)
left_join(brg_school) %>%
left_join(select(dim_school,
master_urn = urn,
school_key)) %>%
# join la key
left_join(select(dim_la,
la_code,
la_name,
la_key)) %>%
# select final columns
select(school_key,
la_key,
all_of(fct_school_cols)) %>%
# deal with duplicates
group_by(school_key,
la_key) %>%
mutate(fct_school_count = n()) %>%
summarise_all(mean, na.rm = TRUE) %>%
ungroup() %>%
mutate(fct_school_flg = if_else(fct_school_count == 1,
"Actual",
"Average"))
To build the school year facts table we need the following columns:
# Which columns are needed for the school year fact table?
star_schema %>%
filter(model_table == "fact_school_year") %>%
kable()
source_table | source_field | source_value_example | model_table |
---|---|---|---|
offers | total_number_places_offered | 30 | fact_school_year |
offers | number_preferred_offers | 30 | fact_school_year |
offers | number_1st_preference_offers | 30 | fact_school_year |
offers | number_2nd_preference_offers | 0 | fact_school_year |
offers | number_3rd_preference_offers | 0 | fact_school_year |
offers | times_put_as_any_preferred_school | 81 | fact_school_year |
offers | times_put_as_1st_preference | 42 | fact_school_year |
offers | times_put_as_2nd_preference | 16 | fact_school_year |
offers | times_put_as_3rd_preference | 6 | fact_school_year |
offers | proportion_1stprefs_v_1stprefoffers | 1.4 | fact_school_year |
offers | proportion_1stprefs_v_totaloffers | 1.4 | fact_school_year |
offers | all_applications_from_another_la | 62 | fact_school_year |
offers | offers_to_applicants_from_another_la | 16 | fact_school_year |
performance | reading_progress | 2.8 | fact_school_year |
performance | writing_progress | 4 | fact_school_year |
performance | maths_progress | 0.5 | fact_school_year |
performance | progress_8 | 0.47 | fact_school_year |
performance | attainment_8 | 65 | fact_school_year |
These columns all come from offers
, prefs
and performance
.
# Define columns to include
fct_school_year_cols_offers <- star_schema %>%
filter(model_table == "fact_school_year",
source_table == "offers") %>%
pull(source_field)
fct_school_year_cols_perf <- star_schema %>%
filter(model_table == "fact_school_year",
source_table == "performance") %>%
pull(source_field)
# Build out the school year fact table
fct_school_year <- offers %>%
select(linked_urn = urn,
la_code,
la_name,
year,
phase_name = phase,
all_of(fct_school_year_cols_offers)) %>%
left_join(select(performance,
linked_urn = urn,
year,
phase_name = phase,
all_of(fct_school_year_cols_perf))) %>%
# join school key (via bridge)
inner_join(brg_school) %>%
left_join(select(dim_school,
master_urn = urn,
school_key)) %>%
# join la key
left_join(select(dim_la,
la_code,
la_name,
la_key)) %>%
# join phase key
left_join(select(dim_phase,
phase_name,
phase_key)) %>%
# select final columns
select(school_key,
la_key,
year_key = year,
phase_key,
all_of(fct_school_year_cols_offers),
all_of(fct_school_year_cols_perf)) %>%
# deal with duplicates
group_by(school_key,
la_key,
phase_key,
year_key) %>%
mutate(fct_school_year_count = n()) %>%
summarise_all(mean, na.rm = TRUE) %>%
ungroup() %>%
mutate(fct_school_year_flg = if_else(fct_school_year_count == 1,
"Actual",
"Average"))
To build the LA year facts table we need the following columns:
# Which columns are needed for the LA year fact table?
star_schema %>%
filter(model_table == "fact_la_year") %>%
kable()
source_table | source_field | source_value_example | model_table |
---|---|---|---|
offers_by_la | admission_numbers | 2626 | fact_la_year |
offers_by_la | applications_received | 2386 | fact_la_year |
offers_by_la | online_applications | 1638 | fact_la_year |
offers_by_la | number_of_preferences | 3 | fact_la_year |
offers_by_la | first_preference_offers | 2233 | fact_la_year |
offers_by_la | second_preference_offers | 97 | fact_la_year |
offers_by_la | third_preference_offers | 25 | fact_la_year |
offers_by_la | preferred_school_offer | 2355 | fact_la_year |
offers_by_la | non_preferred_offer | 0 | fact_la_year |
offers_by_la | no_offer | 31 | fact_la_year |
offers_by_la | schools_in_la_offer | 2333 | fact_la_year |
offers_by_la | schools_in_another_la_offer | 22 | fact_la_year |
offers_by_la | offers_to_nonapplicants | 0 | fact_la_year |
These columns all come from offers_by_la
.
# Define columns to include
fct_la_year_cols <- star_schema %>%
filter(model_table == "fact_la_year") %>%
pull(source_field)
# Build out the LA year fact table
fct_la_year <- offla %>%
select(la_code,
la_name,
year,
phase_name = phase,
all_of(fct_la_year_cols)) %>%
# join la key
left_join(select(dim_la,
la_code,
la_name,
la_key)) %>%
# join phase key
left_join(select(dim_phase,
phase_name,
phase_key)) %>%
# select final columns
select(la_key,
year_key = year,
phase_key,
all_of(fct_la_year_cols)) %>%
# deal with duplicates
group_by(la_key,
phase_key,
year_key) %>%
mutate(fct_la_year_count = n()) %>%
summarise_all(mean, na.rm = TRUE) %>%
ungroup() %>%
mutate(fct_la_year_flg = if_else(fct_la_year_count == 1,
"Actual",
"Average"))
Finally we can join together the three individual fact tables to create a single fact table which will be more useful for Tableau.
# Build a table for all facts
fct_all <- fct_school_year %>%
left_join(fct_school) %>%
left_join(fct_la_year)
The final tables for the data model need to be saved as Rdata files, so that they can be accessed by Tableau.
# Define folder location for save
export_path <- "data-out/"
# Save objects
save(dim_la, file = paste0(export_path,
"dim_la.Rda"))
save(dim_school, file = paste0(export_path,
"dim_school.Rda"))
save(dim_calendar, file = paste0(export_path,
"dim_calendar.Rda"))
save(dim_phase, file = paste0(export_path,
"dim_phase.Rda"))
save(fct_school, file = paste0(export_path,
"fct_school.Rda"))
save(fct_school_year, file = paste0(export_path,
"fct_school_year.Rda"))
save(fct_la_year, file = paste0(export_path,
"fct_la_year.Rda"))
save(fct_all, file = paste0(export_path,
"fct_all.Rda"))