-
Notifications
You must be signed in to change notification settings - Fork 0
/
data-prep.Rmd
1290 lines (1049 loc) · 43 KB
/
data-prep.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Source data preparation for the England school admissions dashboard"
author: "Clare Gibson"
output:
github_document:
toc: true
toc_depth: 3
---
# Setup
```{r setup}
# Load packages
library(knitr) # for Rmd options
library(tidyverse) # for general wrangling
library(data.table) # for transposing data frames
library(janitor) # for cleaning column headers
library(lubridate) # for working with dates
library(googlesheets4) # for reading in Google Sheets files
# Knitr Options
knitr::opts_chunk$set(
echo = TRUE,
fig.align = "center",
message = FALSE,
warning = FALSE
)
```
# Introduction
This notebook serves as the documented code to read and prepare data for the England School Admissions Dashboard project ([github](https://github.com/clarelgibson/england-school-admissions) \| [tableau](https://public.tableau.com/views/Schools_16505251102060/Home?:language=en-GB&:display_count=n&:origin=viz_share_link)).
The notebook begins by reading in the source data required for the dashboards. It then works through the data wrangling steps necessary to clean and prepare the data for transfer to Tableau.
## Source data
The full list of data sources, download links and retrieval steps can be found in the [project wiki](https://github.com/clarelgibson/england-school-admissions/wiki/Source-data).
## Custom functions
I have created several custom functions to assist with data prep. Load these functions by sourcing the `utils.R` script in the working directory.
```{r custom-functions}
# Custom functions
source("utils.R")
```
# Read data
## Info
This file contains descriptive fields relating to primary and secondary schools across all local authorities in England. It excludes nurseries, special schools, children's centers, pupil referral units and post-16 education. We can consider this largely to be the dimensional data for the dashboard, while the other datasets for Offers and Performance can be considered to be the measures.
The source data is stored in a CSV file on [Google Drive](https://drive.google.com/drive/u/1/folders/1lFZhobbGoCKKEtaZ5CDiphTbOxcbvJzW). We can read the CSV file directly from the source location using the custom function `read_csv_gdrive()`. When reading in source data, I usually assign it to a variable with the prefix `src_` to indicate that this is source data. Once I start to modify the data, I remove the prefix. That way, I always have a copy of the original source data in its unaltered form to refer back to.
```{r info-read}
# Read in the source data for info
src_info_path <- "https://drive.google.com/file/d/1kUqKvphHnh4M-NfP4gXLAr9MxdxoWgMT/view?usp=sharing"
src_info <- read_csv_gdrive(src_info_path)
```
This data frame has `r nrow(src_info)` rows and `r ncol(src_info)` columns. The columns include some identifiers for time, geography and school, some descriptive dimensions about each school and some measures relating to the number of pupils of different types. Each row represents a single school and the data represents the latest available.
## Offers
### By Local Authority
This file contains data relating to the number of applications from offers made to applicants for secondary and primary school places since 2014. The data is aggregated at the local authority level. This file will be important as our single source of truth for local authority-level measures (e.g. the number of preferences allowed on the application form and the total number of applications received).
```{r offla-read}
# Read in the source data for offla
src_offla_path <- "https://drive.google.com/file/d/12jONhUf3xSZdRbsUE_4tZRA014sIHO_D/view?usp=sharing"
src_offla <- read_csv_gdrive(src_offla_path)
```
### By School
This file contains data relating to the number of offers made to applicants for secondary and primary school places since 2014, and the proportion which received preferred offers. Data is aggregated at the school level. Further details including an explanation of the terminology used in this dataset can be found [here](https://drive.google.com/drive/u/1/folders/1bbnbd9d4HDn3yXIIhaRSigZV46KpCyKy). Once again, we can connect to it through Google Drive.
```{r offers-read}
# Read in the source data for offers
src_offers_path <- "https://drive.google.com/file/d/1bG0a0WRg-LEnASl6M3J0qyHJA_cMY1CQ/view?usp=sharing"
src_offers <- read_csv_gdrive(src_offers_path)
```
From the code outputs above, we can see that this data frame has `r nrow(src_offers)` rows and `r ncol(src_offers)` columns (a mix of character and numeric data types). The columns include some identifiers for time, geography and school, some descriptive dimensions for the number of preferences and type of school and some measures relating to the number of applications and offers made. Each row represents a single school for a single academic year.
## Performance
These files contain key stage 2 (KS2) performance data for primary schools and key stage 4 (KS4) performance data for secondary schools in England for academic years from 2014 onwards. Note that due to the COVID-19 pandemic, the UK Government cancelled all statutory national curriculum assessments due to be held in summer 2020 and 2021 at all Key Stages. Therefore, no data is available for the 2019/20 or 2020/21 academic years. Further details including an explanation of the terminology used in this dataset can be found [here](https://drive.google.com/drive/folders/1vai66CUaYhPI0RXQI-BlC5kufZZyf3JU?usp=sharing).
The source data for performance is stored in several CSV files on the Google Drive. Having read through the guidance notes for these files, for the measures I want to compare, I can find these in all years from 2015/16 onwards.
```{r performance-read, message=FALSE}
# Read in the performance data files from 2015/16 onwards
# 2015-16 KS2
src_perf_1516ks2_path <- "https://drive.google.com/file/d/1migtFWfMUr8HJDeujfw6h7p2nNDVsbp4/view?usp=sharing"
src_perf_1516ks2 <- read_csv_gdrive(src_perf_1516ks2_path)
# 2015-16 KS4
src_perf_1516ks4_path <- "https://drive.google.com/file/d/1ijl88VsaaNdEC-N2O9hcRKc2sixcMyES/view?usp=sharing"
src_perf_1516ks4 <- read_csv_gdrive(src_perf_1516ks4_path)
# 2016-17 KS2
src_perf_1617ks2_path <- "https://drive.google.com/file/d/1B1LpH8RXlBoH7C9mJPL4FribTvIUzekq/view?usp=sharing"
src_perf_1617ks2 <- read_csv_gdrive(src_perf_1617ks2_path)
# 2016-17 KS4
src_perf_1617ks4_path <- "https://drive.google.com/file/d/1jOOmuKXDDp--cbESCz-j4Vp7ufVK3tPV/view?usp=sharing"
src_perf_1617ks4 <- read_csv_gdrive(src_perf_1617ks4_path)
# 2017-18 KS2
src_perf_1718ks2_path <- "https://drive.google.com/file/d/1zJxcJrsOr7nvWwb6Yyg48QoMgSp76e8V/view?usp=sharing"
src_perf_1718ks2 <- read_csv_gdrive(src_perf_1718ks2_path)
# 2017-18 KS4
src_perf_1718ks4_path <- "https://drive.google.com/file/d/1-5WDJ5dZulWmqQa8fX5eczA0Bo2m_jEc/view?usp=sharing"
src_perf_1718ks4 <- read_csv_gdrive(src_perf_1718ks4_path)
# 2018-19 KS2
src_perf_1819ks2_path <- "https://drive.google.com/file/d/1CmXvDtDOmIlXXUzZ9SPW4Yoti-rTzi75/view?usp=sharing"
src_perf_1819ks2 <- read_csv_gdrive(src_perf_1819ks2_path)
# 2018-19 KS4
src_perf_1819ks4_path <- "https://drive.google.com/file/d/1uBhU0yeuV97d66PRR60pq8UaRWvdGIaA/view?usp=sharing"
src_perf_1819ks4 <- read_csv_gdrive(src_perf_1819ks4_path)
```
These files contain a large number of columns. We will need to pick out only the most relevant for reporting, which can be standardised across all schools.
## Phase
This dataset includes details about the educational phase. Since there are some schools in England that offer both primary and secondary education, we will need a way to distinguish the phases in the fact table.
```{r phase-read}
# Read in the source data for phase
src_phase_path <- "https://drive.google.com/file/d/1I9Ibo-hG_VzXvzNgOVMi6PPYfSCuJzgG/view?usp=sharing"
src_phase <- read_csv_gdrive(src_phase_path)
```
This dataset has `r nrow(src_phase)` rows and `r ncol(src_phase)` columns and contains largely descriptive information about the educational phases in England.
## Calendar
This dataset includes details about the academic years covered by the data.
```{r calendar-read}
# Read in the source data for calendar
src_calendar_path <- "https://drive.google.com/file/d/1el2cL02nQidzN1pMcZ7AUb3h6wXuzN_E/view?usp=sharing"
src_calendar <- read_csv_gdrive(src_calendar_path)
```
This will be our date table. Since the lowest unit of time in the data is year, this is the level of detail of the calendar table. This table has `r nrow(src_calendar)` rows and `r ncol(src_calendar)` columns.
# Clean data
## Info
One key piece of cleaning we need to do with this dataset is to provide a mechanism to group entities that describe the same establishment (i.e. where there has been a change in URN over time due to conversion to academy). We can do this using the `links` columns at the end of this data frame, along with the `urn`, to create a bridge table, which we'll call `brg_school`.
The links I am interested in are the predecessor-successor links so that I can keep schools grouped over time. I want the successor school to be retained as the master school and all predecessors to be linked to the master. If a school has no links then we simply list the master and linked URNs as the same number. Therefore, I need to keep all of the open schools as the master list and any predecessors need to be linked to those master URNs.
```{r bridge-school}
# Set up the bridge table
brg_school <- src_info %>%
# keep only currently open schools
filter(grepl("^Open.*", establishment_status_name)) %>%
# select required columns
select(master_urn = urn,
status = establishment_status_name,
starts_with("link_")) %>%
# pivot the links
pivot_longer(!c(master_urn, status),
values_drop_na = TRUE) %>%
# drop the name column
select(-name) %>%
# drop the non-useful link types
mutate(value = case_when(
value == "Does not have links" ~ value,
grepl(".*Predecessor.*", value) ~ value,
TRUE ~ "No useful links"
)) %>%
# split the description column into useful data
mutate(linked_urn = as.numeric(str_extract(value,"\\d+")),
link_type = str_trim(str_extract(value,"\\D+"))) %>%
# drop the value column
select(-value) %>%
# add self-urn
mutate(self_urn = master_urn) %>%
pivot_longer(cols = c(linked_urn, self_urn),
names_to = "urn_type",
values_to = "linked_urn",
values_drop_na = TRUE) %>%
# select required columns
select(master_urn,
linked_urn) %>%
distinct()
```
Now that the bridge table is set up I can filter the `src_info` table to include only the currently open schools. The number of distinct URNs in this table should then match the number of distinct master URNs in `brg_school`.
```{r info-filter}
# Keep only the currently open schools in info
info <- src_info %>%
filter(grepl("^Open.*",
establishment_status_name))
```
```{r info-urn-count-check}
# Check if the number of distinct URNs in info matches the number of
# distinct master URNs in brg_school
n_distinct(info$urn) == n_distinct(brg_school$master_urn)
```
The numbers match. Success!
Next, let's keep only the columns we need from `src_info`.
```{r info-select-cols}
# Keep only the required columns from src_info
info <- info %>%
select(urn:establishment_status_name,
open_date,
close_date,
phase_of_education_name:statutory_high_age,
gender_name,
religious_character_name,
admissions_policy_name,
school_capacity,
number_of_pupils:percentage_fsm,
ofsted_last_insp:head_preferred_job_title,
gor_name:lsoa_name,
ofsted_rating_name,
msoa_code:fsm)
```
Now we can rename the columns more appropriately.
```{r info-rename}
# Rename columns in info
info <- info %>%
rename(establishment_type = type_of_establishment_name,
establishment_type_group = establishment_type_group_name,
establishment_status = establishment_status_name,
education_phase = phase_of_education_name,
gender = gender_name,
religious_character = religious_character_name,
admissions_policy = admissions_policy_name,
county = county_name,
head_title = head_title_name,
gor = gor_name,
administrative_district = district_administrative_name,
administrative_ward = administrative_ward_name,
parliamentary_constituency = parliamentary_constituency_name,
urban_rural_indicator = urban_rural_name,
gss_la_code = gssla_code_name,
msoa = msoa_name,
lsoa = lsoa_name,
ofsted_rating = ofsted_rating_name,
number_of_fsm = fsm)
```
And convert the columns to the correct data types.
```{r info-type}
# Define the columns by type
info_date <- c("open_date",
"close_date",
"ofsted_last_insp",
"last_changed_date")
# Correct the data types in info
info <- info %>%
mutate(across(all_of(info_date), dmy)) %>%
mutate(telephone_num = as.character(telephone_num),
telephone_num = paste0("0", telephone_num))
```
Let's now review the selected column headings in this dataset and the number of unique values in each column
```{r info-unique-values}
# Print a list of the column headings in the info df
info %>%
summarise_all(n_distinct) %>%
transpose(keep.names = "field") %>%
rename(unique_values = V1)
```
The `la_name` and `la_code` fields need to be checked to ensure they are consistent. I refer to the [Office for National Statistics](https://www.ons.gov.uk) [Local Authority Districts (December 2021) Names and Codes in the United Kingdom](https://geoportal.statistics.gov.uk/documents/c4f647d8a4a648d7b4a1ebf057f8aaa3/about) document for the latest reference data for these fields. This document is included in the Google Drive [ref folder](https://drive.google.com/drive/folders/1lsRDMfjSzabSF8HVJdF8ex5JM6rDRlSo?usp=sharing).
Let's review the unique values of LA code and name in the `info` df.
```{r info-review-la-names}
# Review the unique values in la_name
info %>%
select(la_code, gss_la_code, la_name) %>%
distinct() %>%
arrange(la_name, la_code, gss_la_code)
```
We can see that some of the `gss-la-code` values here are wrong. It looks like a value of `X999999` has been used where the true value could not be determined when the source data was put together. We need to replace this value with the correct value according to the reference table.
```{r info-clean-gss-code}
# Correct the erroneous gss_la_code fields
info <- info %>%
group_by(la_code, la_name) %>%
mutate(gss_la_code = min(gss_la_code)) %>%
ungroup()
# Check the results
info_la <- info %>%
select(la_code, gss_la_code, la_name) %>%
distinct() %>%
arrange(la_name, la_code, gss_la_code)
```
Some columns have relatively few unique values, making these columns clearly categorical. I'd like to review the values to ensure they are consistent and well labelled.
```{r info-review-establishment-type}
# Review the unique values in establishment type
unique(info$establishment_type)
```
These values all look ok. [This page](https://www.leicester.gov.uk/schools-and-learning/school-and-colleges/school-admissions/understanding-the-different-types-of-school/) has some useful definitions of the different types of school.
```{r info-review-establishment-group}
# Review the unique values in establishment group
unique(info$establishment_type_group)
```
All ok.
```{r info-review-status}
# Review the unique values in establishment status
unique(info$establishment_status)
```
All ok.
```{r info-review-phase}
# Review the unique values in phase of education
unique(info$education_phase)
```
All ok.
```{r info-review-gender}
# Review the unique values in gender
unique(info$gender)
```
Some missing values here. Let's replace those with 'Not reported'.
```{r info-recode-gender}
# Convert missing values in gender
info$gender[is.na(info$gender)] <- "Not reported"
# Check the results
unique(info$gender)
```
All ok.
```{r info-review-religious-character}
# Review the unique values in religious character
unique(info$religious_character)
```
Again, let's fix the missing values.
```{r info-recode-religious-character}
# Convert missing values in religious character
info$religious_character[is.na(info$religious_character)] <-
"Not reported"
# Check the results
unique(info$religious_character)
```
All ok.
```{r info-review-admissions-policy}
# Review the unique values in admissions policy
unique(info$admissions_policy)
```
Again let's fix the missing values.
```{r info-recode-admissions-policy}
# Convert missing values in admissions policy
info$admissions_policy[is.na(info$admissions_policy)] <-
"Not reported"
# Check the results
unique(info$admissions_policy)
```
All ok.
```{r info-review-ofsted}
# Review the unique values in ofsted rating
unique(info$ofsted_rating)
```
Some recoding is necessary here. According to [Ofsted](https://www.gov.uk/government/publications/education-inspection-framework/education-inspection-framework), school inspections use a 4-point grading scale:
- grade 1 - outstanding
- grade 2 - good
- grade 3 - requires improvement
- grade 4 - inadequate
In our dataset, I see some other values of `Serious Weaknesses` and `Special Measures`, which are not on the ratings list provided by Ofsted. In fact, both of these ratings are a [subset](https://www.gov.uk/government/publications/school-inspections-a-guide-for-parents/school-inspections-a-guide-for-parents) of the "inadequate" (grade 4) rating. Therefore, I can recode both of these values as `inadequate`. I can also add in the grade points to the ratings dimension table to provide a quantitative and aggregatable measure.
```{r info-recode-ofsted}
# Fix values for ofsted rating
info <- info %>%
mutate(ofsted_rating = case_when(
grepl("Weakness", ofsted_rating) ~ "Inadequate",
grepl("Measures", ofsted_rating) ~ "Inadequate",
is.na(ofsted_rating) ~ "Not reported",
TRUE ~ ofsted_rating
),
ofsted_rating_score = case_when(
grepl("Outstanding", ofsted_rating) ~ 1,
grepl("Good", ofsted_rating) ~ 2,
grepl("improvement", ofsted_rating) ~ 3,
grepl("Inadequate", ofsted_rating) ~ 4
))
# Check results
unique(info$ofsted_rating)
unique(info$ofsted_rating_score)
```
## Offers
### By Local Authority
From the `src_offla` dataset we need to keep the identifiers for local authority, year and phase as well as the numerical data. We should filter the source data to keep only data at the `geographic_level` of "Local authority".
```{r offla-select-cols}
# Select the required columns from offla and correct data types
offla <- src_offla %>%
mutate(across(everything(), as.character)) %>%
filter(geographic_level == "Local authority") %>%
select(-c(nc_year_admission,
time_identifier,
geographic_level,
country_code,
country_name,
one_of_the_three_preference_offers,
ends_with("_percent"))) %>%
type_convert(na = "c")
```
Now we can rename the columns more appropriately.
```{r offla-rename}
# Rename the columns
offla <- offla %>%
rename(year = time_period,
gss_la_code = new_la_code,
la_code = old_la_code,
phase = school_phase,
number_of_preferences = no_of_preferences)
```
Now let's review the entries for the local authority identifiers and check which rows in `offla` are not present in `info_la` and therefore need to be changed.
```{r offla-la-names}
# Review the unique values of la_name
offla %>%
select(la_code, gss_la_code, la_name) %>%
distinct() %>%
arrange(la_name, la_code) %>%
anti_join(info_la)
```
Let's make the necessary replacements.
```{r offla-recode-la}
# Recode the LA identifier fields in offla
offla <- offla %>%
mutate(
la_code = case_when(
la_code == 837 ~ 839,
la_code == 835 ~ 838,
la_code == 928 ~ 940,
la_code == 836 ~ 839,
TRUE ~ la_code
),
gss_la_code = case_when(
gss_la_code == "E06000028" ~ "E06000058",
gss_la_code == "E10000002" ~ "E06000060",
gss_la_code == "E10000009" ~ "E06000059",
gss_la_code == "E08000020" ~ "E08000037",
gss_la_code == "E10000021" ~ "E06000061",
gss_la_code == "E06000048" ~ "E06000057",
gss_la_code == "E06000029" ~ "E06000058",
TRUE ~ gss_la_code
),
la_name = case_when(
la_name == "Bournemouth" ~ "Bournemouth, Christchurch and Poole",
la_name == "Northamptonshire" ~ "North Northamptonshire",
la_name == "Poole" ~ "Bournemouth, Christchurch and Poole",
TRUE ~ la_name
)
)
# Extra rows because Northamptonshire is now two distinct regions
offla_extra <- tibble(
year = c(202122, 202122, 202021, 202021, 201920, 201920, 201819, 201819,
201718, 201718, 201617, 201617, 201516, 201516, 201415, 201415),
region_code = rep("E12000004", 16),
region_name = rep("East Midlands", 16),
gss_la_code = rep("E06000062", 16),
la_code = rep(941, 16),
la_name = rep("West Northamptonshire", 16),
phase = rep(c("Primary", "Secondary"), 8),
number_of_preferences = rep(3,16)
)
# Bind extra rows on to prefs
offla <- offla %>%
bind_rows(offla_extra) %>%
group_by_at(vars(-number_of_preferences)) %>%
slice_max(number_of_preferences) %>%
ungroup() %>%
distinct()
# Check the results
offla_la <- offla %>%
select(la_code, gss_la_code, la_name) %>%
distinct() %>%
arrange(la_name, la_code)
offla_la %>%
anti_join(info_la)
```
```{r offla-review-year}
# Review the unique values of year
unique(offla$year)
```
These values should be recoded to match the `year_key` in the calendar table. We need to extract the academic start year, which is the first 4 digits of the value.
```{r offla-recode-year}
# Recode the values in year
offla$year <- as.numeric(str_extract(offla$year, "^\\d{4}"))
# Check the results
unique(offla$year)
```
### By School
Let's keep only the columns we need from `src_offers`. Some of the columns are duplicates of columns already in the `info` table, and some are just not required.
```{r offers-select-cols}
# Select required columns from offers
offers <- src_offers %>%
select(time_period,
region_code,
region_name,
old_la_code,
new_la_code,
la_name,
total_number_places_offered:offers_to_applicants_from_another_la,
denomination,
school_urn,
entry_year)
```
Next we can rename the columns more appropriately.
```{r offers-rename}
# Rename the columns
offers <- offers %>%
rename(year = time_period,
la_code = old_la_code,
gss_la_code = new_la_code,
religious_denomination = denomination,
urn = school_urn,
phase = entry_year)
```
Next we can filter out any records where the `school_urn` is `n/a` and convert the remaining values to numeric.
```{r offers-filter}
# Remove entries with null URN
offers <- offers %>%
filter(urn != "n/a") %>%
mutate(urn = as.numeric(urn))
```
Now we can check and correct the data types.
```{r offers-type}
# Define cols to be converted
offers_num <- c("proportion_1stprefs_v_1stprefoffers",
"proportion_1stprefs_v_totaloffers",
"offers_to_applicants_from_another_la")
# Correct data types in offers
offers <- offers %>%
mutate(across(all_of(offers_num), as.numeric))
```
Let's now review the selected column headings in this dataset and the number of unique values in each column.
```{r offers-unique-values}
# Print a list of the column headings in the offers df
offers %>%
summarise_all(n_distinct) %>%
transpose(keep.names = "field") %>%
rename(unique_values = V1)
```
Let's review the categorical values to ensure they are consistent and well labelled.
```{r offers-review-la-name}
# Review the unique values of la_name
offers %>%
select(la_code, gss_la_code, la_name) %>%
distinct() %>%
arrange(la_name, la_code)
```
Some of these entries need to be corrected to match the values in `info`.
```{r offers-clean-la-name}
offers_la <- offers %>%
select(urn,
year,
la_code,
gss_la_code,
la_name,
region_code,
region_name) %>%
distinct() %>%
arrange(urn,
-year,
la_code,
la_name) %>%
group_by(urn) %>%
slice_head(n = 1) %>%
ungroup() %>%
inner_join(brg_school,
by = c("urn" = "linked_urn")) %>%
left_join(select(info,
master_urn = urn,
master_la_code = la_code,
master_gss_la_code = gss_la_code,
master_la_name = la_name)) %>%
mutate(la_code = coalesce(master_la_code, la_code),
gss_la_code = coalesce(master_gss_la_code, gss_la_code),
la_name = coalesce(master_la_name, la_name)) %>%
select(urn,
master_urn,
la_code,
gss_la_code,
la_name,
region_code,
region_name) %>%
distinct()
```
```{r offers-update-la-name}
offers <- offers %>%
select(-c(region_code,
region_name,
la_code,
gss_la_code,
la_name)) %>%
inner_join(offers_la)
# Check the results
offers_la <- offers %>%
select(la_code, gss_la_code, la_name) %>%
distinct() %>%
arrange(la_name, la_code)
```
```{r offers-review-year}
# Review the unique values of year
unique(offers$year)
```
These values should be recoded to match the `year_key` in the calendar table. We need to extract the academic start year, which is the first 4 digits of the value.
```{r offers-recode-year}
# Recode the values in year
offers$year <- as.numeric(str_extract(offers$year, "^\\d{4}"))
# Check the results
unique(offers$year)
```
All ok.
```{r offers-review-denomination}
# Review the unique values in denomination
unique(offers$religious_denomination)
```
There are some values of `n/a` here. Let's see what those are.
```{r offers-denomination-na}
# Review the n/a values in denomination
offers %>%
filter(religious_denomination == "n/a") %>%
head()
```
I guess this must be for schools who did not report their denomination. Let's replace `n/a` with `Not reported`.
```{r offers-denomination-replace-na}
# Replace n/a
offers$religious_denomination[offers$religious_denomination == "n/a"] <- "Not reported"
# Check the results
unique(offers$religious_denomination)
```
All ok.
```{r offers-review-phase}
# Review the unique values in phase
unique(offers$phase)
```
Here we need to make some changes so that this column can be used to refer to the educational phase of the data. We need to recode these values to match the phases in our phase table (primary or secondary). The value `R` corresponds to primary and `7` and `9` both correspond to secondary.
```{r offers-record-phase}
# Recode the phase values
offers <- offers %>%
mutate(phase = case_when(
phase == "R" ~ "Primary",
phase %in% c("7", "9") ~ "Secondary",
TRUE ~ phase
))
# Check the results
unique(offers$phase)
```
## Performance
From this dataset we really need to select one or two good measures of performance for KS2 and KS4, along with the URN so that we can join back to school.
- For KS2 we'll used the progress measures `readprog`, `writprog` and `matprog`. [This article](https://www.theschoolrun.com/understanding-primary-school-league-tables) has a good explanation of the KS2 progress scores, and [this article](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/851788/KS2_progress_banding_calculations_bandings_2019.pdf) explains how the bandings can be calculated.
- For KS4 we'll use the progress 8 and attainment 8 measures `p8mea` and `att8scr`. [This article](https://www.goodschoolsguide.co.uk/curricula-and-exams/progress-8-attainment-8) has a good explanation of these measures.
We will further need to filter both the KS2 and KS4 data to include only rows that are aggregated at the mainstream school level (`RECTYPE = 1`).
```{r perf-ks2-clean}
# Clean the KS2 data
# Bind rows from all years
perf_ks2 <- bind_rows(
"2015" = src_perf_1516ks2,
"2016" = src_perf_1617ks2,
"2017" = src_perf_1718ks2,
"2018" = src_perf_1819ks2,
.id = "year"
) %>%
# Filter to rectype == 1
filter(rectype == 1) %>%
# Select the required columns
select(year,
urn,
readprog,
writprog,
matprog) %>%
# convert to numeric (any strings can be converted to NA)
mutate(across(everything(), as.numeric)) %>%
# add a column for phase
mutate(phase = "Primary")
# View the head
head(perf_ks2)
```
```{r perf-ks4-clean}
# Bind rows from all years
perf_ks4 <- bind_rows(
"2015" = select(src_perf_1516ks4, rectype, urn, p8mea, att8scr),
"2016" = select(src_perf_1617ks4, rectype, urn, p8mea, att8scr),
"2017" = select(src_perf_1718ks4, rectype, urn, p8mea, att8scr),
"2018" = select(src_perf_1819ks4, rectype, urn, p8mea, att8scr),
.id = "year"
) %>%
# Filter to rectype == 1
filter(rectype == 1) %>%
# Select the required columns
select(year,
urn,
p8mea,
att8scr) %>%
# convert to numeric (any strings can be converted to NA)
mutate(across(everything(), as.numeric)) %>%
# add the phase
mutate(phase = "Secondary")
# View the head
head(perf_ks4)
```
Now we can join the KS2 and KS4 data together and recode the progress measure names.
```{r performance-combined}
# Combine two data frames
performance <- perf_ks2 %>%
full_join(perf_ks4) %>%
rename(reading_progress = readprog,
writing_progress = writprog,
maths_progress = matprog,
progress_8 = p8mea,
attainment_8 = att8scr) %>%
left_join(select(info,
urn,
la_code,
la_name))
# View the head
head(performance)
```
## Phase
No cleaning required. We can simply rename without the `src_` prefix.
```{r phase-clean}
# Rename the variable
phase <- src_phase
```
## Calendar
No cleaning required. We can simply rename without the `src_` prefix.
```{r calendar-clean}
# Rename the variable
calendar <- src_calendar
```
# Model data
The matrix below shows the fact and dimension tables that I intend to create for this model and the relationships between them.
| Fact/Dimension | dim_la | dim_school | dim_calendar | dim_phase |
|------------------|:------:|:----------:|:------------:|:---------:|
| fact_la_year | x | | x | x |
| fact_school_year | x | x | x | x |
| fact_school | x | x | | |
: Fact and dimension tables
To build out the dimensional model for this data, we need to review all of the columns we intend to use, so that we can determine where they fit in the model. The custom function `describe_df()` contained in the `utils.R` file can help us with this.
```{r describe-df}
# Create a vector of dfs to describe
data <- list("info" = info,
"offers_by_la" = offla,
"offers" = offers,
"performance" = performance,
"phase" = phase,
"calendar" = calendar)
# Run describe_df() over each df
t_data <- lapply(data, describe_df)
# Bind each df into a single df
t_data <- bind_rows(t_data, .id = "source_table")
```
After exporting the new `t_data` dataframe to a spreadsheet, I created a file named [star-schema](https://docs.google.com/spreadsheets/d/1rrlhlvFrnPpTQtn2YzWyJZzwlm-2wWAzDiEv6fKBoHU/edit?usp=sharing) which maps the raw source data to the corresponding dimension and fact tables for the star schema model.
```{r read-star-schema-planning}
# Read in star schema planning document
star_schema_path <- "https://docs.google.com/spreadsheets/d/1rrlhlvFrnPpTQtn2YzWyJZzwlm-2wWAzDiEv6fKBoHU/edit?usp=sharing"
star_schema <- read_sheet(star_schema_path,
col_types = "c")
```
## Dimensions
### Local Authority
The following columns are required for the local authority dimension table:
```{r dim-la-cols}
# Which columns are needed for the local authority dimension?
star_schema %>%
filter(model_table == "dim_la") %>%
kable()
```
These columns can all come from the `offers` and `info` data frames.
```{r dim-la-build}
# Build out the local authority dimension table
dim_la <- offers %>%
select(la_code,
gss_la_code,
la_name,
region_code,
region_name) %>%
distinct() %>%
full_join(info_la) %>%
mutate(region_code = case_when(la_code == 420 ~ "E12000009",
TRUE ~ region_code),
region_name = case_when(la_code == 420 ~ "South West",
TRUE ~ region_name))
# Check the result
head(dim_la)
```
Let's assign a unique key to each record, and add a record for the `null` case, which we'll call `Not reported`.
```{r dim-la-key}
# Specify the null value
dim_la_null <- tibble(la_key = 0,
la_code = 0,
la_name = "Not reported",
gss_la_code = "Not reported",
region_code = "Not reported",
region_name = "Not reported")
# Assign a key and row for null values
dim_la <- dim_la %>%
rowid_to_column(var = "la_key") %>%
bind_rows(dim_la_null) %>%
arrange(la_key)
```
### School
The following fields are required for the school dimension.
```{r dim-school-cols}
# Which columns are needed for the school dimension?
star_schema %>%
filter(model_table == "dim_school") %>%
kable()
```
These columns can all be found in the `info` and `offers` tables. We need to ensure that we include only the URNs in the `master_urn` column of our bridge table.
```{r dim-school-build}
# Define columns to include
dim_school_cols_info <- star_schema %>%
filter(model_table == "dim_school",
source_table == "info") %>%
pull(source_field)
dim_school_cols_offers <- star_schema %>%
filter(model_table == "dim_school",
source_table == "offers") %>%
pull(source_field)
# Define which rows from offers should be joined (as there can be more than
# one record per URN)
dim_school_offers <- offers %>%
select(all_of(dim_school_cols_offers),
year,
phase) %>%
distinct() %>%
group_by(urn) %>%
slice_max(year) %>%
slice_max(phase) %>%
ungroup()
# Build out the school dimension table
dim_school <- brg_school %>%
select(urn = master_urn) %>%
distinct() %>%
left_join(select(info,
all_of(dim_school_cols_info))) %>%
left_join(select(dim_school_offers,
all_of(dim_school_cols_offers)))
```
Let's assign a unique key to each record, and add a record for the `null` case, which we'll call `Not reported`.
```{r dim-school-key}
# Specify the null value
dim_school_null <- tibble(school_key = 0,
urn = 0,
establishment_number = 0,
establishment_name = "Not reported",
establishment_type = "Not reported",
establishment_type_group = "Not reported",
establishment_status = "Not reported",
open_date = NA,
close_date = NA,
education_phase = "Not reported",
statutory_low_age = NA,