-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPennsylvania Mathematics Assessments 2011-2012.rmd
514 lines (406 loc) · 19.3 KB
/
Pennsylvania Mathematics Assessments 2011-2012.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
EDA of Pennsylvania Mathematics Assessments 2011-2012 by Barry Davis
========================================================
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
# Load all of the packages that you end up using in your analysis in this code
# chunk.
# Notice that the parameter "echo" was set to FALSE for this code chunk. This
# prevents the code from displaying in the knitted HTML output. You should set
# echo=FALSE for all code chunks in your file, unless it makes sense for your
# report to show the code that generated a particular plot.
# The other parameters for "message" and "warning" should also be set to FALSE
# for other code chunks once you have verified that each plot comes out as you
# want it to. This will clean up the flow of your report.
# Environment
library(tidyverse)
library(gridExtra)
library(psych)
library(scales)
library(ggthemes)
library(GGally)
knitr::opts_chunk$set(echo=FALSE)
```
```{r echo=FALSE, Load_the_Data}
#Get the data
math <- read.csv("data/math.csv", stringsAsFactors = FALSE)
```
# Overview
This report explores a dataset containing school district level data on the
academic achievement results of state assessments in mathematics. We will be
focusing on Pennsylvania.
[Source](https://catalog.data.gov/dataset/consolidated-state-performance-report-201112/resource/4da8f9de-d265-4e2f-b321-f180052df3f7)
# Univariate Plots Section
```{r echo=FALSE, Univariate_Plots}
# Limit data to PA
math.pa <- math %>% filter(STNAM=="PENNSYLVANIA")
# Str
str(math.pa)
```
We'll need to reshape this data to work with it. Many of the columns represent a
grade level and race (or collection of races) of the students. We will keep only
the records that represent all students, not divided by race. Then we will reshape
it for easier comparisons.
```{r warning=FALSE}
# Keep "ALL" as to only look at districts, not race.
math.pa <- math.pa %>% select(leanm11,starts_with("ALL"))
# Rename
names(math.pa) <- c("district","all_num","all_pct_m","3_num_m","3_pct_m",
"4_num_m","4_pct_m","5_num_m","5_pct_m","6_num_m","6_pct_m","7_num_m",
"7_pct_m","8_num_m","8_pct_m","hs_num_m","hs_pct_m")
# Lowercase for easier comparisons
math.pa$district <- tolower(math.pa$district)
# Split data to stitch together later
math.pa.percent <- math.pa[,c(1,3,5,7,9,11,13,15,17)]
math.pa.num <- math.pa[,c(1,2,4,6,8,10,12,14,16)]
# Rename columns for easy merging
names(math.pa.percent) <- c("district", "all", "third", "fourth", "fifth", "sixth",
"seventh", "eighth", "high school")
names(math.pa.num) <- c("district", "all", "third", "fourth", "fifth", "sixth",
"seventh", "eighth", "high school")
# Reshape the data
math.pa.percent <- gather(math.pa.percent, "grade", "percent", c(2:9))
math.pa.num <- gather(math.pa.num, "grade", "num_students", c(2:9))
# Merge everything together
pa.math <- merge(math.pa.percent, math.pa.num, by = c("district", "grade"), all.x=TRUE)
# Clean up some un-needed space
rm(math.pa.num, math.pa.percent)
# Get rid of identifiers
pa.math$percent <- gsub("GE", "", pa.math$percent)
pa.math$percent <- gsub("LE", "", pa.math$percent)
# Split up percent into high and low columns incase of ranges.
pa.math$low <- NA
pa.math$high <- NA
pa.math <- pa.math %>% separate(percent, c("low", "high"))
pa.math[pa.math$low=="",]$low <- NA
pa.math[is.na(pa.math$high),]$high <- pa.math[is.na(pa.math$high),]$low
pa.math$low <- as.numeric(pa.math$low)
pa.math$high <- as.numeric(pa.math$high)
# Add an average for the range
pa.math$avg_pct <- (pa.math$low + pa.math$high)/2
# Factors for grade levels
pa.math$grade <- as.factor(pa.math$grade)
# Summary
summary(pa.math)
```
The data was reshaped so we now have 3992 rows and 6 variables. Each observation
represents a grade level from a school district. We have the number of students
who were assessed, the low and high scores (for schools that report in ranges),
and the average score of the range.
```{r warning=FALSE}
# Distribution of low scores
ggplot(pa.math, aes(x=low)) +
geom_histogram(binwidth=5) +
scale_x_continuous(breaks=seq(0,100,10))
```
We look at the scores including the low value of those with ranges here. We can
see that the distribution is left skewed and peaks around 80%. It's also
notable how it steeply drops off just after 90%.
```{r warning=FALSE}
# Distribution of high scores
ggplot(pa.math, aes(x=high)) +
geom_histogram(binwidth=5) +
scale_x_continuous(breaks=seq(0,100,10))
```
These are the scores including the high value of those with ranges. I expected
the distribution to not be dramatically different than the previous plot. I do
see that the steep dropoff just after 90% is gone here due to using the higher
value of the ranges. We also peak much closer to 90% than previously.
```{r warning=FALSE}
# Distribution of average scores
ggplot(pa.math, aes(x=avg_pct)) +
geom_histogram(binwidth=5) +
scale_x_continuous(breaks=seq(0,100,10))
```
This is probably our best representation of the scores, as we are using the
individual scores plus the mean of any available ranges. We peak around 85%, but
see that steep dropoff again just after 90%.
```{r}
tapply(pa.math$avg_pct, pa.math$grade, summary)
```
These summarize the avg_pct variable by grade level. We can see that out high
school assessments are our worst, and our fourth grade assessments are our best.
The IQR of high school scores is very low too. It would be interesting to see
what other data sources could help explain this.
```{r warning=FALSE, message=FALSE}
# Distribution of number of students
ggplot(pa.math, aes(x=num_students)) +
geom_histogram()
```
We will need to remove some outliers from this.
```{r}
# Better bins
ggplot(pa.math, aes(x=num_students)) +
geom_histogram(binwidth=50) +
coord_cartesian(xlim=c(0,15000))
```
While we've zoomed in quite a bit, we still don't have a great representation
of our data. Let's apply a log10 scale.
```{r warning=FALSE}
# Better bins
ggplot(pa.math, aes(x=num_students)) +
geom_histogram() +
scale_x_continuous(trans = "log10",
breaks=trans_breaks("log10", function(x) 10^x),
labels=trans_format("log10", math_format(10^.x)))
```
Here's a better view of our distribution of the number of students in each grade
level for each district.
# Univariate Analysis
### What is the structure of your dataset?
There are `r unique(pa.math$district) %>% length()` school districts. Each district
has `r unique(pa.math$grade) %>% length()` grade levels that were tested, giving
us `r nrow(pa.math)` total rows or observations. The variable grade is a factor
with the following levels:
* `r levels(pa.math$grade)`
### What is/are the main feature(s) of interest in your dataset?
The main features here are average percent proficiency and grade.
### What other features in the dataset do you think will help support your \
investigation into your feature(s) of interest?
I believe district will be a factor in scores. I think we will see scores that
are consistent within a district.
### Did you create any new variables from existing variables in the dataset?
I created the grade, low, high, num_students, and avg_pct variables. The grade
and num_students were created by reshaping the data and condensing the columns
into row values. Some of these scores were ranges, so I split the range into a
low and high score, then calculated the range average for avg_pct.
### Of the features you investigated, were there any unusual distributions? \
Did you perform any operations on the data to tidy, adjust, or change the form \
of the data? If so, why did you do this?
The number of students distribution was skewed to the right as a few schools had
a very large number of students. I created a second plot of that distribution to
get a better look at the number of students in districts under 5000 students.
# Bivariate Plots Section
```{r echo=FALSE, Bivariate_Plots, message=FALSE, warning=FALSE}
# Scatterplot matrix
ggpairs(subset(pa.math, !is.na(low) & !is.na(high)) %>% select(-district))
```
Here we see the relationship between variables in the dataset. Our low, high,
and avg_pct are highly correlated, but that's because avg_pct is derived from
those values. Most of our data is not correlated - but this makes sense. We would
not expect grade level to have a strong relationship with scores.
```{r warning=FALSE}
# Do we see any pattern from 5000ft view?
ggplot(pa.math, aes(x=avg_pct, y=num_students)) +
geom_point(alpha=.5)
```
Let's remove the outlier and get a better look.
```{r}
# Same plot, removed outlier
ggplot(subset(pa.math, num_students < 60000), aes(x=avg_pct, y=num_students)) +
geom_point(alpha=1/20)
```
We need to remove more outliers and zoom in on the bulk of our data.
```{r warning=FALSE}
# Same plot, removed outlier
ggplot(subset(pa.math, num_students < 5000), aes(x=avg_pct, y=num_students)) +
geom_point(alpha=1/20, position = position_jitter(width = 2)) +
scale_y_continuous(breaks=seq(0,5000,250))
```
So it doesn't appear there's a correlation with number of students and average
proficiency. Let's take another closer look, but change our view.
```{r warning=FALSE}
# Size for high schools
ggplot(pa.math %>% filter(grade=="all", num_students<2000), aes(x=num_students, y=avg_pct)) +
geom_line()
```
We can clearly see here that no pattern emerges when looking at number of students
and the average proficiency.
```{r warning=FALSE, message=FALSE}
# Size for grades
p <- pa.math %>% group_by(grade) %>%
summarise(grade_avg = mean(avg_pct, na.rm=TRUE)) %>%
ungroup()
ggplot(p, aes(x=grade, y=grade_avg)) +
geom_col() +
scale_x_discrete(limits = c("third", "fourth", "fifth", "sixth", "seventh",
"eighth", "high school", "all"))
```
Here we break down the average proficiency across each grade level for all
districts. It looks like our third, fourth, and seventh graders are the top 3
performers, followed closely by sixth grade. Again, we see the large dip in the
high school assessment scores.
# Bivariate Analysis
### Talk about some of the relationships you observed in this part of the \
investigation. How did the feature(s) of interest vary with other features in \
the dataset?
The set only has limited features, there was no variation.
### Did you observe any interesting relationships between the other features \
(not the main feature(s) of interest)?
There weren't many features outside of the main ones to be used.
### What was the strongest relationship you found?
There was no great relationship, so the strongest was number of students. However,
this was not a good indicator of proficiency.
# Multivariate Plots Section
```{r}
# Let's see the top 25% districts overall, and look for correlations in number
# of students and scores
p <- pa.math %>% filter(grade=="all", avg_pct > quantile(pa.math$avg_pct,
probs=.75, na.rm=TRUE))
p <- pa.math %>% filter(district %in% p$district)
ggplot(p, aes(x=district, y=avg_pct, size=num_students, color=grade)) +
geom_point(alpha=.5) +
theme(axis.text.y = element_text(size=7)) +
coord_flip()
```
This is the top 25% for overall proficiency average. The number of students, or
size of the school doesn't seem to correlate well with average proficiency for
the top schools.
```{r warning=FALSE}
# Let's see this by grade level, filtering down even further to less than 5000
# students
p <- pa.math %>% filter(num_students < 5000, !is.na(avg_pct))
ggplot(p, aes(x=avg_pct, y=num_students)) +
geom_point(alpha=1/10, position=position_jitter(width=2)) +
facet_wrap(~grade)
```
Here we take another look at the number of students and average proficiency and
break it down by grade level. There doesn't appear to be any strong relationship
here.
```{r}
# Random sample to look for any patterns by district and grade level
set.seed(1979)
sampleDistricts <- sample(pa.math$district, 20, replace=FALSE)
pa.math.samp <- pa.math %>% filter(district %in% sampleDistricts)
ggplot(pa.math.samp, aes(x=grade, y=avg_pct)) +
geom_col() +
facet_wrap(~district) +
scale_x_discrete(limits = c("third", "fourth", "fifth", "sixth", "seventh",
"eighth", "high school", "all")) +
theme(axis.text.x = element_text(angle=90))
```
This looks at the average proficiency by grade, but split amongst a random
sample of 20 schools. I wanted to see if and patterns would emerge with random
selection.
```{r}
# What about the top 20 overall schools?
top20 <- pa.math %>% filter(grade=="all") %>%
arrange(desc(avg_pct)) %>% head(20)
p <- pa.math %>% filter(district %in% top20$district)
ggplot(p, aes(x=grade, y=avg_pct)) +
geom_col() +
facet_wrap(~district) +
scale_x_discrete(limits = c("third", "fourth", "fifth", "sixth", "seventh",
"eighth", "high school", "all")) +
theme(axis.text.x = element_text(angle=90))
```
Since we did not see any patterns in our random sample, I take a look at the top
20 schools districts. We do see some consistency here throughout each grade level
within the districts. We also see a repeating pattern of the high school scores
being lower in almost all of the above.
```{r}
# Bottom 20 schools
bottom20 <- pa.math %>% filter(grade=="all") %>% arrange(avg_pct) %>% head(20)
p <- pa.math %>% filter(district %in% bottom20$district)
ggplot(p, aes(x=grade, y=avg_pct)) +
geom_col() +
facet_wrap(~district) +
scale_x_discrete(limits = c("third", "fourth", "fifth", "sixth", "seventh",
"eighth", "high school", "all")) +
theme(axis.text.x = element_text(angle=90))
```
Just as there could be patterns in the top schools, it's possible there are
patterns in the bottom schools, so this plot looks at the bottom 20 districts.
We do not see any pattern throughout the grade levels of these poor performing
school districts.
# Multivariate Analysis
### Talk about some of the relationships you observed in this part of the \
investigation. Were there features that strengthened each other in terms of \
looking at your feature(s) of interest?
The proficiency throughout each grade in a given district varies from school to
school. There seems to be a dip in the high school scores, which could be due to
a larger student base compared to each individual grade, the variety of math
being taught (and understood), and the length of time since using the math skills
that are assessed. We would need to know more about the assessment.
### Were there any interesting or surprising interactions between features?
No, these were fairly expected. I did think each district would have more consistency
from grade to grade. The top 20 did have very high scores across the board.
### OPTIONAL: Did you create any models with your dataset? Discuss the \
strengths and limitations of your model.
I did not create any models as it would not make sense to do so with this data.
------
# Final Plots and Summary
### Plot One
```{r echo=FALSE, Plot_One}
# Let's see the top 25% districts overall, and look for correlations in number
# of students and scores
p <- pa.math %>% filter(grade=="all", avg_pct > quantile(pa.math$avg_pct,
probs=.75, na.rm=TRUE))
p <- pa.math %>% filter(district %in% p$district)
ggplot(p, aes(x=district, y=avg_pct, size=num_students, color=grade)) +
geom_point(alpha=.5) +
theme_set(theme_grey()) +
theme(axis.text.y = element_text(size=7)) +
coord_flip() +
scale_color_brewer(palette = "Set1", name="Grade") +
scale_size(name="Number of Students") +
labs(title="Top 25% School Districts Average Proficiency",
subtitle="By Grade Level and Number of Students",
x="School DIstrict",y="Average Proficiency")
```
### Description One
There doesn't appear to be many patterns here. The high school scores are some
of the lowest and we see some grouping of fourth grade around the 95 percentile.
I used this because if a school has a high overall score, perhaps it can be
related to the number of students in their district.
### Plot Two
```{r echo=FALSE, Plot_Two, fig.width=9}
# What about the top 20 overall schools?
top20 <- pa.math %>% filter(grade=="all") %>%
arrange(desc(avg_pct)) %>% head(20)
p <- pa.math %>% filter(district %in% top20$district)
ggplot(p, aes(x=grade, y=avg_pct)) +
geom_col() +
facet_wrap(~district) +
scale_x_discrete(limits = c("third", "fourth", "fifth", "sixth", "seventh",
"eighth", "high school", "all")) +
theme_set(theme_foundation()) +
theme(axis.text.x = element_text(angle=90)) +
labs(title="Top 20 Schools Average Proficiency by Grade",
x="Grade Level", y="Average Proficiency")
```
### Description Two
The averages across all grades in these top districts are fairly consistent.
We do see the drop off in scores in the high school grade levels for most of the
districts. I wanted to see the top schools and whether they were consistent
throughout each grade level. It does not appear that one grade level is skewing
any others. This can be for a number of reasons, like hiring good teachers,
calibrating teachers on what's being taught, or a large emphasis on taking
(and studying for) assessment tests.
### Plot Three
```{r echo=FALSE, Plot_Three, fig.width=9}
# Bottom 20 schools
bottom20 <- pa.math %>% filter(grade=="all") %>% arrange(avg_pct) %>% head(20)
p <- pa.math %>% filter(district %in% bottom20$district)
ggplot(p, aes(x=grade, y=avg_pct)) +
geom_col() +
facet_wrap(~district) +
scale_x_discrete(limits = c("third", "fourth", "fifth", "sixth", "seventh",
"eighth", "high school", "all")) +
theme_set(theme_foundation()) +
theme(axis.text.x = element_text(angle=90)) +
labs(title="Bottom 20 Schools Average Proficiency by Grade",
x="Grade Level", y="Average Proficiency")
```
### Description Three
The bottom schools do not have much consistency across grade levels with a few
exceptions. Wilkinsburg Borough has a very poor proficiency for high school. If
the top schools were consistent, I wondered if the bottom schools would show any
consistency. The scores here fluctuate from grade to grade, most of which remains
below a passing level.
------
# Reflection
The scores within a school district are often less consistent across grade levels
than I originally anticipated. As someone who grew up in Pennsylvania, it is a bit
disheartening to see how poor some of our school districts are in mathematics.
There was a lot of reshaping and cleaning that needed to be done for this dataset.
I stumbled on the two applications of gather() as well as figuring out how to
separate a field if I was only interested in one of the returned values.
Future work with this dataset can look at some other information such as taxes.
It would be interesting to see how each school district performs in relation to
how much school tax they collect. This would help us see what the residents get
for their hard earned money. We may be able to look at census data as well for
median household income to see if that plays a role. Information on Facebook
would be very interesting if paired with this data. For example, we could look at
current professions and see what the graduates of these schools are doing.
Would we see more STEM professions from the top districts? What about secondary
education? How many of the graduates go on to secondary education from the top
schools versus the bottom schools?