R Code 4 Analysis.R
Consistent with my approach, I initially planned to use the following control variables divided into 3 types to predict alumni median earnings after 10 years (MD_EARN_WNE_P10).
Table 1: Control Variables Used to Predict Alumni Outcomes

| Student characteristics | Type of college | Location of college |
|---|---|---|
| Enrollment of undergraduate students (UGDS) | Predominant undergraduate degree awarded (PREDDEG) | City (CITY) |
| Share of enrolled students who are White (UGDS_WHITE) | Number of Title IV students (NUM4_PUB, NUM4_PRIV; combined as NUM4) | Region (REGION) |
| Share of enrolled students who are Black (UGDS_BLACK) | Average cost of attendance (COSTT4_A) | Locale (LOCALE) |
| Share of enrolled students who are Hispanic (UGDS_HISP) | Control of institution (CONTROL) | |
| Share of enrolled students who are Asian (UGDS_ASIAN) | Percentage of degrees awarded in several STEM fields (PCIP) | |
| Share of enrolled students who are non-resident aliens (UGDS_NRA) | Instructional expenditures per full-time equivalent student (INEXPFTE) | |
| Share of enrolled students who are women (UGDS_WOMEN) | Average faculty salary (AVGFACSAL) | |
| Average SAT score of admitted students (SAT_AVG_ALL) | Average net price for Title IV institutions (NPT4_PUB, NPT4_PRIV; combined as NPT4) | |
| Completion rate for first-time full-time students (C150_4, C150_L4; combined as C150) | | |
| Share of undergraduates who are first-time full-time (PFTFTUG1_EF) | | |
| Share of students who received a federal loan while in school (LOAN_EVER) | | |
| Share of students who received a Pell grant while in school (PELL_EVER) | | |
| Fraction of repayment cohort who are not in default, and with loan balances that have declined one year since entering repayment (RPY_1YR_RT) | | |

More information on these variables can be found in the [Full Data documentation](https://collegescorecard.ed.gov/assets/FullDataDocumentation.pdf). The [Modified DataDictionary](https://github.com/DAboaba/An-Exploration-of-Value-Added-Methodology/blob/master/Modified%20DataDictionary.xlsx) should also be consulted to get a sense of what these variables are called in the dataset.
# Importation, Preparation, and Cleaning
## Importation
I was interested in exploring the mentioned issue for the 2012/2013 academic year because it appeared to be the year with the most data available. Thus, I needed to import [this](https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data.zip) dataset.
Before beginning, install the following packages (if you have not already) and load them.
```{r Set-up, message = FALSE, warning = FALSE, tidy = TRUE}
library(readr) #Load the readr package
library(dplyr) #Load the dplyr package
library(tidyr) #Load the tidyr package
library(ggplot2) #Load the ggplot2 package
library(purrr) #Load the purrr package
```
To import the dataset:
```{r Main importation, message = FALSE, warning = FALSE, tidy = TRUE}
setwd("/Users/DMA/BOX STUFF/Columbia/Semester 2/Programming in R/Toolbox/CollegeScorecard_Raw_Data") #Set the working directory to wherever you have stored the downloaded data. Spaces in an R string need no backslash; "\ " is an invalid escape.
CS2012_13 <- read_csv("MERGED2012_13_PP.csv") #Use the read_csv() function from the readr package to import the dataset into R.
```
## Preparation
### Preparation Part 1
After importing the dataset, three main issues became apparent.
* The dataset contained far more variables than my question required.
    + This would have made it difficult to understand how to analyze the data.
* According to the cohort map, some of the variables I wanted to include had been placed in the wrong dataset. For instance, some variables for 2012/2013 were put into the 2013/2014 dataset.
    + This would have significantly limited the explanatory power of my analysis, either because I would have used outdated data or because I would have used only a few variables.
* The dataset was missing some important variables, such as CONTROL.
    + This would have significantly limited my ability to focus on subsets of the dataset that I believed were important.
To address the issue of the large number of variables in the dataset, I removed variables I was not interested in studying. This step also partially addressed the second issue because I also took out variables that were incorrectly assigned to the 2012/2013 dataset rather than the 2011/2012 dataset.
```{r Issue 1, message = FALSE, warning = FALSE, tidy = TRUE}
CS2012_13 <- select(CS2012_13, OPEID, OPEID6, INSTNM, MAIN, NUMBRANCH, CONTROL, REGION, SAT_AVG_ALL, AVGFACSAL, UGDS, UGDS_WHITE, UGDS_BLACK, UGDS_HISP, UGDS_ASIAN, UGDS_NRA, UGDS_WOMEN, PPTUG_EF, C150_4, C150_L4, PFTFTUG1_EF, RPY_1YR_RT, LOAN_EVER, PELL_EVER, MD_EARN_WNE_P10) #Dropped unnecessary variables
```
To finish addressing the second issue, and to address the third issue, I imported the 2013/2014 and 2014/2015 datasets, kept the variables I was interested in, and joined these datasets to the CS2012_13 dataset.
```{r Issues 2 & 3, message = FALSE, warning = FALSE, tidy = TRUE}
setwd("/Users/DMA/BOX STUFF/Columbia/Semester 2/Programming in R/Toolbox/CollegeScorecard_Raw_Data") #Set the working directory to wherever you have stored the downloaded data.
#Use the read_csv() function from the readr package to import the 2013/2014 and 2014/2015 datasets into R.
CS2013_14 <- read_csv("MERGED2013_14_PP.csv")
CS2014_15 <- read_csv("MERGED2014_15_PP.csv")
#Dropped unnecessary variables from both datasets.
CS2013_14 <- select(CS2013_14, OPEID, INSTNM, PREDDEG, TUITFTE, INEXPFTE, PCIP01, PCIP03, PCIP04, PCIP10, PCIP11, PCIP14, PCIP15, PCIP26, PCIP27, PCIP29, PCIP40, PCIP41, NPT4_PUB, NPT4_PRIV, NUM4_PUB, NUM4_PRIV, COSTT4_A, PCTPELL)
CS2014_15 <- select(CS2014_15, OPEID, INSTNM, CURROPER, LOCALE, HBCU, PBI, ANNHI, TRIBAL, AANAPII, HSI, NANTI, MENONLY, WOMENONLY)
#Joined the three datasets using the OPEID and INSTNM variables. Because only two can be joined at a time, joined the 2012/2013 and 2013/2014 datasets together, then joined the result of that operation to the 2014/2015 dataset.
CS2012 <- left_join(left_join(CS2012_13, CS2013_14, by = c("OPEID", "INSTNM")), CS2014_15, by = c("OPEID", "INSTNM"))
#"OPEID is the identification number used by the U.S. Department of Education to identify institutions. This is an 8-digit number that distinguishes between institutions, branches, additional locations, and other entities that are part of the eligible institution"
#Remove extra datasets
rm(CS2012_13, CS2013_14, CS2014_15)
```
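As a small illustration of how the chained `left_join()` calls behave, here is a sketch on invented data. The three toy data frames (`y1`, `y2`, `y3`) and their values are assumptions made up for this example; only the key names `OPEID` and `INSTNM` come from the code above.

```r
library(dplyr)

# Toy stand-ins for the three yearly files (values are invented).
y1 <- data.frame(OPEID = c("001", "002", "003"), INSTNM = c("A", "B", "C"),
                 UGDS = c(100, 200, 300), stringsAsFactors = FALSE)
y2 <- data.frame(OPEID = c("001", "002"), INSTNM = c("A", "B"),
                 PREDDEG = c(3, 2), stringsAsFactors = FALSE)
y3 <- data.frame(OPEID = c("001", "003"), INSTNM = c("A", "C"),
                 LOCALE = c(11, 41), stringsAsFactors = FALSE)

# Join two at a time, always keeping every row of the left-hand table.
joined <- y1 %>%
  left_join(y2, by = c("OPEID", "INSTNM")) %>%
  left_join(y3, by = c("OPEID", "INSTNM"))

# Institutions without a match keep their row and get NA in the new columns.
print(joined)

# Duplicate keys in a right-hand table would multiply rows, so a quick check helps.
anyDuplicated(y2[c("OPEID", "INSTNM")])
```

A left join never drops rows from the first table, which is why institutions missing from the later files show up afterwards with missing values rather than disappearing.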
### Preparation Part 2
The rationale behind importing data from the 2014/2015 dataset was to use several of its variables as filters. In particular, I did not want to include colleges that were not currently operating. Additionally, colleges with multiple campuses posed a problem because they would needlessly complicate my research. Therefore, I removed these colleges and then removed the variables I had used for the filtering.
```{r Filtering out colleges,message = FALSE, warning = FALSE, tidy = TRUE}
CS2012 <- CS2012 %>% filter(CURROPER == 1, MAIN == 1, NUMBRANCH == 1) #Keep colleges that are currently operating and are main campuses with only one branch.
CS2012 <- select(CS2012, -c(OPEID, OPEID6, MAIN, NUMBRANCH, CURROPER)) #Remove variables that were used to filter out institutions.
```
## Cleaning
### Cleaning Part 1
At this point the dataset had been adequately prepared, so I began to clean it for final analysis. Several problems were obvious, and I worked to address and mitigate them.
The main issue was that many of the variables had the wrong type. Because missing data was recorded as "NULL" in the original dataset, these variables were stored as character types when the data was imported. Thus, there were two problems I had to address: incorrect types and "NULL" values. At the same time, some of the variables I had were still not fully representative of what I wanted.
In a previous project, I excluded any institution that was missing a value for any of the variables I was interested in. However, this dramatically reduced the number of institutions in the dataset. In another project, I decided to fill missing values with the median of that variable based on the CONTROL value of the institution, i.e., for institutions with a CONTROL value of 1, the median of this group was calculated and used to fill the missing values for institutions in that group. However, that also biased the dataset. This time, I decided to use a more formal approach.
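For reference, the group-median fill described above could be sketched roughly as follows. This is only an illustration on invented numbers (the `toy` data frame is an assumption, not the project data), and it is not the approach used in the rest of this analysis.

```r
library(dplyr)

# Invented example: two CONTROL groups with one missing salary each.
toy <- data.frame(CONTROL = c(1, 1, 1, 2, 2),
                  AVGFACSAL = c(5000, NA, 7000, 4000, NA))

# Within each CONTROL group, replace missing values with that group's median.
filled <- toy %>%
  group_by(CONTROL) %>%
  mutate(AVGFACSAL = ifelse(is.na(AVGFACSAL),
                            median(AVGFACSAL, na.rm = TRUE),
                            AVGFACSAL)) %>%
  ungroup()

print(filled)
```

Every filled value sits exactly at its group's median, which is one way to see why this kind of fill pulls the data toward the middle of each group.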
First, I needed to convert all necessary columns to numeric.
```{r Cleaning 1: Numeric Transformation, message = FALSE, warning = FALSE, tidy = TRUE}
#Convert all necessary columns to numeric
##Some columns needed to be prepared beforehand
###The RPY_1YR_RT, LOAN_EVER, PELL_EVER, and MD_EARN_WNE_P10 variables had "PrivacySuppressed" for some of the data.
CS2012 <- CS2012 %>% filter(RPY_1YR_RT!="PrivacySuppressed", LOAN_EVER!="PrivacySuppressed", PELL_EVER!="PrivacySuppressed", MD_EARN_WNE_P10!="PrivacySuppressed") #filter out suppressed data
#Numeric Transformation
CS2012col <- CS2012 %>% select(-c(INSTNM, CONTROL, REGION)) #Create dataset with only columns for numeric transformation
CS2012col <- (map(CS2012col, as.numeric)) #Apply the as.numeric function over every column of the dataset
CS2012 <- cbind((CS2012 %>% select(INSTNM, CONTROL, REGION)), CS2012col) #Bind together the identifier columns and data columns and reassign to CS2012
CS2012 <- as_tibble(CS2012) #To aid ease of manipulation, converted the data frame into a tibble (as_tibble() replaces the deprecated as_data_frame()).
CS2012 <- unique(CS2012) #Deleting duplicated rows
```
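To show what the numeric conversion does to the placeholder strings, here is a minimal base R sketch. The `toy` data frame and its values are invented for illustration; only the column names and the two placeholder strings come from the text above.

```r
# Toy character columns, mimicking how the placeholders arrive after import.
toy <- data.frame(
  INSTNM = c("A", "B", "C"),
  SAT_AVG_ALL = c("1100", "NULL", "980"),
  PELL_EVER = c("0.45", "PrivacySuppressed", "NULL"),
  stringsAsFactors = FALSE
)

# Drop rows with suppressed values first, as the chunk above does.
toy <- toy[toy$PELL_EVER != "PrivacySuppressed", ]

# as.numeric() turns any string that is not a number (here "NULL") into NA.
num_cols <- setdiff(names(toy), "INSTNM")
toy[num_cols] <- lapply(toy[num_cols], function(x) suppressWarnings(as.numeric(x)))

str(toy)
```

As far as I know, `read_csv()` also has an `na` argument that lets placeholder strings be declared as missing at import time; that would be an alternative route and is not what the chunk above does.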
I needed to combine the C150, NPT4, and NUM4 variables, which were separated based on the type of institution. For example, a public institution has data for NPT4_PUB but not for NPT4_PRIV. I combine these variables below. I also excluded universities with 0 for any of the combined variables, since a value of 0 simply did not make sense.
```{r Cleaning 2, message = FALSE, warning = FALSE, tidy = TRUE}
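#Replace missing values in two mutually exclusive columns with `replacement`, then add the columns to combine them
repmiss_combcol <- function (df, var1, var2, replacement) {
  df[[var1]][is.na(df[[var1]])] <- replacement
  df[[var2]][is.na(df[[var2]])] <- replacement
  df[[var1]] + df[[var2]]
}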
#C150, NPT4, NUM4
CS2012[["C150"]] <- repmiss_combcol(df = CS2012, "C150_4", "C150_L4", replacement = 0)
CS2012[["NPT4"]] <- repmiss_combcol(df = CS2012, "NPT4_PRIV", "NPT4_PUB", replacement = 0)
CS2012[["NUM4"]] <- repmiss_combcol(df = CS2012, "NUM4_PRIV", "NUM4_PUB", replacement = 0)
CS2012 <- select(CS2012, -c(C150_L4, C150_4, NPT4_PUB, NPT4_PRIV, NUM4_PUB, NUM4_PRIV)) #Removing variables that are now unnecessary
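# About 200 universities had very little data, so they were removed. Will figure out what to do with them later
Excluded <- filter(CS2012, C150 == 0 | NPT4 == 0 | NUM4 == 0 | is.na(REGION))
CS2012 <- filter(CS2012, C150 != 0, NPT4 != 0, NUM4 != 0, !is.na(REGION)) #Keep only rows that meet none of the exclusion conditions; joining the conditions with | would keep rows where just one variable is non-zero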
```
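The combining step can be checked on a tiny example. The sketch below repeats the `repmiss_combcol()` helper from the chunk above so that it runs on its own; the `toy` data frame and its values are invented for illustration.

```r
# Same helper as above, repeated here so the sketch is self-contained.
repmiss_combcol <- function(df, var1, var2, replacement) {
  df[[var1]][is.na(df[[var1]])] <- replacement
  df[[var2]][is.na(df[[var2]])] <- replacement
  df[[var1]] + df[[var2]]
}

# Invented rows: a public institution, a private one, and one with no data at all.
toy <- data.frame(NPT4_PUB = c(12000, NA, NA),
                  NPT4_PRIV = c(NA, 21000, NA),
                  C150 = c(0.5, 0, 0.4))
toy$NPT4 <- repmiss_combcol(toy, "NPT4_PUB", "NPT4_PRIV", replacement = 0)

# A combined value of 0 means neither source column had data.
print(toy$NPT4)

# Excluding rows with a 0 in ANY combined column keeps only rows where ALL are non-zero.
drop <- toy$C150 == 0 | toy$NPT4 == 0
kept <- toy[!drop, ]
print(kept)
```

Because each institution has data in only one of the two source columns, adding them after replacing `NA` with 0 simply picks whichever value exists.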
The "PCIPxy" variables, where x and y are digits ranging from 0 to 9, represent the percentage of degrees an institution awarded in several STEM fields. These variables were taken from a larger list covering the degrees an institution awarded in many fields. I used this [STEM Degree list](http://stemdegreelist.com) to narrow down the variables. Because I was interested in the total percentage of degrees an institution awarded in STEM generally, I added up the percentages and created a new variable.
```{r Cleaning 4, message = FALSE, warning = FALSE, tidy = TRUE}
#Creating a new combination variable
CS2012 <- CS2012 %>% mutate(PCIPSTEM = PCIP01 + PCIP03 + PCIP04 + PCIP10 + PCIP11 + PCIP14 + PCIP15 + PCIP26 + PCIP27 + PCIP29 + PCIP40 + PCIP41)
#Removing the old variables
CS2012 <- select(CS2012, -c(PCIP01, PCIP03, PCIP04, PCIP10, PCIP11, PCIP14, PCIP15, PCIP26, PCIP27, PCIP29, PCIP40, PCIP41))
```
I am not interested in the specific mission of the school, just whether it serves minorities. I added up the dummy variables (HBCU, PBI, ANNHI, TRIBAL, AANAPII, HSI, NANTI, MENONLY, WOMENONLY) and created a new variable (MINORITY).
```{r Cleaning 4b, message = FALSE, warning = FALSE, tidy = TRUE}
#Creating a new combination variable
CS2012 <- CS2012 %>% mutate(MINORITY = HBCU + PBI + ANNHI + TRIBAL + AANAPII + HSI + NANTI + MENONLY + WOMENONLY)
#Removing the old variables
CS2012 <- select(CS2012, -c(HBCU, PBI, ANNHI, TRIBAL, AANAPII, HSI, NANTI, MENONLY, WOMENONLY))
```
After this, I filled missing data using a Formal Imputation Methodology.
```{r Formal Imputation Methodology: Recoding, message = FALSE, warning = FALSE, tidy = TRUE}
#Gelman argues that a good approach for situations like this (where multiple variables have missing values) is to fit a regression to the observed cases and then use it to predict the missing cases. However, using the resulting deterministic imputations would falsely imply that most of these nonrespondents had values in the middle of the range. To fix this, we can add the prediction error to the regression predictions, which puts the uncertainty back into the imputations. Several variables had no missing values: MINORITY, NUM4, NPT4, C150, LOCALE, PREDDEG, REGION, CONTROL, and INSTNM.
#With the exception of INSTNM, these variables were used to predict the variables below. As each variable was predicted and completed, it was added to the regression to help predict the next variable. I imputed the variables with the fewest missing values first, so the variables with the most missing values were predicted using the most data available at that point.
#The dataset was filled based on the group to which the institution belonged. To make this simple, I first split the data into 3 groups based on CONTROL for imputation.
CS2012C1 <- CS2012 %>% filter(CONTROL == 1)
CS2012C2 <- CS2012 %>% filter(CONTROL == 2)
CS2012C3 <- CS2012 %>% filter(CONTROL == 3)
```
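To make the idea concrete, here is a minimal sketch of regression imputation with added noise on simulated data. The toy variables `x` and `y`, the seed, and the sample sizes are all assumptions made for illustration; nothing here comes from the College Scorecard data.

```r
set.seed(1)

# Simulated data: y depends on x, and 40 of the 200 y values are then removed.
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[sample(n, 40)] <- NA
toy <- data.frame(x = x, y = y)

# Fit the regression on the observed cases (lm() drops rows with NA by default).
fit <- lm(y ~ x, data = toy)

# Deterministic imputations: one fitted value per row of the data.
det <- predict(fit, newdata = toy)

# Random imputations: add residual noise, drawing one value per row.
rand <- rnorm(nrow(toy), mean = det, sd = sigma(fit))

# Keep observed values and fill only the missing ones.
impute <- function(a, a.impute) ifelse(is.na(a), a.impute, a)
y_det <- impute(toy$y, det)
y_rand <- impute(toy$y, rand)

# Compare the spread of the two completed variables.
c(sd_deterministic = sd(y_det), sd_random = sd(y_rand))
```

The number of draws passed to `rnorm()` matches the number of rows, so each missing value is filled from the prediction for its own row. Drawing fewer values would make `rnorm()` use only the first few predictions.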
```{r Formal Imputation C1: Recoding, message = FALSE, warning = FALSE, tidy = TRUE}
attach(CS2012C1) #Attaching the dataset so we can easily call on variables
impute <- function(a, a.impute){
  ifelse(is.na(a), a.impute, a) #Keep observed values and fill missing ones with the imputed values
}
#For TUITFTE
lm.imp.TUITFTE <- lm(TUITFTE ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY, data = CS2012C1)
pred.TUITFTE <- rnorm(nrow(CS2012C1), predict(lm.imp.TUITFTE, CS2012C1), sigma(lm.imp.TUITFTE)) #Draw one value per row so that each missing value is filled from its own row's prediction
TUITFTE.imp <- impute(TUITFTE, pred.TUITFTE)
CS2012C1$TUITFTE <- TUITFTE.imp
#For INEXPFTE
lm.imp.INEXPFTE <- lm(INEXPFTE ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE, data = CS2012C1)
pred.INEXPFTE <- rnorm(nrow(CS2012C1), predict(lm.imp.INEXPFTE, CS2012C1), sigma(lm.imp.INEXPFTE))
INEXPFTE.imp <- impute(INEXPFTE, pred.INEXPFTE)
CS2012C1$INEXPFTE <- INEXPFTE.imp
#For PELL_EVER
lm.imp.PELL_EVER <- lm(PELL_EVER ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE, data = CS2012C1)
pred.PELL_EVER <- rnorm(nrow(CS2012C1), predict(lm.imp.PELL_EVER, CS2012C1), sigma(lm.imp.PELL_EVER))
PELL_EVER.imp <- impute(PELL_EVER, pred.PELL_EVER)
CS2012C1$PELL_EVER <- PELL_EVER.imp
#For LOAN_EVER
lm.imp.LOAN_EVER <- lm(LOAN_EVER ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER, data = CS2012C1)
pred.LOAN_EVER <- rnorm(nrow(CS2012C1), predict(lm.imp.LOAN_EVER, CS2012C1), sigma(lm.imp.LOAN_EVER))
LOAN_EVER.imp <- impute(LOAN_EVER, pred.LOAN_EVER)
CS2012C1$LOAN_EVER <- LOAN_EVER.imp
#For UGDS
lm.imp.UGDS <- lm(UGDS ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER, data = CS2012C1)
pred.UGDS <- rnorm(nrow(CS2012C1), predict(lm.imp.UGDS, CS2012C1), sigma(lm.imp.UGDS))
UGDS.imp <- impute(UGDS, pred.UGDS)
CS2012C1$UGDS <- UGDS.imp
#For UGDS_ASIAN
lm.imp.UGDS_ASIAN <- lm(UGDS_ASIAN ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS, data = CS2012C1)
pred.UGDS_ASIAN <- rnorm(nrow(CS2012C1), predict(lm.imp.UGDS_ASIAN, CS2012C1), sigma(lm.imp.UGDS_ASIAN))
UGDS_ASIAN.imp <- impute(UGDS_ASIAN, pred.UGDS_ASIAN)
CS2012C1$UGDS_ASIAN <- UGDS_ASIAN.imp
#For UGDS_BLACK
lm.imp.UGDS_BLACK <- lm(UGDS_BLACK ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN, data = CS2012C1)
pred.UGDS_BLACK <- rnorm(nrow(CS2012C1), predict(lm.imp.UGDS_BLACK, CS2012C1), sigma(lm.imp.UGDS_BLACK))
UGDS_BLACK.imp <- impute(UGDS_BLACK, pred.UGDS_BLACK)
CS2012C1$UGDS_BLACK <- UGDS_BLACK.imp
#For UGDS_HISP
lm.imp.UGDS_HISP <- lm(UGDS_HISP ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK, data = CS2012C1)
pred.UGDS_HISP <- rnorm(nrow(CS2012C1), predict(lm.imp.UGDS_HISP, CS2012C1), sigma(lm.imp.UGDS_HISP))
UGDS_HISP.imp <- impute(UGDS_HISP, pred.UGDS_HISP)
CS2012C1$UGDS_HISP <- UGDS_HISP.imp
#For UGDS_NRA
lm.imp.UGDS_NRA <- lm(UGDS_NRA ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP, data = CS2012C1)
pred.UGDS_NRA <- rnorm(nrow(CS2012C1), predict(lm.imp.UGDS_NRA, CS2012C1), sigma(lm.imp.UGDS_NRA))
UGDS_NRA.imp <- impute(UGDS_NRA, pred.UGDS_NRA)
CS2012C1$UGDS_NRA <- UGDS_NRA.imp
#For UGDS_WHITE
lm.imp.UGDS_WHITE <- lm(UGDS_WHITE ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA, data = CS2012C1)
pred.UGDS_WHITE <- rnorm(nrow(CS2012C1), predict(lm.imp.UGDS_WHITE, CS2012C1), sigma(lm.imp.UGDS_WHITE))
UGDS_WHITE.imp <- impute(UGDS_WHITE, pred.UGDS_WHITE)
CS2012C1$UGDS_WHITE <- UGDS_WHITE.imp
#For UGDS_WOMEN
lm.imp.UGDS_WOMEN <- lm(UGDS_WOMEN ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE, data = CS2012C1)
pred.UGDS_WOMEN <- rnorm(nrow(CS2012C1), predict(lm.imp.UGDS_WOMEN, CS2012C1), sigma(lm.imp.UGDS_WOMEN))
UGDS_WOMEN.imp <- impute(UGDS_WOMEN, pred.UGDS_WOMEN)
CS2012C1$UGDS_WOMEN <- UGDS_WOMEN.imp
#For PCIPSTEM
lm.imp.PCIPSTEM <- lm(PCIPSTEM ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN, data = CS2012C1)
pred.PCIPSTEM <- rnorm(nrow(CS2012C1), predict(lm.imp.PCIPSTEM, CS2012C1), sigma(lm.imp.PCIPSTEM))
PCIPSTEM.imp <- impute(PCIPSTEM, pred.PCIPSTEM)
CS2012C1$PCIPSTEM <- PCIPSTEM.imp
#For PPTUG_EF
lm.imp.PPTUG_EF <- lm(PPTUG_EF ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM, data = CS2012C1)
pred.PPTUG_EF <- rnorm(nrow(CS2012C1), predict(lm.imp.PPTUG_EF, CS2012C1), sigma(lm.imp.PPTUG_EF))
PPTUG_EF.imp <- impute(PPTUG_EF, pred.PPTUG_EF)
CS2012C1$PPTUG_EF <- PPTUG_EF.imp
#For PCTPELL
lm.imp.PCTPELL <- lm(PCTPELL ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF, data = CS2012C1) #Fit on the CONTROL == 1 subset, like every other model in this chunk
pred.PCTPELL <- rnorm(nrow(CS2012C1), predict(lm.imp.PCTPELL, CS2012C1), sigma(lm.imp.PCTPELL))
PCTPELL.imp <- impute(PCTPELL, pred.PCTPELL)
CS2012C1$PCTPELL <- PCTPELL.imp
#For RPY_1YR_RT
lm.imp.RPY_1YR_RT <- lm(RPY_1YR_RT ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL, data = CS2012C1)
pred.RPY_1YR_RT <- rnorm(nrow(CS2012C1), predict(lm.imp.RPY_1YR_RT, CS2012C1), sigma(lm.imp.RPY_1YR_RT))
RPY_1YR_RT.imp <- impute(RPY_1YR_RT, pred.RPY_1YR_RT)
CS2012C1$RPY_1YR_RT <- RPY_1YR_RT.imp
#For AVGFACSAL
lm.imp.AVGFACSAL <- lm(AVGFACSAL ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT, data = CS2012C1)
pred.AVGFACSAL <- rnorm(nrow(CS2012C1), predict(lm.imp.AVGFACSAL, CS2012C1), sigma(lm.imp.AVGFACSAL))
AVGFACSAL.imp <- impute(AVGFACSAL, pred.AVGFACSAL)
CS2012C1$AVGFACSAL <- AVGFACSAL.imp
#For COSTT4_A
lm.imp.COSTT4_A <- lm(COSTT4_A ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT + AVGFACSAL, data = CS2012C1)
pred.COSTT4_A <- rnorm(nrow(CS2012C1), predict(lm.imp.COSTT4_A, CS2012C1), sigma(lm.imp.COSTT4_A))
COSTT4_A.imp <- impute(COSTT4_A, pred.COSTT4_A)
CS2012C1$COSTT4_A <- COSTT4_A.imp
#For PFTFTUG1_EF
lm.imp.PFTFTUG1_EF <- lm(PFTFTUG1_EF ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT + AVGFACSAL + COSTT4_A, data = CS2012C1)
pred.PFTFTUG1_EF <- rnorm(nrow(CS2012C1), predict(lm.imp.PFTFTUG1_EF, CS2012C1), sigma(lm.imp.PFTFTUG1_EF))
PFTFTUG1_EF.imp <- impute(PFTFTUG1_EF, pred.PFTFTUG1_EF)
CS2012C1$PFTFTUG1_EF <- PFTFTUG1_EF.imp
#For SAT_AVG_ALL
lm.imp.SAT_AVG_ALL <- lm(SAT_AVG_ALL ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT + AVGFACSAL + COSTT4_A + PFTFTUG1_EF, data = CS2012C1)
pred.SAT_AVG_ALL <- rnorm(nrow(CS2012C1), predict(lm.imp.SAT_AVG_ALL, CS2012C1), sigma(lm.imp.SAT_AVG_ALL))
SAT_AVG_ALL.imp <- impute(SAT_AVG_ALL, pred.SAT_AVG_ALL)
CS2012C1$SAT_AVG_ALL <- SAT_AVG_ALL.imp
#For MD_EARN_WNE_P10
lm.imp.MD_EARN_WNE_P10 <- lm(MD_EARN_WNE_P10 ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT + AVGFACSAL + COSTT4_A + PFTFTUG1_EF + SAT_AVG_ALL, data = CS2012C1)
pred.MD_EARN_WNE_P10 <- rnorm(nrow(CS2012C1), predict(lm.imp.MD_EARN_WNE_P10, CS2012C1), sigma(lm.imp.MD_EARN_WNE_P10))
MD_EARN_WNE_P10.imp <- impute(MD_EARN_WNE_P10, pred.MD_EARN_WNE_P10)
CS2012C1$MD_EARN_WNE_P10 <- MD_EARN_WNE_P10.imp
```
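The chunk above repeats the same four steps for every variable, so the pattern could also be written as a loop. The sketch below is one possible way to do that, shown on simulated data; the toy variables (`z`, `a`, `b`), the seed, and the names `complete_vars` and `to_impute` are hypothetical and are not part of the analysis.

```r
set.seed(42)

# Simulated data: z is complete, a and b have missing values.
n <- 100
toy <- data.frame(z = rnorm(n))
toy$a <- 1 + toy$z + rnorm(n)
toy$b <- 2 - toy$z + 0.5 * toy$a + rnorm(n)
toy$a[sample(n, 5)] <- NA
toy$b[sample(n, 15)] <- NA

# Keep observed values and fill only the missing ones.
impute <- function(a, a.impute) ifelse(is.na(a), a.impute, a)

complete_vars <- "z"        # variables with no missing values
to_impute <- c("a", "b")    # ordered from fewest to most missing values

predictors <- complete_vars
for (v in to_impute) {
  # Build the formula v ~ predictors and fit it on the observed cases.
  fit <- lm(reformulate(predictors, response = v), data = toy)
  # One noisy prediction per row; only rows with missing v are used.
  draw <- rnorm(nrow(toy), predict(fit, newdata = toy), sigma(fit))
  toy[[v]] <- impute(toy[[v]], draw)
  # The completed variable becomes a predictor for the next one.
  predictors <- c(predictors, v)
}

colSums(is.na(toy))
```

Written this way, the imputation order lives in a single vector, which makes it easier to reuse the same steps for each CONTROL group.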
```{r Formal Imputation C2: Recoding, message = FALSE, warning = FALSE, tidy = TRUE}
attach(CS2012C2) #Attaching the dataset so we can easily call on variables
impute <- function(a, a.impute){
  ifelse(is.na(a), a.impute, a) #Keep observed values and fill missing ones with the imputed values
}
#For TUITFTE
lm.imp.TUITFTE <- lm(TUITFTE ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY, data = CS2012C2)
pred.TUITFTE <- rnorm(nrow(CS2012C2), predict(lm.imp.TUITFTE, CS2012C2), sigma(lm.imp.TUITFTE)) #Draw one value per row so that each missing value is filled from its own row's prediction
TUITFTE.imp <- impute(TUITFTE, pred.TUITFTE)
CS2012C2$TUITFTE <- TUITFTE.imp
#For INEXPFTE
lm.imp.INEXPFTE <- lm(INEXPFTE ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE, data = CS2012C2)
pred.INEXPFTE <- rnorm(nrow(CS2012C2), predict(lm.imp.INEXPFTE, CS2012C2), sigma(lm.imp.INEXPFTE))
INEXPFTE.imp <- impute(INEXPFTE, pred.INEXPFTE)
CS2012C2$INEXPFTE <- INEXPFTE.imp
#For PELL_EVER
lm.imp.PELL_EVER <- lm(PELL_EVER ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE, data = CS2012C2)
pred.PELL_EVER <- rnorm(nrow(CS2012C2), predict(lm.imp.PELL_EVER, CS2012C2), sigma(lm.imp.PELL_EVER))
PELL_EVER.imp <- impute(PELL_EVER, pred.PELL_EVER)
CS2012C2$PELL_EVER <- PELL_EVER.imp
#For LOAN_EVER
lm.imp.LOAN_EVER <- lm(LOAN_EVER ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER, data = CS2012C2)
pred.LOAN_EVER <- rnorm(nrow(CS2012C2), predict(lm.imp.LOAN_EVER, CS2012C2), sigma(lm.imp.LOAN_EVER))
LOAN_EVER.imp <- impute(LOAN_EVER, pred.LOAN_EVER)
CS2012C2$LOAN_EVER <- LOAN_EVER.imp
#For UGDS
lm.imp.UGDS <- lm(UGDS ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER, data = CS2012C2)
pred.UGDS <- rnorm(nrow(CS2012C2), predict(lm.imp.UGDS, CS2012C2), sigma(lm.imp.UGDS))
UGDS.imp <- impute(UGDS, pred.UGDS)
CS2012C2$UGDS <- UGDS.imp
#For UGDS_ASIAN
lm.imp.UGDS_ASIAN <- lm(UGDS_ASIAN ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS, data = CS2012C2)
pred.UGDS_ASIAN <- rnorm(nrow(CS2012C2), predict(lm.imp.UGDS_ASIAN, CS2012C2), sigma(lm.imp.UGDS_ASIAN))
UGDS_ASIAN.imp <- impute(UGDS_ASIAN, pred.UGDS_ASIAN)
CS2012C2$UGDS_ASIAN <- UGDS_ASIAN.imp
#For UGDS_BLACK
lm.imp.UGDS_BLACK <- lm(UGDS_BLACK ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN, data = CS2012C2)
pred.UGDS_BLACK <- rnorm(nrow(CS2012C2), predict(lm.imp.UGDS_BLACK, CS2012C2), sigma(lm.imp.UGDS_BLACK))
UGDS_BLACK.imp <- impute(UGDS_BLACK, pred.UGDS_BLACK)
CS2012C2$UGDS_BLACK <- UGDS_BLACK.imp
#For UGDS_HISP
lm.imp.UGDS_HISP <- lm(UGDS_HISP ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK, data = CS2012C2)
pred.UGDS_HISP <- rnorm(nrow(CS2012C2), predict(lm.imp.UGDS_HISP, CS2012C2), sigma(lm.imp.UGDS_HISP))
UGDS_HISP.imp <- impute(UGDS_HISP, pred.UGDS_HISP)
CS2012C2$UGDS_HISP <- UGDS_HISP.imp
#For UGDS_NRA
lm.imp.UGDS_NRA <- lm(UGDS_NRA ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP, data = CS2012C2)
pred.UGDS_NRA <- rnorm(nrow(CS2012C2), predict(lm.imp.UGDS_NRA, CS2012C2), sigma(lm.imp.UGDS_NRA))
UGDS_NRA.imp <- impute(UGDS_NRA, pred.UGDS_NRA)
CS2012C2$UGDS_NRA <- UGDS_NRA.imp
#For UGDS_WHITE
lm.imp.UGDS_WHITE <- lm(UGDS_WHITE ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA, data = CS2012C2)
pred.UGDS_WHITE <- rnorm(nrow(CS2012C2), predict(lm.imp.UGDS_WHITE, CS2012C2), sigma(lm.imp.UGDS_WHITE))
UGDS_WHITE.imp <- impute(UGDS_WHITE, pred.UGDS_WHITE)
CS2012C2$UGDS_WHITE <- UGDS_WHITE.imp
#For UGDS_WOMEN
lm.imp.UGDS_WOMEN <- lm(UGDS_WOMEN ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE, data = CS2012C2)
pred.UGDS_WOMEN <- rnorm(nrow(CS2012C2), predict(lm.imp.UGDS_WOMEN, CS2012C2), sigma(lm.imp.UGDS_WOMEN))
UGDS_WOMEN.imp <- impute(UGDS_WOMEN, pred.UGDS_WOMEN)
CS2012C2$UGDS_WOMEN <- UGDS_WOMEN.imp
#For PCIPSTEM
lm.imp.PCIPSTEM <- lm(PCIPSTEM ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN, data = CS2012C2)
pred.PCIPSTEM <- rnorm(nrow(CS2012C2), predict(lm.imp.PCIPSTEM, CS2012C2), sigma(lm.imp.PCIPSTEM))
PCIPSTEM.imp <- impute(PCIPSTEM, pred.PCIPSTEM)
CS2012C2$PCIPSTEM <- PCIPSTEM.imp
#For PPTUG_EF
lm.imp.PPTUG_EF <- lm(PPTUG_EF ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM, data = CS2012C2)
pred.PPTUG_EF <- rnorm(nrow(CS2012C2), predict(lm.imp.PPTUG_EF, CS2012C2), sigma(lm.imp.PPTUG_EF))
PPTUG_EF.imp <- impute(PPTUG_EF, pred.PPTUG_EF)
CS2012C2$PPTUG_EF <- PPTUG_EF.imp
#For PCTPELL
lm.imp.PCTPELL <- lm(PCTPELL ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF, data = CS2012C2) #Fit on the CONTROL == 2 subset, like every other model in this chunk
pred.PCTPELL <- rnorm(nrow(CS2012C2), predict(lm.imp.PCTPELL, CS2012C2), sigma(lm.imp.PCTPELL))
PCTPELL.imp <- impute(PCTPELL, pred.PCTPELL)
CS2012C2$PCTPELL <- PCTPELL.imp
#For RPY_1YR_RT
lm.imp.RPY_1YR_RT <- lm(RPY_1YR_RT ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL, data = CS2012C2)
pred.RPY_1YR_RT <- rnorm(nrow(CS2012C2), predict(lm.imp.RPY_1YR_RT, CS2012C2), sigma(lm.imp.RPY_1YR_RT))
RPY_1YR_RT.imp <- impute(RPY_1YR_RT, pred.RPY_1YR_RT)
CS2012C2$RPY_1YR_RT <- RPY_1YR_RT.imp
#For AVGFACSAL
lm.imp.AVGFACSAL <- lm(AVGFACSAL ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT, data = CS2012C2)
pred.AVGFACSAL <- rnorm(nrow(CS2012C2), predict(lm.imp.AVGFACSAL, CS2012C2), sigma(lm.imp.AVGFACSAL))
AVGFACSAL.imp <- impute(AVGFACSAL, pred.AVGFACSAL)
CS2012C2$AVGFACSAL <- AVGFACSAL.imp
#For COSTT4_A*89
lm.imp.COSTT4_A <- lm(COSTT4_A ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT + AVGFACSAL, data = CS2012C2)
pred.COSTT4_A <- rnorm(nrow(CS2012C2), predict(lm.imp.COSTT4_A, CS2012C2), sigma(lm.imp.COSTT4_A))
COSTT4_A.imp <- impute(COSTT4_A, pred.COSTT4_A)
CS2012C2$COSTT4_A <- COSTT4_A.imp
#For PFTFTUG1_EF
lm.imp.PFTFTUG1_EF <- lm(PFTFTUG1_EF ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT + AVGFACSAL + COSTT4_A, data = CS2012C2)
pred.PFTFTUG1_EF <- rnorm(nrow(CS2012C2), predict(lm.imp.PFTFTUG1_EF, CS2012C2), sigma(lm.imp.PFTFTUG1_EF))
PFTFTUG1_EF.imp <- impute(PFTFTUG1_EF, pred.PFTFTUG1_EF)
CS2012C2$PFTFTUG1_EF <- PFTFTUG1_EF.imp
#For SAT_AVG_ALL
lm.imp.SAT_AVG_ALL <- lm(SAT_AVG_ALL ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT + AVGFACSAL + COSTT4_A + PFTFTUG1_EF, data = CS2012C2)
pred.SAT_AVG_ALL <- rnorm(nrow(CS2012C2), predict(lm.imp.SAT_AVG_ALL, CS2012C2), sigma(lm.imp.SAT_AVG_ALL))
SAT_AVG_ALL.imp <- impute(SAT_AVG_ALL, pred.SAT_AVG_ALL)
CS2012C2$SAT_AVG_ALL <- SAT_AVG_ALL.imp
#For MD_EARN_WNE_P10
lm.imp.MD_EARN_WNE_P10 <- lm(MD_EARN_WNE_P10 ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT + AVGFACSAL + COSTT4_A + PFTFTUG1_EF + SAT_AVG_ALL, data = CS2012C2)
pred.MD_EARN_WNE_P10 <- rnorm(nrow(CS2012C2), predict(lm.imp.MD_EARN_WNE_P10, CS2012C2), sigma(lm.imp.MD_EARN_WNE_P10))
MD_EARN_WNE_P10.imp <- impute(MD_EARN_WNE_P10, pred.MD_EARN_WNE_P10)
CS2012C2$MD_EARN_WNE_P10 <- MD_EARN_WNE_P10.imp
```
```{r Formal Imputation C3: Recoding, message = FALSE, warning = FALSE, tidy = TRUE}
attach(CS2012C3) #Attaching the dataset so we can easily call on variables
impute <- function(a, a.impute) { #Keep the observed values of a and use a.impute only where a is missing
  ifelse(is.na(a), a.impute, a)
}
#For TUITFTE
lm.imp.TUITFTE <- lm(TUITFTE ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY, data = CS2012C3)
pred.TUITFTE <- rnorm(nrow(CS2012C3), predict(lm.imp.TUITFTE, CS2012C3), sigma(lm.imp.TUITFTE)) #one random draw per row, centred on that row's prediction, so each imputed value lines up with the row it fills
TUITFTE.imp <- impute(TUITFTE, pred.TUITFTE)
CS2012C3$TUITFTE <- TUITFTE.imp
#For INEXPFTE
lm.imp.INEXPFTE <- lm(INEXPFTE ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE, data = CS2012C3)
pred.INEXPFTE <- rnorm(nrow(CS2012C3), predict(lm.imp.INEXPFTE, CS2012C3), sigma(lm.imp.INEXPFTE))
INEXPFTE.imp <- impute(INEXPFTE, pred.INEXPFTE)
CS2012C3$INEXPFTE <- INEXPFTE.imp
#For PELL_EVER
lm.imp.PELL_EVER <- lm(PELL_EVER ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE, data = CS2012C3)
pred.PELL_EVER <- rnorm(nrow(CS2012C3), predict(lm.imp.PELL_EVER, CS2012C3), sigma(lm.imp.PELL_EVER))
PELL_EVER.imp <- impute(PELL_EVER, pred.PELL_EVER)
CS2012C3$PELL_EVER <- PELL_EVER.imp
#For LOAN_EVER
lm.imp.LOAN_EVER <- lm(LOAN_EVER ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER, data = CS2012C3)
pred.LOAN_EVER <- rnorm(nrow(CS2012C3), predict(lm.imp.LOAN_EVER, CS2012C3), sigma(lm.imp.LOAN_EVER))
LOAN_EVER.imp <- impute(LOAN_EVER, pred.LOAN_EVER)
CS2012C3$LOAN_EVER <- LOAN_EVER.imp
#For UGDS
lm.imp.UGDS <- lm(UGDS ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER, data = CS2012C3)
pred.UGDS <- rnorm(nrow(CS2012C3), predict(lm.imp.UGDS, CS2012C3), sigma(lm.imp.UGDS))
UGDS.imp <- impute(UGDS, pred.UGDS)
CS2012C3$UGDS <- UGDS.imp
#For UGDS_ASIAN
lm.imp.UGDS_ASIAN <- lm(UGDS_ASIAN ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS, data = CS2012C3)
pred.UGDS_ASIAN <- rnorm(nrow(CS2012C3), predict(lm.imp.UGDS_ASIAN, CS2012C3), sigma(lm.imp.UGDS_ASIAN))
UGDS_ASIAN.imp <- impute(UGDS_ASIAN, pred.UGDS_ASIAN)
CS2012C3$UGDS_ASIAN <- UGDS_ASIAN.imp
#For UGDS_BLACK
lm.imp.UGDS_BLACK <- lm(UGDS_BLACK ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN, data = CS2012C3)
pred.UGDS_BLACK <- rnorm(nrow(CS2012C3), predict(lm.imp.UGDS_BLACK, CS2012C3), sigma(lm.imp.UGDS_BLACK))
UGDS_BLACK.imp <- impute(UGDS_BLACK, pred.UGDS_BLACK)
CS2012C3$UGDS_BLACK <- UGDS_BLACK.imp
#For UGDS_HISP
lm.imp.UGDS_HISP <- lm(UGDS_HISP ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK, data = CS2012C3)
pred.UGDS_HISP <- rnorm(nrow(CS2012C3), predict(lm.imp.UGDS_HISP, CS2012C3), sigma(lm.imp.UGDS_HISP))
UGDS_HISP.imp <- impute(UGDS_HISP, pred.UGDS_HISP)
CS2012C3$UGDS_HISP <- UGDS_HISP.imp
#For UGDS_NRA
lm.imp.UGDS_NRA <- lm(UGDS_NRA ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP, data = CS2012C3)
pred.UGDS_NRA <- rnorm(nrow(CS2012C3), predict(lm.imp.UGDS_NRA, CS2012C3), sigma(lm.imp.UGDS_NRA))
UGDS_NRA.imp <- impute(UGDS_NRA, pred.UGDS_NRA)
CS2012C3$UGDS_NRA <- UGDS_NRA.imp
#For UGDS_WHITE
lm.imp.UGDS_WHITE <- lm(UGDS_WHITE ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA, data = CS2012C3)
pred.UGDS_WHITE <- rnorm(nrow(CS2012C3), predict(lm.imp.UGDS_WHITE, CS2012C3), sigma(lm.imp.UGDS_WHITE))
UGDS_WHITE.imp <- impute(UGDS_WHITE, pred.UGDS_WHITE)
CS2012C3$UGDS_WHITE <- UGDS_WHITE.imp
#For UGDS_WOMEN
lm.imp.UGDS_WOMEN <- lm(UGDS_WOMEN ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE, data = CS2012C3)
pred.UGDS_WOMEN <- rnorm(nrow(CS2012C3), predict(lm.imp.UGDS_WOMEN, CS2012C3), sigma(lm.imp.UGDS_WOMEN))
UGDS_WOMEN.imp <- impute(UGDS_WOMEN, pred.UGDS_WOMEN)
CS2012C3$UGDS_WOMEN <- UGDS_WOMEN.imp
#For PCIPSTEM
lm.imp.PCIPSTEM <- lm(PCIPSTEM ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN, data = CS2012C3)
pred.PCIPSTEM <- rnorm(nrow(CS2012C3), predict(lm.imp.PCIPSTEM, CS2012C3), sigma(lm.imp.PCIPSTEM))
PCIPSTEM.imp <- impute(PCIPSTEM, pred.PCIPSTEM)
CS2012C3$PCIPSTEM <- PCIPSTEM.imp
#For PPTUG_EF
lm.imp.PPTUG_EF <- lm(PPTUG_EF ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM, data = CS2012C3)
pred.PPTUG_EF <- rnorm(nrow(CS2012C3), predict(lm.imp.PPTUG_EF, CS2012C3), sigma(lm.imp.PPTUG_EF))
PPTUG_EF.imp <- impute(PPTUG_EF, pred.PPTUG_EF)
CS2012C3$PPTUG_EF <- PPTUG_EF.imp
#For PCTPELL
lm.imp.PCTPELL <- lm(PCTPELL ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF, data = CS2012C3)
pred.PCTPELL <- rnorm(nrow(CS2012C3), predict(lm.imp.PCTPELL, CS2012C3), sigma(lm.imp.PCTPELL))
PCTPELL.imp <- impute(PCTPELL, pred.PCTPELL)
CS2012C3$PCTPELL <- PCTPELL.imp
#For RPY_1YR_RT
lm.imp.RPY_1YR_RT <- lm(RPY_1YR_RT ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL, data = CS2012C3)
pred.RPY_1YR_RT <- rnorm(nrow(CS2012C3), predict(lm.imp.RPY_1YR_RT, CS2012C3), sigma(lm.imp.RPY_1YR_RT))
RPY_1YR_RT.imp <- impute(RPY_1YR_RT, pred.RPY_1YR_RT)
CS2012C3$RPY_1YR_RT <- RPY_1YR_RT.imp
#For AVGFACSAL*3 missing
lm.imp.AVGFACSAL <- lm(AVGFACSAL ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT, data = CS2012C3)
pred.AVGFACSAL <- rnorm(nrow(CS2012C3), predict(lm.imp.AVGFACSAL, CS2012C3), sigma(lm.imp.AVGFACSAL))
AVGFACSAL.imp <- impute(AVGFACSAL, pred.AVGFACSAL)
CS2012C3$AVGFACSAL <- AVGFACSAL.imp
#For COSTT4_A*89
lm.imp.COSTT4_A <- lm(COSTT4_A ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT + AVGFACSAL, data = CS2012C3)
pred.COSTT4_A <- rnorm(nrow(CS2012C3), predict(lm.imp.COSTT4_A, CS2012C3), sigma(lm.imp.COSTT4_A))
COSTT4_A.imp <- impute(COSTT4_A, pred.COSTT4_A)
CS2012C3$COSTT4_A <- COSTT4_A.imp
#For PFTFTUG1_EF
lm.imp.PFTFTUG1_EF <- lm(PFTFTUG1_EF ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT + AVGFACSAL + COSTT4_A, data = CS2012C3)
pred.PFTFTUG1_EF <- rnorm(nrow(CS2012C3), predict(lm.imp.PFTFTUG1_EF, CS2012C3), sigma(lm.imp.PFTFTUG1_EF))
PFTFTUG1_EF.imp <- impute(PFTFTUG1_EF, pred.PFTFTUG1_EF)
CS2012C3$PFTFTUG1_EF <- PFTFTUG1_EF.imp
#For SAT_AVG_ALL
lm.imp.SAT_AVG_ALL <- lm(SAT_AVG_ALL ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT + AVGFACSAL + COSTT4_A + PFTFTUG1_EF, data = CS2012C3)
pred.SAT_AVG_ALL <- rnorm(nrow(CS2012C3), predict(lm.imp.SAT_AVG_ALL, CS2012C3), sigma(lm.imp.SAT_AVG_ALL))
SAT_AVG_ALL.imp <- impute(SAT_AVG_ALL, pred.SAT_AVG_ALL)
CS2012C3$SAT_AVG_ALL <- SAT_AVG_ALL.imp
#For MD_EARN_WNE_P10
lm.imp.MD_EARN_WNE_P10 <- lm(MD_EARN_WNE_P10 ~ CONTROL + REGION + PREDDEG + LOCALE + C150 + NPT4 + NUM4 + MINORITY + TUITFTE + INEXPFTE + PELL_EVER + LOAN_EVER + UGDS + UGDS_ASIAN + UGDS_BLACK + UGDS_HISP + UGDS_NRA + UGDS_WHITE + UGDS_WOMEN + PCIPSTEM + PPTUG_EF + PCTPELL + RPY_1YR_RT + AVGFACSAL + COSTT4_A + PFTFTUG1_EF + SAT_AVG_ALL, data = CS2012C3)
pred.MD_EARN_WNE_P10 <- rnorm(nrow(CS2012C3), predict(lm.imp.MD_EARN_WNE_P10, CS2012C3), sigma(lm.imp.MD_EARN_WNE_P10))
MD_EARN_WNE_P10.imp <- impute(MD_EARN_WNE_P10, pred.MD_EARN_WNE_P10)
CS2012C3$MD_EARN_WNE_P10 <- MD_EARN_WNE_P10.imp
```
```{r Binding C1,C2,C3: Recoding, message = FALSE, warning = FALSE, tidy = TRUE}
#Bind the different groups together to recreate CS2012, convert to tibble, delete extra tables
CS2012 <- rbind(CS2012C1, CS2012C2, CS2012C3)
CS2012 <- as_tibble(CS2012) #To aid ease of manipulation converted the data frame into a tibble; as_tibble() replaces the deprecated as_data_frame().
rm(CS2012C1, CS2012C2, CS2012C3)
#Clean up
rm(list = ls(pattern = "lm"))
rm(list = ls(pattern = "pred"))
rm(list = ls(pattern = "UGDS"))
rm(list = ls(pattern = ".imp"))
```
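The imputation chunks above repeat one pattern per variable: fit a linear model, draw from a normal distribution centred on each prediction, and fill only the missing entries. As a minimal sketch, that pattern could be wrapped in a helper. The function `impute_stochastic()` and the toy data are hypothetical and not part of the original analysis; the sketch assumes the predictors themselves are complete.

```r
# Hypothetical helper (not part of the original analysis): stochastic
# regression imputation for one numeric column, aligned row by row.
impute_stochastic <- function(data, formula, target) {
  fit <- lm(formula, data = data)                 # lm() drops rows with a missing response
  mu <- predict(fit, newdata = data)              # one prediction per row
  draws <- rnorm(nrow(data), mean = mu, sd = sigma(fit))  # one draw per row
  missing <- is.na(data[[target]])
  data[[target]][missing] <- draws[missing]       # fill only the missing rows
  data
}

# Toy data, made up for illustration
set.seed(1)
toy <- data.frame(x = 1:20)
toy$y <- 2 * toy$x + rnorm(20)
toy$y[c(3, 11, 17)] <- NA

toy_imputed <- impute_stochastic(toy, y ~ x, "y")
sum(is.na(toy_imputed$y))  # number of values still missing after imputation
```

Drawing one value per row keeps every imputed value tied to the prediction for the row it fills, and observed values are left untouched.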
Many of the categorical variables are stored as numeric codes and needed to be recoded into labelled factors.
```{r Cleaning 5: Recoding, message = FALSE, warning = FALSE, tidy = TRUE}
CS2012$CONTROL <- recode_factor(CS2012$CONTROL, '1' = "Public", '2' = "Private nonprofit", '3' = "Private for-profit")
CS2012$REGION <- recode_factor(CS2012$REGION, '1' = "New England", '2' = "Mid East", '3' = "Great Lakes", '4' = "Plains", '5' = "Southeast", '6' = "Southwest", '7' = "Rocky Mountains", '8' = "Far West", '9' = "Outlying Areas")
CS2012$PREDDEG <- recode_factor(CS2012$PREDDEG, '0' = "Not classified", '1' = "Predominantly certificate-degree granting", '2' = "Predominantly associate's-degree granting", '3' = "Predominantly bachelor's-degree granting", '4' = "Entirely graduate-degree granting")
CS2012$LOCALE <- recode_factor(CS2012$LOCALE, '11' = "Large City", '12' = "Midsize City", '13' = "Small City", '21' = "Large Suburb", '22' = "Midsize Suburb", '23' = "Small Suburb", '31' = "Fringe Town", '32' = "Distant Town", '33' = "Remote Town", '41' = "Rural Fringe Territory", '42' = "Rural Distant Territory", '43' = "Rural Remote Territory")
CS2012$MINORITY <- recode_factor(CS2012$MINORITY, '0' = 'Non-minority', '1' = 'Single-Minority', '2' = 'Double-Minority')
#Next variables were reordered and renamed.
CS2012 <- CS2012 %>% select(INSTNM, REGION, LOCALE, CONTROL, MINORITY, TUITFTE, INEXPFTE, AVGFACSAL, COSTT4_A, PREDDEG, NPT4, NUM4, SAT_AVG_ALL, PCIPSTEM, UGDS, UGDS_WHITE, UGDS_BLACK, UGDS_HISP, UGDS_ASIAN, UGDS_NRA, UGDS_WOMEN, PPTUG_EF, PFTFTUG1_EF, C150, LOAN_EVER, PELL_EVER, PCTPELL, RPY_1YR_RT, MD_EARN_WNE_P10)
CS2012dup <- CS2012 #Creating a duplicate of the dataset for later use
#Note the distribution of the institutions at this point is 1326 public, 559 private for-profit, and 986 private nonprofit institutions
```
R's modelling functions automatically turn qualitative variables (factors) into dummy variables, but when this is done automatically it is difficult to perform variable selection on individual levels. As a result, I converted the qualitative variables into explicit dummy variables using the tidyr package. This meant, for example, that every institution had Public, Private nonprofit, and Private for-profit columns, with a 1 only in the column for its own category and a 0 elsewhere.
```{r}
#Figure out how to take care of NAs
dummy <- function(x) { #function for quickly creating dummy variables
  x <- as.numeric(x) #convert the variable to a numeric type and reassign
  x[which(!is.na(x))] <- 1 #the conversion above means that any non-numeric data will be converted to NA
  x[which(is.na(x))] <- 0
  x #output the result
}
#Creating regional dummy variables from the REGION variable
CS2012 <- CS2012 %>% spread(REGION, REGION, 0) #Spreading the REGION variable
CS2012$`Far West` <- dummy(CS2012$`Far West`) #Using my dummy function to create a dummy variable here so that any institution in the Far West region will have a 1 and an institution anywhere else will have a 0, a similar methodology was followed for the other regional dummy variables below.
CS2012$`Great Lakes` <- dummy(CS2012$`Great Lakes`)
CS2012$`Mid East` <- dummy(CS2012$`Mid East`)
CS2012$`New England` <- dummy(CS2012$`New England`)
CS2012$Plains <- dummy(CS2012$Plains)
CS2012$`Rocky Mountains` <- dummy(CS2012$`Rocky Mountains`)
CS2012$Southeast <- dummy(CS2012$Southeast)
CS2012$Southwest <- dummy(CS2012$Southwest)
CS2012$`Outlying Areas` <- dummy(CS2012$`Outlying Areas`)
#Creating locale dummy variables from the LOCALE variable
CS2012 <- CS2012 %>% spread(LOCALE, LOCALE, 0) #Spreading the LOCALE variable
CS2012$`Distant Town` <- dummy(CS2012$`Distant Town`) #Using my dummy function to create a dummy variable here so that any institution located in a Distant Town will have a 1 and an institution anywhere else will have a 0, a similar methodology was followed for the other locale dummy variables below.
CS2012$`Fringe Town` <- dummy(CS2012$`Fringe Town`)
CS2012$`Large City` <- dummy(CS2012$`Large City`)
CS2012$`Large Suburb` <- dummy(CS2012$`Large Suburb`)
CS2012$`Midsize City` <- dummy(CS2012$`Midsize City`)
CS2012$`Midsize Suburb` <- dummy(CS2012$`Midsize Suburb`)
CS2012$`Remote Town` <- dummy(CS2012$`Remote Town`)
CS2012$`Rural Distant Territory` <- dummy(CS2012$`Rural Distant Territory`)
CS2012$`Rural Fringe Territory` <- dummy(CS2012$`Rural Fringe Territory`)
CS2012$`Rural Remote Territory` <- dummy(CS2012$`Rural Remote Territory`)
CS2012$`Small City` <- dummy(CS2012$`Small City`)
CS2012$`Small Suburb` <- dummy(CS2012$`Small Suburb`)
#Creating control dummy variables from the CONTROL variable
CS2012 <- CS2012 %>% spread(CONTROL, CONTROL, 0) #Spreading the CONTROL variable
CS2012$`Private for-profit` <- dummy(CS2012$`Private for-profit`) #Using my dummy function to create a dummy variable here so that any private for-profit institution will have a 1 and any other institution will have a 0, a similar methodology was followed for the other control dummy variables below.
CS2012$`Private nonprofit` <- dummy(CS2012$`Private nonprofit`)
CS2012$Public <- dummy(CS2012$Public)
#Creating predicted degree dummy variables from the PREDDEG variable
CS2012 <- CS2012 %>% spread(PREDDEG, PREDDEG, 0) #Spreading the PREDDEG variable.
CS2012$`Predominantly associate's-degree granting` <- dummy(CS2012$`Predominantly associate's-degree granting`) #Using my dummy function to create a dummy variable here so that any institution that predominantly grants associate degrees will have a 1 and an institution that predominantly grants one of the other two types of degrees will have a 0, a similar methodology was followed for the other predicted degree dummy variables below.
CS2012$`Predominantly bachelor's-degree granting` <- dummy(CS2012$`Predominantly bachelor's-degree granting`)
CS2012$`Predominantly certificate-degree granting` <- dummy(CS2012$`Predominantly certificate-degree granting`)
CS2012$`Not classified` <- dummy(CS2012$`Not classified`)
#Creating a minority dummy variable from the MINORITY variable.
CS2012 <- CS2012 %>% spread(MINORITY, MINORITY, 0) #Spreading the MINORITY variable
CS2012$`Non-minority` <- dummy(CS2012$`Non-minority`) #Using my dummy function to create a dummy variable here so that any non-minority institution will have a 1 and any other institution will have a 0, a similar methodology was followed for the other minority dummy variables below.
CS2012$`Single-Minority` <- dummy(CS2012$`Single-Minority`)
CS2012$`Double-Minority` <- dummy(CS2012$`Double-Minority`)
#Lastly, before beginning the analysis, the variables were reordered and renamed. Rather than eliminating one level from each set of dummy variables (consistent with proper methodology), I simply eliminated the intercept. In this case, the estimated effects have the standard interpretation. For control this was Public, for predicted degree this was not classified, and for minority this was non-minority.
CS2012 <- CS2012 %>% select(INSTNM, NEW_ENGLAND = `New England`, MID_EAST = `Mid East`, GREAT_LAKES = `Great Lakes`, PLAINS = Plains, SOUTH_EAST = Southeast, SOUTH_WEST = Southwest, ROCKY_MOUNT = `Rocky Mountains`, FAR_WEST = `Far West`, OUTLYING = `Outlying Areas`, LARGE_CITY = `Large City`, MID_SIZE_CITY = `Midsize City`, SMALL_CITY = `Small City`, LARGE_SUB = `Large Suburb`, MID_SIZE_SUB = `Midsize Suburb`, SMALL_SUB = `Small Suburb`, FRINGE_TOWN = `Fringe Town`, DIST_TOWN = `Distant Town`, REMOTE_TOWN = `Remote Town`, RURAL_FRINGE_TERR = `Rural Fringe Territory`, RURAL_DISTANT_TERR = `Rural Distant Territory`, RURAL_REMOTE_TERR = `Rural Remote Territory`, PUBLIC = Public, PRIV_NON = `Private nonprofit`, PRIV_4_PROF = `Private for-profit`, TUITFTE, INEXPFTE, AVGFACSAL, COSTT4_A, NOT_CLASSIFIED = `Not classified`, CERT_DEG = `Predominantly certificate-degree granting`, ASOC_DEG = `Predominantly associate's-degree granting`, BACH_DEG = `Predominantly bachelor's-degree granting`, NPT4, NUM4, SAT_AVG_ALL, PCIPSTEM, UGDS, UGDS_WHITE, UGDS_BLACK, UGDS_HISP, UGDS_ASIAN, UGDS_NRA, UGDS_WOMEN, NON_MINORITY = `Non-minority`, SINGLE_MINORITY = `Single-Minority`, DOUBLE_MINORITY = `Double-Minority`, PPTUG_EF, PFTFTUG1_EF, C150, LOAN_EVER, PELL_EVER, PCTPELL, RPY_1YR_RT, MD_EARN_WNE_P10)
```
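The chunk above builds its dummy variables with `spread()` and a custom `dummy()` function. For comparison, here is a minimal base R sketch of the same idea using `model.matrix()`, which is a different technique from the one in the analysis. The `control` vector is toy data made up for illustration.

```r
# Toy factor, made up for illustration
control <- factor(c("Public", "Private nonprofit", "Public", "Private for-profit"))

# "~ 0 + control" drops the intercept, so every level gets its own 0/1 column
dummies <- model.matrix(~ 0 + control)
colnames(dummies) <- levels(control)  # replace the "control<level>" prefixes

dummies
rowSums(dummies)  # each row has exactly one 1, in the column for its own level
```

Because each row's dummies sum to 1, a full set of dummies is collinear with an intercept. That is why a regression needs either the intercept removed or one level dropped per categorical variable.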
# Analysis
To analyze my data I used a value-added approach, as detailed in this [report]() at the beginning of this report. First, I fit a multiple linear regression (MLR) model by least squares, regressing median earnings after 10 years (MD_EARN_WNE_P10) on all of the predictor variables.
```{r MLR Model 1, message = FALSE, warning = FALSE, tidy = TRUE}
attach(CS2012) #Attaching the dataset so we can easily call on variables
MLR <- lm(MD_EARN_WNE_P10 ~ 0 + NEW_ENGLAND + MID_EAST + GREAT_LAKES + PLAINS + SOUTH_EAST + SOUTH_WEST + ROCKY_MOUNT + FAR_WEST + OUTLYING + LARGE_CITY + MID_SIZE_CITY + SMALL_CITY + LARGE_SUB + MID_SIZE_SUB + SMALL_SUB + FRINGE_TOWN + DIST_TOWN + REMOTE_TOWN + RURAL_FRINGE_TERR + RURAL_DISTANT_TERR + RURAL_REMOTE_TERR + PUBLIC + PRIV_NON + PRIV_4_PROF + TUITFTE + INEXPFTE + AVGFACSAL + COSTT4_A + NOT_CLASSIFIED + CERT_DEG + ASOC_DEG + BACH_DEG + NPT4 + NUM4 + SAT_AVG_ALL + PCIPSTEM + UGDS + UGDS_WHITE + UGDS_BLACK + UGDS_HISP + UGDS_ASIAN + UGDS_NRA + UGDS_WOMEN + NON_MINORITY + SINGLE_MINORITY + DOUBLE_MINORITY + PPTUG_EF + PFTFTUG1_EF + C150 + LOAN_EVER + PELL_EVER + PCTPELL + RPY_1YR_RT, data = CS2012) #The leading 0 removes the intercept
MLRfit <- summary(MLR)
MLRfit #Output the regression coefficients for all the predictors
```
## Variable selection
Although some of my variables were significant, many were not. However, the overall F-test (F-statistic of `r MLRfit$fstatistic[1]`) was significant at the 95% confidence level, indicating that at least some of the variables are related to median earnings after 10 years. Consequently, I carried out backward selection until all my variables were significant. Backward selection is a variable selection approach where, "We start with all variables in the model, and remove the variable with the largest p-value, that is, the variable that is the least statistically significant. The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed. This procedure continues until a stopping rule is reached" (James, Witten, Hastie, & Tibshirani, 2015). In this case, the stopping rule was that all remaining variables must be significant at the 95% confidence level or better. I also removed any dummy variables causing singularities. The final MLR model reached was as follows:
```{r MLR Final Model, message = FALSE, warning = FALSE, tidy = TRUE}
MLR2 <- lm(MD_EARN_WNE_P10 ~ 0 + NEW_ENGLAND + MID_EAST + GREAT_LAKES + PLAINS + SOUTH_EAST + SOUTH_WEST + ROCKY_MOUNT + FAR_WEST + OUTLYING + LARGE_CITY + MID_SIZE_CITY + SMALL_CITY + LARGE_SUB + MID_SIZE_SUB + SMALL_SUB + FRINGE_TOWN + DIST_TOWN + REMOTE_TOWN + RURAL_FRINGE_TERR + RURAL_DISTANT_TERR + RURAL_REMOTE_TERR + PUBLIC + PRIV_NON + TUITFTE + INEXPFTE + AVGFACSAL + CERT_DEG + SAT_AVG_ALL + PCIPSTEM + UGDS + UGDS_WHITE + UGDS_ASIAN + UGDS_NRA + UGDS_WOMEN + PPTUG_EF + PFTFTUG1_EF + C150 + LOAN_EVER + PELL_EVER + RPY_1YR_RT, data = CS2012) #This model was fit with only these variables.
MLR2fit <- summary(MLR2)
MLR2fit #Output the regression coefficients for all the predictors.
```
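The backward selection described above was carried out by hand. As a rough illustration of the same stopping rule, the sketch below automates it on R's built-in `mtcars` data. The helper `backward_select()` is hypothetical, assumes numeric predictors only, and does not handle the singularities mentioned above.

```r
# Hypothetical helper: backward elimination by p-value.
# Assumes all predictors are numeric, so coefficient names equal column names.
backward_select <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]
    pvals <- pvals[names(pvals) != "(Intercept)"]
    # Stop once every remaining predictor is significant (or only one is left)
    if (max(pvals) <= alpha || length(predictors) == 1) break
    # Otherwise drop the least significant predictor and refit
    predictors <- setdiff(predictors, names(which.max(pvals)))
  }
  fit
}

final <- backward_select(mtcars[, c("mpg", "wt", "hp", "qsec", "drat")], "mpg")
names(coef(final))  # intercept plus the predictors that survived
```

Unlike the models in this report, the sketch keeps the intercept, which makes the p-value bookkeeping simpler on toy data.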
## Model Explanation
The final model was chosen because it had several advantages over the first model. First, as a result of variable selection it was more parsimonious, with 40 predictors where the first model had 53. All variables were significant at the 95% confidence level or better, and in fact most were significant at the 99.999% confidence level. The model itself had an adjusted R-squared of `r MLR2fit$adj.r.squared`. Lastly, the F-statistic remained highly significant, indicating that the predictors are jointly related to median earnings. Although a full discussion of the model results is not necessary here, the impact of these variables on alumni median earnings after 10 years was roughly as expected.
I then predicted median earnings for each institution using the final model.
```{r Prediction, message = FALSE, warning = FALSE, tidy = TRUE}
Predictions <- predict(MLR2, CS2012, interval = "prediction") #Used the predict() function to get the predicted value of median earnings and the lower and upper 95% prediction intervals for each institution and saved it in an object. The prediction intervals indicate that 95% of intervals of this form will contain the true value of median earnings for the institution.
#Kept only the fitted values from the prediction matrix and renamed the column to PRED_EARN.
CS2012_pred <- as_tibble(as.data.frame(Predictions)) %>% select(PRED_EARN = fit)
#Combined the predictions with the CS2012 duplicate dataset and, to aid ease of manipulation, converted the result into a tibble.
CS2012_pred <- as_tibble(cbind(CS2012dup, CS2012_pred))
CS2012_pred <- CS2012_pred %>% mutate(VALUE_ADDED = MD_EARN_WNE_P10 - PRED_EARN) #Created a value-added variable that shows how much an institution over or underperformed.
CS2012_pred <- CS2012_pred %>% arrange(desc(VALUE_ADDED)) #Reordered the list using the VALUE_ADDED variable.
CS2012_pred$PERFORMANCE <- ""
CS2012_pred$PERFORMANCE[which(CS2012_pred$VALUE_ADDED > 0)] <- "Overperformed"
CS2012_pred$PERFORMANCE[which(CS2012_pred$VALUE_ADDED < 0)] <- "Underperformed"
```
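The value-added measure computed above is actual minus predicted earnings. When the predictions are made on the same data used to fit the model, this is the model's residual. A minimal sketch on simulated data; the `toy` data frame and its coefficients are made up for illustration.

```r
# Simulated data, made up for illustration
set.seed(42)
toy <- data.frame(sat = rnorm(50, mean = 1000, sd = 100))
toy$earnings <- 20000 + 25 * toy$sat + rnorm(50, sd = 3000)

fit <- lm(earnings ~ sat, data = toy)
pred <- predict(fit, newdata = toy, interval = "prediction")  # columns: fit, lwr, upr

toy$PRED_EARN <- pred[, "fit"]
toy$VALUE_ADDED <- toy$earnings - toy$PRED_EARN  # actual minus predicted
toy$PERFORMANCE <- ifelse(toy$VALUE_ADDED > 0, "Overperformed", "Underperformed")

# Value added on the training data coincides with the model residuals
all.equal(unname(toy$VALUE_ADDED), unname(residuals(fit)))
```

This equivalence only holds if the prediction rows are in the same order as the rows used to fit the model, so it can serve as a quick alignment check after combining tables.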
## Segmentation of Data
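I decided to segment my data into top, middle, and bottom groups by value added. The slice indices below assume 2,902 rows and split them into three consecutive blocks of roughly 967 institutions each; the object names TOP100, MID100, and BOTTOM100 are labels only.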
```{r Data Segmentation, message = FALSE, warning = FALSE, tidy = TRUE}
TOP100 <- CS2012_pred %>% slice(1:967) #Selected rows 1 to 967, the institutions with the highest value added
MID100 <- CS2012_pred %>% slice(968:1934) #Selected rows 968 to 1934, the middle group
BOTTOM100 <- CS2012_pred %>% slice(1935:2902) #Selected rows 1935 to 2902, the institutions with the lowest value added
```
I then summarized each segment into a simple tabulation of data that brought out the key insights.
```{r Tabulation of Segmentation, message = FALSE, warning = FALSE, tidy = TRUE}
#Simplified the table to show institution names and value-added
TOP100 <- TOP100 %>% select(INSTNM, CONTROL, REGION, PREDDEG, LOCALE, MINORITY, VALUE_ADDED, PERFORMANCE)
MID100 <- MID100 %>% select(INSTNM, CONTROL, REGION, PREDDEG, LOCALE, MINORITY, VALUE_ADDED, PERFORMANCE)
BOTTOM100 <- BOTTOM100 %>% select(INSTNM, CONTROL, REGION, PREDDEG, LOCALE, MINORITY, VALUE_ADDED, PERFORMANCE)
BOTTOM100 <- BOTTOM100 %>% arrange(VALUE_ADDED) #Reordered the list using the VALUE_ADDED variable to see worst schools.
```
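The segmentation above relies on hard-coded row indices. As a minimal base R sketch, a ranked table can instead be split into equal-sized groups computed from its row count. The `ranked` data frame and the group labels are made up for illustration.

```r
# Toy ranked table, made up for illustration
set.seed(7)
ranked <- data.frame(id = 1:10, VALUE_ADDED = rnorm(10))
ranked <- ranked[order(-ranked$VALUE_ADDED), ]  # highest value added first

n <- nrow(ranked)
# Map row positions 1..n onto group numbers 1..3 of near-equal size
group <- factor(ceiling(3 * seq_len(n) / n), levels = 1:3,
                labels = c("top", "middle", "bottom"))
segments <- split(ranked, group)

sapply(segments, nrow)  # group sizes
```

Because the group boundaries are derived from `nrow()`, the split stays valid if rows are added or removed upstream.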
**INSIGHT: While there are no clear differences across segments in the control of the institution or the predominant degree awarded, the same is not true of region and locale.**
# Visualization
Region and locale are therefore two easily understood variables that can be used to differentiate institutions. Below I demonstrate this for the top 100, middle 100, and bottom 100 universities.
```{r Vizualization 1, message = FALSE, warning =FALSE, tidy = TRUE}
TOP100 %>% ggplot(aes(x = LOCALE, fill = LOCALE)) + geom_bar() + xlab("Type of Locale") + ylab("The number of top 100 Institutions") + ggtitle("Barplot of the number of the top 100 institutions in each type of locale") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) #Creating a barplot of the number of TOP100 institutions in each type of LOCALE
TOP100 %>% ggplot(aes(x = LOCALE, fill = LOCALE)) + geom_bar() + facet_wrap(~REGION) + xlab("Type of Locale") + ylab("The number of top 100 Institutions") + ggtitle("Barplot of the number of top 100 institutions in each type of locale faceted by region") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) #Creating a barplot of the number of TOP100 institutions in each type of LOCALE faceted by REGION
```
**The top 100 institutions seem to cluster generally in large suburbs and cities. Additionally, they seem to be most prevalent in the Mid East, New England, and the Southeast.**
```{r Vizualization 2, message = FALSE, warning =FALSE, tidy = TRUE}
MID100 %>% ggplot(aes(x = LOCALE, fill = LOCALE)) + geom_bar() + xlab("Type of Locale") + ylab("The Number of mid 100 Institutions") + ggtitle("Barplot of the number of mid 100 institutions in each type of locale") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) #Creating a barplot of the number of MID100 institutions in each type of LOCALE
MID100 %>% ggplot(aes(x = LOCALE, fill = LOCALE)) + geom_bar() + facet_wrap(~REGION) + xlab("Type of Locale") + ylab("The Number of mid 100 Institutions") + ggtitle("Barplot of the number of middle 100 institutions in each type of locale faceted by region") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) #Creating a barplot of the number of MID100 institutions in each type of LOCALE faceted by REGION
```
**The middle 100 institutions seem to cluster generally in distant towns, large suburbs, and cities. Additionally, they seem to be most prevalent in the Great Lakes, Plains, and the Southeast.**
```{r Vizualization 3, message = FALSE, warning =FALSE, tidy = TRUE}
BOTTOM100 %>% ggplot(aes(x = LOCALE, fill = LOCALE)) + geom_bar() + xlab("Type of Locale") + ylab("The Number of bottom 100 Institutions") + ggtitle("Barplot of the number of bottom 100 institutions in each type of locale") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) #Creating a barplot of the number of BOTTOM100 institutions in each type of LOCALE
BOTTOM100 %>% ggplot(aes(x = LOCALE, fill = LOCALE)) + geom_bar() + facet_wrap(~REGION) + xlab("Type of Locale") + ylab("The Number of bottom 100 Institutions") + ggtitle("Barplot of the number of bottom 100 institutions in each type of locale faceted by region") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) #Creating a barplot of the number of BOTTOM100 institutions in each type of LOCALE faceted by REGION
```
**The bottom 100 institutions seem to cluster generally in large cities, distant towns, and large suburbs. Additionally, they seem to be most prevalent in the Far West and the Mid East.**