---
title: "Factors Most Important to University Ranking"
subtitle: "Report"
format: html
editor: visual
execute:
  echo: false
  message: false
  warning: false
editor_options:
  chunk_output_type: console
---
```{r}
#| label: load-pkgs
#| message: false
library(tidyverse)
library(corrplot)
library(tidymodels)
library(scales)
library(ggthemes)
library(gridExtra)
library(knitr)
```
## Introduction
### Research Background
Every year, millions of students apply to colleges across the United States, and many of them use college ranking lists from sources such as Niche.com to help them decide where to apply. In recent years, these lists have been heavily criticized for encouraging universities to prioritize certain metrics and manipulate the system to raise their ranks. It is important for students to know where these rankings come from and what they actually measure. In this project, we will explore how influential different metrics are in determining a college's ranking. For reference, when we say a "low rank," we refer to schools with a low numerical rank, such as #1. When we say a "high rank," we refer to schools with a high numerical rank, such as #500.
### Data
```{r}
#| label: import
niche_data_500 <- read_csv("data/niche_data_500.csv")
us_dep_of_ed <- read_csv("data/us_dep_of_ed.csv", na = c("NULL", "PrivacySuppressed"))
```
For our analysis, we joined two datasets:
**Data Set #1 - Niche:** Niche's "2023 Best Colleges in America" list aggregates data from a variety of sources, including the US Department of Education and reviews from students and alumni, and is updated monthly. The Niche data was scraped by Maia on October 17-19, 2022. There are `r nrow(niche_data_500)` observations, representing the top 500 schools in the United States. Each observation has two variables: `college` (institution name) and `rank`.
**Data Set #2 - US Department of Education:** The second data set comes from the US Department of Education's College Scorecard, which is an exhaustive summary of characteristics and statistics for all colleges and universities in the United States. The College Scorecard is updated by the Education Department as it collects new data, but most of the data comes from the 2020-2021 school year. Data used in the scorecard comes from data reported by the institutions, data on federal financial aid, data from taxes, and data from other federal agencies. There were 2,989 variables in the original data set, many of which we don't need to answer our question, so we selected `r ncol(us_dep_of_ed)` variables before importing into RStudio. We further narrowed to the 31 variables we thought could have the most influence on rank. There are `r nrow(us_dep_of_ed)` observations, representing all US colleges and universities.
**Variable Summary:**
- `college`: Institution name (Categorical)
- `rank`: Rank of school on Niche list (Quantitative)
- `REGION`: US geographic region (C)
- `ACCREDAGENCY`: Accrediting agency for institution (C)
- `CONTROL`: Public, Private nonprofit, or Private for-profit (C)
- `CCBASIC`: Carnegie Classification (basic) (C)
- `ADM_RATE`: Admission rate (Q)
- `UGDS`: Enrollment of undergraduate certificate/degree-seeking students (Q)
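- `UGDS_WHITE`, `UGDS_BLACK`, `UGDS_HISP`, `UGDS_ASIAN`, `UGDS_AIAN`, `UGDS_NHPI`, `UGDS_2MOR`, `UGDS_NRA`, and `UGDS_UNKN`: Share of undergraduate enrollment in each racial/ethnic group, including non-resident aliens (`UGDS_NRA`) and students whose race is unknown (`UGDS_UNKN`) (Q)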
- `NPT4_PUB`, `NPT4_PRIV`: Average net price for Title IV institutions (public and private) (Q)
- `COSTT4_A`, `COSTT4_P`: Average cost of attendance (academic- and program-year institutions) (Q)
- `AVGFACSAL`: Average faculty salary (Q)
- `PCTPELL`: Percentage of undergraduates who receive a Pell Grant (Q)
- `C150_4`: Completion rate for first-time, full-time students at four-year institutions (Q)
- `AGE_ENTRY`: Average age of entry (Q)
- `FEMALE`: Share of female students (Q)
- `MARRIED`: Share of married students (Q)
- `FIRST_GEN`: Share of first-generation students (Q)
- `FAMINC`, `MD_FAMINC`: Average and median family income (Q)
- `ENDOWBEGIN`: Value of school's endowment at the beginning of the fiscal year (Q)
- `SAT_AVG`: Average SAT equivalent score of students admitted (Q)
- `ACTCMMID`: Midpoint of the ACT cumulative score (Q)
#### Data Preparation
1. To get the data, we scraped the rankings from Niche.com and downloaded data from the US Department of Education. These steps were done in `niche-scrape.R`.
2. Naming discrepancies between the two data sets were fixed to ensure a successful merge. `University of South Florida-Sarasota-Manatee` and `University of South Florida-St. Petersburg` were removed because they did not exist in both data sets.
```{r}
#| label: fix-college-names
us_dep_of_ed <- us_dep_of_ed |>
mutate(
INSTNM = case_when(
INSTNM == "Columbia University in the City of New York" ~
"Columbia University",
INSTNM == "Washington University in St Louis" ~
"Washington University in St. Louis",
INSTNM == "University of California-Los Angeles" ~
"University of California - Los Angeles",
INSTNM == "University of Michigan-Ann Arbor" ~
"University of Michigan - Ann Arbor",
INSTNM == "Georgia Institute of Technology-Main Campus" ~
"Georgia Institute of Technology",
INSTNM == "University of Virginia-Main Campus" ~
"University of Virginia",
INSTNM == "United States Military Academy" ~
"United States Military Academy at West Point",
INSTNM == "The University of Texas at Austin" ~
"University of Texas - Austin",
INSTNM == "University of California-Berkeley" ~
"University of California - Berkeley",
INSTNM == "University of California-Irvine" ~
"University of California - Irvine",
INSTNM == "University of California-San Diego" ~
"University of California - San Diego",
INSTNM == "University of California-Davis" ~
"University of California - Davis",
INSTNM == "University of California-Santa Barbara" ~
"University of California - Santa Barbara",
INSTNM == "University of California-Santa Cruz" ~
"University of California - Santa Cruz",
INSTNM == "University of California-Riverside" ~
"University of California - Riverside",
INSTNM == "University of North Carolina Wilmington" ~
"University of North Carolina - Wilmington",
INSTNM == "University of North Carolina at Greensboro" ~
"University of North Carolina - Greensboro",
INSTNM == "Albany College of Pharmacy and Health Sciences" ~
"Albany College of Pharmacy & Health Sciences",
INSTNM == "Arizona State University Campus Immersion" ~
"Arizona State University",
INSTNM == "Arizona State University-Downtown Phoenix" ~
"Arizona State University - Downtown Phoenix Campus",
INSTNM == "Augustana College" ~
"Augustana College - Illinois",
UNITID == "173160" ~
"Bethel University - Minnesota",
INSTNM == "Binghamton University" ~
"Binghamton University, SUNY",
INSTNM == "Bowling Green State University-Main Campus" ~
"Bowling Green State University",
INSTNM == "California Polytechnic State University-San Luis Obispo" ~
"California Polytechnic State University (Cal Poly) - San Luis Obispo",
INSTNM == "California State University-Fullerton" ~
"California State University - Fullerton",
INSTNM == "California State University-Long Beach" ~
"California State University - Long Beach",
INSTNM == "The College of Wooster" ~
"College of Wooster",
INSTNM == "Colorado State University-Fort Collins" ~
"Colorado State University",
INSTNM == "Concordia University-Wisconsin" ~
"Concordia University - Wisconsin",
INSTNM == "CUNY Bernard M Baruch College" ~
"CUNY Baruch College",
INSTNM == "CUNY City College" ~
"CUNY City College of New York",
INSTNM == "D'Youville College" ~
"D'Youville",
INSTNM == "Eastern New Mexico University-Main Campus" ~
"Eastern New Mexico University",
INSTNM == "Embry-Riddle Aeronautical University-Daytona Beach" ~
"Embry-Riddle Aeronautical University - Daytona Beach",
INSTNM == "Embry-Riddle Aeronautical University-Worldwide" ~
"Embry-Riddle Aeronautical University - Worldwide",
INSTNM == "Florida Agricultural and Mechanical University" ~
"Florida A&M University",
INSTNM == "Franklin and Marshall College" ~
"Franklin & Marshall College",
INSTNM == "Hobart William Smith Colleges" ~
"Hobart and William Smith Colleges",
INSTNM == "Indiana University-Bloomington" ~
"Indiana University - Bloomington",
INSTNM == "Indiana University-East" ~
"Indiana University - East",
INSTNM == "Indiana University-Purdue University-Indianapolis" ~
"Indiana University-Purdue University - Indianapolis (IUPUI)",
INSTNM == "Indiana Wesleyan University-Marion" ~
"Indiana Wesleyan University",
INSTNM == "Keiser University-Ft Lauderdale" ~
"Keiser University - Fort Lauderdale",
INSTNM == "Kent State University at Kent" ~
"Kent State University",
INSTNM == "Louisiana State University and Agricultural & Mechanical College" ~
"Louisiana State University",
UNITID == "151786" ~
"Marian University Indianapolis",
INSTNM == "Maryville University of Saint Louis" ~
"Maryville University",
INSTNM == "Metropolitan State University" ~
"Metropolitan State University - Minnesota",
INSTNM == "Miami University-Oxford" ~
"Miami University",
INSTNM == "Minnesota State University-Mankato" ~
"Minnesota State University, Mankato",
INSTNM == "Molloy College" ~
"Molloy University",
INSTNM == "Monroe College" ~
"Monroe College - Bronx/New Rochelle",
INSTNM == "Mount Saint Mary's University" ~
"Mount Saint Mary's University Los Angeles",
INSTNM == "New Mexico State University-Main Campus" ~
"New Mexico State University",
INSTNM == "New Mexico Institute of Mining and Technology" ~
"New Mexico Tech",
INSTNM == "North Carolina State University at Raleigh" ~
"North Carolina State University",
INSTNM == "North Dakota State University-Main Campus" ~
"North Dakota State University",
UNITID == "154101" ~
"Northwestern College - Iowa",
INSTNM == "Ohio University-Main Campus" ~
"Ohio University",
INSTNM == "Oklahoma State University-Main Campus" ~
"Oklahoma State University",
INSTNM == "Pacific University" ~
"Pacific University Oregon",
INSTNM == "The Pennsylvania State University" ~
"Penn State",
INSTNM == "Pennsylvania State University-Penn State Fayette- Eberly" ~
"Penn State Fayette, The Eberly Campus",
INSTNM == "Pennsylvania State University-Penn State York" ~
"Penn State York",
INSTNM == "Purdue University-Main Campus" ~
"Purdue University",
INSTNM == "Rutgers University-New Brunswick" ~
"Rutgers University - New Brunswick",
INSTNM == "Rutgers University-Newark" ~
"Rutgers University - Newark",
INSTNM == "The University of the South" ~
"Sewanee - The University of the South",
INSTNM == "South Dakota School of Mines and Technology" ~
"South Dakota School of Mines & Technology",
INSTNM == "St Bonaventure University" ~
"St. Bonaventure University",
INSTNM == "Saint John Fisher College" ~
"St. John Fisher University",
INSTNM == "St. Joseph's University-New York" ~
"St. Joseph's University - New York",
INSTNM == "St Lawrence University" ~
"St. Lawrence University",
INSTNM == "St Olaf College" ~
"St. Olaf College",
INSTNM == "Stanbridge University" ~
"Stanbridge University - Orange County",
INSTNM == "Stephen F Austin State University" ~
"Stephen F. Austin State University",
INSTNM == "Stony Brook University" ~
"Stony Brook University, SUNY",
INSTNM == "SUNY College of Environmental Science and Forestry" ~
"SUNY College of Environmental Science & Forestry",
INSTNM == "Farmingdale State College" ~
"SUNY Farmingdale State College",
INSTNM == "State University of New York at New Paltz" ~
"SUNY New Paltz",
INSTNM == "Texas A & M University-College Station" ~
"Texas A&M University",
INSTNM == "American Musical and Dramatic Academy" ~
"The American Musical and Dramatic Academy (AMDA) - New York",
INSTNM == "The College of Saint Scholastica" ~
"The College of St. Scholastica",
INSTNM == "Cooper Union for the Advancement of Science and Art" ~
"The Cooper Union for the Advancement of Science and Art",
INSTNM == "The Master's University and Seminary" ~
"The Master's University",
INSTNM == "Ohio State University-Main Campus" ~
"The Ohio State University",
INSTNM == "University of Alabama in Huntsville" ~
"The University of Alabama in Huntsville",
INSTNM == "University of Baltimore" ~
"The University of Baltimore",
INSTNM == "University of Tulsa" ~
"The University of Tulsa",
INSTNM == "University of Virginia's College at Wise" ~
"The University of Virginia's College at Wise",
INSTNM == "Trinity College" ~
"Trinity College - Connecticut",
INSTNM == "Tulane University of Louisiana" ~
"Tulane University",
UNITID == "196866" ~
"Union College - New York",
INSTNM == "University at Buffalo" ~
"University at Buffalo, SUNY",
INSTNM == "University of Alabama at Birmingham" ~
"University of Alabama - Birmingham",
INSTNM == "University of Cincinnati-Main Campus" ~
"University of Cincinnati",
INSTNM == "University of Colorado Denver/Anschutz Medical Campus" ~
"University of Colorado Denver",
INSTNM == "University of Maryland-College Park" ~
"University of Maryland - College Park",
INSTNM == "University of Maryland-Baltimore County" ~
"University of Maryland, Baltimore County",
INSTNM == "University of Massachusetts-Amherst" ~
"University of Massachusetts - Amherst",
INSTNM == "University of Massachusetts-Lowell" ~
"University of Massachusetts Lowell",
INSTNM == "University of Michigan-Dearborn" ~
"University of Michigan - Dearborn",
INSTNM == "University of Minnesota-Crookston" ~
"University of Minnesota Crookston",
INSTNM == "University of Minnesota-Duluth" ~
"University of Minnesota Duluth",
INSTNM == "University of Minnesota-Twin Cities" ~
"University of Minnesota Twin Cities",
INSTNM == "University of Missouri-Columbia" ~
"University of Missouri",
INSTNM == "Embry-Riddle Aeronautical University-Prescott" ~
"Embry-Riddle Aeronautical University - Prescott",
INSTNM == "South Dakota State University" ~
"South Dakota State University",
INSTNM == "St Catherine University" ~
"St. Catherine University",
INSTNM == "University of Missouri-Kansas City" ~
"University of Missouri - Kansas City",
INSTNM == "University of Missouri-St Louis" ~
"University of Missouri - St. Louis",
INSTNM == "University of Nebraska-Lincoln" ~
"University of Nebraska - Lincoln",
INSTNM == "University of Nevada-Reno" ~
"University of Nevada - Reno",
INSTNM == "University of New Hampshire-Main Campus" ~
"University of New Hampshire",
INSTNM == "University of Oklahoma-Norman Campus" ~
"University of Oklahoma",
INSTNM == "University of Pittsburgh-Pittsburgh Campus" ~
"University of Pittsburgh",
INSTNM == "University of South Carolina-Columbia" ~
"University of South Carolina",
INSTNM == "University of St Francis" ~
"University of St. Francis - Illinois",
UNITID == "174914" ~
"University of St. Thomas - Minnesota",
UNITID == "227863" ~
"University of St. Thomas - Texas",
INSTNM == "The University of Tennessee-Knoxville" ~
"University of Tennessee",
INSTNM == "The University of Tennessee-Martin" ~
"University of Tennessee at Martin",
INSTNM == "The University of Texas at Arlington" ~
"University of Texas - Arlington",
INSTNM == "The University of Texas at Dallas" ~
"University of Texas - Dallas",
INSTNM == "The University of Texas Permian Basin" ~
"University of Texas - Permian Basin",
INSTNM == "The University of Texas Rio Grande Valley" ~
"University of Texas - Rio Grande Valley",
INSTNM == "The University of Texas at Tyler" ~
"University of Texas - Tyler",
INSTNM == "University of Washington-Seattle Campus" ~
"University of Washington",
INSTNM == "University of Washington-Bothell Campus" ~
"University of Washington - Bothell",
INSTNM == "University of Washington-Tacoma Campus" ~
"University of Washington - Tacoma",
INSTNM == "The University of West Florida" ~
"University of West Florida",
INSTNM == "University of Wisconsin-Madison" ~
"University of Wisconsin",
INSTNM == "University of Wisconsin-La Crosse" ~
"University of Wisconsin - La Crosse",
INSTNM == "Virginia Polytechnic Institute and State University" ~
"Virginia Tech",
INSTNM == "West Texas A & M University" ~
"West Texas A&M University",
UNITID == "230807" ~
"Westminster College - Utah",
INSTNM == "Wheaton College" ~
"Wheaton College - Illinois",
INSTNM == "Wheaton College (Massachusetts)" ~
"Wheaton College - Massachusetts",
TRUE ~ INSTNM
)
)
```
3. The data sets were joined using the `left_join()` function. The resulting data set has 498 observations with 33 variables.
```{r}
#| label: join-datasets
colleges <- niche_data_500 |>
left_join(us_dep_of_ed, by = c("college" = "INSTNM")) |>
filter(rank != 141) |>
filter(rank != 181)
```
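A quick way to find naming discrepancies like those fixed in step 2 is to list the ranked schools that have no match in the other table. The sketch below uses `anti_join()` on two small made-up tables (`niche_toy` and `scorecard_toy` are hypothetical stand-ins, and the admission rates are invented), so it illustrates the idea rather than reproducing our actual check.

```r
library(dplyr)

# Hypothetical toy tables standing in for the Niche list and the Scorecard data
niche_toy <- tibble(
  college = c("Columbia University", "Purdue University", "Virginia Tech"),
  rank = c(1, 2, 3)
)
scorecard_toy <- tibble(
  INSTNM = c("Columbia University", "Purdue University-Main Campus"),
  ADM_RATE = c(0.04, 0.69) # invented values
)

# Rows of the ranking list whose name has no exact match in the Scorecard table
unmatched <- niche_toy |>
  anti_join(scorecard_toy, by = c("college" = "INSTNM"))

unmatched$college
```

Any name returned here would need either a renaming rule or removal before the join; an empty result means every ranked school found a match.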
```{r}
#| label: select-variables
chosen_variables <- c(
"college",
"rank",
"REGION",
"ACCREDAGENCY",
"CONTROL",
"CCBASIC",
"ADM_RATE",
"UGDS",
"UGDS_WHITE",
"UGDS_BLACK",
"UGDS_HISP",
"UGDS_ASIAN",
"UGDS_AIAN",
"UGDS_NHPI",
"UGDS_2MOR",
"UGDS_NRA",
"UGDS_UNKN",
"NPT4_PUB",
"NPT4_PRIV",
"COSTT4_A",
"COSTT4_P",
"AVGFACSAL",
"PCTPELL",
"C150_4",
"AGE_ENTRY",
"FEMALE",
"MARRIED",
"FIRST_GEN",
"FAMINC",
"MD_FAMINC",
"ENDOWBEGIN",
"SAT_AVG",
"ACTCMMID"
)
data <- colleges |>
  dplyr::select(all_of(chosen_variables)) # all_of() avoids ambiguity when selecting with an external vector
```
4. Categorical variables whose levels were listed as numbers were updated to reflect their interpretable levels.
```{r}
#| label: clean-categorical-variables
data <- data |>
mutate(
REGION = as.character(REGION),
REGION = case_when(
REGION == 1 ~ "New England",
REGION == 2 ~ "Mid East",
REGION == 3 ~ "Great Lakes",
REGION == 4 ~ "Plains",
REGION == 5 ~ "Southeast",
REGION == 6 ~ "Southwest",
REGION == 7 ~ "Rocky Mountains",
REGION == 8 ~ "Far West",
REGION == 9 ~ "Outlying Areas"
),
CONTROL = as.character(CONTROL),
CONTROL = case_when(
CONTROL == 1 ~ "Public",
CONTROL == 2 ~ "Private, Non-profit",
CONTROL == 3 ~ "Private, For-profit"
),
CCBASIC = as.character(CCBASIC),
CCBASIC = case_when(
CCBASIC == -2 ~ "Not applicable",
CCBASIC == 0 ~ "Not classified",
CCBASIC == 1 ~ "Associate's Colleges: High Transfer-High Traditional",
CCBASIC == 2 ~ "Associate's Colleges: High Transfer-Mixed Traditional/Nontraditional",
CCBASIC == 3 ~ "Associate's Colleges: High Transfer-High Nontraditional",
CCBASIC == 4 ~ "Associate's Colleges: Mixed Transfer/Career & Technical-High Traditional",
CCBASIC == 5 ~ "Associate's Colleges: Mixed Transfer/Career & Technical-Mixed Traditional/Nontraditional",
CCBASIC == 6 ~ "Associate's Colleges: Mixed Transfer/Career & Technical-High Nontraditional",
CCBASIC == 7 ~ "Associate's Colleges: High Career & Technical-High Traditional",
CCBASIC == 8 ~ "Associate's Colleges: High Career & Technical-Mixed Traditional/Nontraditional",
CCBASIC == 9 ~ "Associate's Colleges: High Career & Technical-High Nontraditional",
CCBASIC == 10 ~ "Special Focus Two-Year: Health Professions",
CCBASIC == 11 ~ "Special Focus Two-Year: Technical Professions",
CCBASIC == 12 ~ "Special Focus Two-Year: Arts & Design",
CCBASIC == 13 ~ "Special Focus Two-Year: Other Fields",
CCBASIC == 14 ~ "Baccalaureate/Associate's Colleges: Associate's Dominant",
CCBASIC == 15 ~ "Doctoral Universities: Very High Research Activity",
CCBASIC == 16 ~ "Doctoral Universities: High Research Activity",
CCBASIC == 17 ~ "Doctoral/Professional Universities",
CCBASIC == 18 ~ "Master's Colleges & Universities: Larger Programs",
CCBASIC == 19 ~ "Master's Colleges & Universities: Medium Programs",
CCBASIC == 20 ~ "Master's Colleges & Universities: Small Programs",
CCBASIC == 21 ~ "Baccalaureate Colleges: Arts & Sciences Focus",
CCBASIC == 22 ~ "Baccalaureate Colleges: Diverse Fields",
CCBASIC == 23 ~ "Baccalaureate/Associate's Colleges: Mixed Baccalaureate/Associate's",
CCBASIC == 24 ~ "Special Focus Four-Year: Faith-Related Institutions",
CCBASIC == 25 ~ "Special Focus Four-Year: Medical Schools & Centers",
CCBASIC == 26 ~ "Special Focus Four-Year: Other Health Professions Schools",
CCBASIC == 27 ~ "Special Focus Four-Year: Engineering Schools",
CCBASIC == 28 ~ "Special Focus Four-Year: Other Technology-Related Schools",
CCBASIC == 29 ~ "Special Focus Four-Year: Business & Management Schools",
CCBASIC == 30 ~ "Special Focus Four-Year: Arts, Music & Design Schools",
CCBASIC == 31 ~ "Special Focus Four-Year: Law Schools",
CCBASIC == 32 ~ "Special Focus Four-Year: Other Special Focus Institutions",
CCBASIC == 33 ~ "Tribal Colleges"
)
)
```
5. All of the numerical variables are on different scales, so we created a scaled version of the numerical explanatory variables, with mean 0 and standard deviation 1.
```{r}
#| label: scale-numerical-variables
scaled_continuous_numeric_variables <- data |>
select_if(is.numeric) |>
dplyr::select(-rank) |>
scale() |>
data.frame()
other_variables <- data |>
dplyr::select(rank | REGION | CONTROL | CCBASIC | !where(is.numeric))
scaled_data <- cbind(other_variables, scaled_continuous_numeric_variables) |>
relocate(college)
```
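As a small illustration of what `scale()` does in step 5, the sketch below standardizes a made-up data frame (`toy`, with invented values) and checks the result against a manual z-score. It is a toy check, not part of our pipeline.

```r
# Made-up values on two very different scales
toy <- data.frame(
  adm_rate = c(0.10, 0.30, 0.50, 0.70),
  avg_sal = c(6000, 8000, 10000, 12000)
)

# scale() centers each column at 0 and divides by its standard deviation
scaled_toy <- data.frame(scale(toy))

# Manual z-score for one column, for comparison
manual_z <- (toy$adm_rate - mean(toy$adm_rate)) / sd(toy$adm_rate)

all.equal(scaled_toy$adm_rate, manual_z) # same values as the manual z-score
round(colMeans(scaled_toy), 10)          # column means are 0
apply(scaled_toy, 2, sd)                 # column standard deviations are 1
```

After scaling, a one-unit change in any variable means "one standard deviation," which is what makes slopes comparable across variables.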
#### Exploratory Data Analysis
**Means of Selected Numerical Variables by Rank Group**
```{r}
#| label: exploration-by-rank-group
colleges_levels <- data |>
mutate(
level = case_when(
rank <= 100 ~ "1-100",
rank > 100 & rank <= 200 ~ "101-200",
rank > 200 & rank <= 300 ~ "201-300",
rank > 300 & rank <= 400 ~ "301-400",
rank > 400 & rank <= 500 ~ "401-500",
),
.after = rank
)
colleges_levels |>
group_by(level) |>
summarize(
mean_ADM_Rate = mean(ADM_RATE, na.rm = TRUE),
mean_SAT_AVG = mean(SAT_AVG, na.rm = TRUE),
mean_ACTCMMID = mean(ACTCMMID, na.rm = TRUE),
mean_UGDS_WHITE = mean(UGDS_WHITE, na.rm = TRUE),
mean_UGDS_ASIAN = mean(UGDS_ASIAN, na.rm = TRUE),
mean_FAMINC = mean(FAMINC, na.rm = TRUE)
) |>
rename(
"Interval" = level,
"Mean Admission Rate" = mean_ADM_Rate,
"Mean SAT Average" = mean_SAT_AVG,
"Mean ACT Median" = mean_ACTCMMID,
"Mean % White Students" = mean_UGDS_WHITE,
"Mean % Asian Students" = mean_UGDS_ASIAN,
"Mean Family Income" = mean_FAMINC
) |>
kable()
```
As the rank group gets higher, the mean admission rate increases, and the mean SAT average, ACT median, and family income decrease. Schools ranked 1-100 have a smaller share of White students and a larger share of Asian students than all other rank groups. We also observed a relationship between the categorical variables and rank; that exploratory analysis is included in the appendix.
### Research Question and Hypothesis
**Question:** Which characteristics of a university are most associated with rankings on the Niche College Ranking list? For these characteristics, how does each relate to rank, that is, how do high- and low-ranked schools differ?
**Hypothesis:** We hypothesize that SAT/ACT scores, acceptance rate, and family income will have the strongest association with rank. Because Niche's audience is in large part students applying to college, we believe that Niche prioritizes variables important in the college admissions process. Of these variables, we predict that SAT/ACT score and family income will have a strong negative relationship with rank, and acceptance rate will have a strong positive relationship with rank.
## Methodology
We split the first part of our analysis into two approaches. In the first, we look at the linear relationship between each numerical explanatory variable and college rank using R-squared values. In the second, we build a stepwise regression model relating the variables to college rank. As the variables in the final model will be most important for determining rank, we will use the model results to corroborate our results from the first approach. As we do not compute an R-squared value or another numerical metric for relationships involving a categorical variable, we will use the stepwise regression model to determine whether there is a strong association between those variables and rank. In the second part of our analysis, we will combine the results of the two approaches and characterize the relationship between rank and the variables with the strongest association.
### Approach #1: Individual Numerical Variable Analysis
First, we will create linear regression models between each individual explanatory variable and rank. Then, we will calculate the R-squared value for each respective model, rank the values from highest to lowest, and select the variables with the highest R-squared values.
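The per-variable procedure can be sketched on R's built-in `mtcars` data, which stands in for our college data here (`mpg` plays the role of rank, and the three predictors are arbitrary choices for illustration). The sketch uses base `lm()` rather than the `tidymodels` interface, purely to keep it short.

```r
# Stand-in predictors from the built-in mtcars data
predictors <- c("wt", "hp", "qsec")

# Fit one simple linear regression per predictor and keep its R-squared
r2 <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "mpg"), data = mtcars)
  summary(fit)$r.squared
})

# Rank the variables from strongest to weakest linear association
sort(r2, decreasing = TRUE)

# With a single predictor, R-squared equals the squared Pearson correlation
all.equal(unname(r2["wt"]), cor(mtcars$mpg, mtcars$wt)^2)
```

Because each model has exactly one predictor, ranking by R-squared is equivalent to ranking by the absolute value of the correlation with the response.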
### Approach #2: Stepwise Regression Modeling
A stepwise regression model can manage large numbers of potential predictor variables and fine-tune the model to choose the best predictor variables from the available options. In our case, we have more than 25 variables to examine, so it is crucial to have an automated workflow for model selection.
There are two main steps in this approach. First, we will create a correlation matrix to check the correlation coefficients between variables. If two variables have an absolute value of r greater than 0.8, meaning they are too similar in how they factor into rankings, we will keep only one of them in the model. Then, we will compute the stepwise regression model (using both forward and backward selection) with the `stepAIC()` function from the `MASS` package, which selects models based on the Akaike Information Criterion (AIC). In statistical practice, AIC is used to compare candidate models and determine which one best fits the data, with lower values preferred. For the initial linear regression model, we will include all the valid variables as predictors of the rank variable.
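For reference, AIC is defined as $\mathrm{AIC} = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood, so it rewards fit and penalizes complexity. The sketch below runs `stepAIC()` on the built-in `mtcars` data as a stand-in for our college data; the choice of predictors is arbitrary and only meant to show the mechanics.

```r
library(MASS) # note: MASS::select masks dplyr::select if dplyr is also loaded

# Start from a model containing every candidate predictor
full_fit <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)

# Add or drop one term at a time, keeping a change only if it lowers AIC
step_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)

step_fit$anova     # the sequence of steps and the AIC after each one
formula(step_fit)  # the predictors that survive selection

# The selected model never has a higher AIC than the starting model
AIC(step_fit) <= AIC(full_fit)
```

The masking noted in the first comment is why the rest of this report calls `dplyr::select()` explicitly.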
### Final Variable Analysis
We will examine the final variables selected by both approaches by first interpreting the R-squared values and graphs to characterize the linear association between each variable and rank. Then, we will calculate the linear regression slopes between each of the explanatory variables (scaled and non-scaled) and college rank. We will use the scaled slopes to determine which explanatory variable has the greatest influence on a school having a low rank, and we will interpret the relationships using the non-scaled slopes.
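The relationship between scaled and non-scaled slopes can be checked on a small example. The sketch below again uses the built-in `mtcars` data as a stand-in (`mpg` for rank, `wt` for an explanatory variable): the scaled slope is simply the non-scaled slope multiplied by the predictor's standard deviation.

```r
# Non-scaled slope: change in the response per one unit of the predictor
fit_raw <- lm(mpg ~ wt, data = mtcars)

# Scaled slope: change in the response per one standard deviation of the predictor
fit_scaled <- lm(mpg ~ scale(wt), data = mtcars)

slope_raw <- unname(coef(fit_raw)["wt"])
slope_scaled <- unname(coef(fit_scaled)[2])

# The two slopes differ only by the factor sd(wt)
all.equal(slope_scaled, slope_raw * sd(mtcars$wt))
```

This is why scaled slopes are suited to comparing variables against each other, while non-scaled slopes are easier to interpret in each variable's original units.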
## Results
### Approach #1: Individual Numerical Variable Analysis
The table below gives the highest R-squared values for relationships between each individual explanatory variable in our data set and college rank. The full list of values is in the appendix.
```{r}
#| label: find-r2
numerical_variables <- c(
"ADM_RATE",
"UGDS",
"UGDS_WHITE",
"UGDS_BLACK",
"UGDS_HISP",
"UGDS_ASIAN",
"UGDS_AIAN",
"UGDS_NHPI",
"UGDS_2MOR",
"UGDS_NRA",
"UGDS_UNKN",
"NPT4_PUB",
"NPT4_PRIV",
"COSTT4_A",
"COSTT4_P",
"AVGFACSAL",
"PCTPELL",
"C150_4",
"AGE_ENTRY",
"FEMALE",
"MARRIED",
"FIRST_GEN",
"FAMINC",
"MD_FAMINC",
"ENDOWBEGIN",
"SAT_AVG",
"ACTCMMID"
)
r2s <- numeric()
for(var in numerical_variables) {
temp_df <- data |>
    dplyr::select(rank, all_of(var)) # all_of() selects the column named by the string in var
model <- linear_reg() |>
set_engine("lm") |>
fit(rank ~ ., data = temp_df)
r2 <- glance(model)$r.squared
r2s <- c(r2s, r2)
}
r_squared_values <- tibble(
variable = numerical_variables,
r_squared = r2s
) |>
filter(variable != "COSTT4_P")
r_squared_values |>
arrange(desc(r_squared)) |>
slice(1:6) |>
kable()
```
`COSTT4_P` (Average cost of attendance for program-year institutions) has been removed because there are only two observations, resulting in an R-squared value of 1.
Average SAT (`SAT_AVG`), median ACT (`ACTCMMID`), graduation rate (`C150_4`), average faculty salary (`AVGFACSAL`), and admission rate (`ADM_RATE`) are the five variables with the strongest correlation to rank, based on their R-squared values. Therefore, they are the variables we will be examining later in our analysis. We chose five as a cutoff because there is a substantial difference between the R-squared value of these five and the next variable (`PCTPELL`).
### Approach #2: Stepwise Regression Model
**Remove Highly Correlated Variables**
```{r}
#| label: correlation-matrix
#| output: false
data_for_corr <- scaled_data |>
  dplyr::select(where(is.numeric)) # predicate functions must be wrapped in where()
correlations = data_for_corr |>
dplyr::select(-c("NPT4_PUB", "NPT4_PRIV", "COSTT4_P")) |>
na.omit() |>
cor()
correlations |>
corrplot(method="color")
```
```{r}
#| label: display-correlation-pair-to-be-removed
# Blank out weak correlations and the diagonal; abs() keeps strong negative correlations too
correlations[abs(correlations) < 0.8 | correlations == 1] <- ""
```
| Variable Pairs with r \> 0.8 | Correlation Coefficients |
|------------------------------|--------------------------|
| C150_4, SAT_AVG | 0.8389 |
| C150_4, ACTCMMID | 0.8494 |
| AGE_ENTRY, MARRIED | 0.9059 |
| FAMINC, MD_FAMINC | 0.9538 |
| SAT_AVG, ACTCMMID | 0.9756 |
This table lists the pairs of highly correlated variables. We drop `C150_4`, `MD_FAMINC`, `ACTCMMID`, and `MARRIED`, and keep `SAT_AVG`, `FAMINC`, and `AGE_ENTRY` to represent each group of correlated variables.
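A table like the one above can also be generated from the correlation matrix rather than typed by hand. The following is a minimal sketch in base R; `mtcars` is a stand-in for the report's scaled data (an assumption), and the 0.8 threshold matches the one used above.

```r
# Stand-in data: mtcars replaces the report's scaled numeric data
cor_mat <- cor(mtcars)

# Look only above the diagonal so each pair appears once
# and the diagonal (r = 1) is skipped
high <- which(cor_mat > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)

# Sort from weakest to strongest, as in the table above
high_pairs[order(high_pairs$r), ]
```

Replacing `cor_mat > 0.8` with `abs(cor_mat) > 0.8` would also catch strong negative correlations, which the threshold above ignores.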
**Compute Stepwise Regression**
```{r}
#| label: stepwise-regression-model
#| fontsize: 2pt
stepwise_data <- scaled_data |>
dplyr::select(c("college", "rank", "REGION", "CONTROL", "CCBASIC", "ACCREDAGENCY", "ADM_RATE", "UGDS", "UGDS_WHITE", "UGDS_BLACK", "UGDS_HISP", "UGDS_ASIAN", "UGDS_AIAN", "UGDS_NHPI", "UGDS_2MOR", "UGDS_NRA", "UGDS_UNKN", "COSTT4_A", "AVGFACSAL", "PCTPELL", "AGE_ENTRY", "FAMINC", "ENDOWBEGIN", "SAT_AVG", "FEMALE", "FIRST_GEN"))
# Factorize variable
stepwise_data <- stepwise_data |>
mutate(
REGION = as.factor(REGION),
CONTROL = as.factor(CONTROL),
CCBASIC = as.factor(CCBASIC),
ACCREDAGENCY = as.factor(ACCREDAGENCY)
)
library(MASS)
stepwise_data = na.omit(stepwise_data)
# Fit the full model
full_model <- lm(rank ~. -college, data = stepwise_data)
# Stepwise regression model
stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)
# stepwise_model$anova
```
```{r}
#| label: visualize-stepwise-model
#| fig-height: 2.5
#| fig-width: 6
aic_value = stepwise_model$anova$AIC
# Step labels, in the order stepAIC dropped (-) or added (+) each term
variables = c("All", "-FAMINC", "-PCTPELL", "-ENDOWBEGIN", "-UGDS_NHPI", "-UGDS_AIAN", "-UGDS_NRA", "-UGDS_HISP", "-UGDS_BLACK", "-UGDS_WHITE", "-FEMALE", "+UGDS_HISP")
aic_df = data.frame(aic_value, variables)
# Keep the steps in selection order on the x-axis instead of alphabetical order
aic_df <- aic_df |>
  mutate(variables = fct_inorder(variables))
aic_df |>
ggplot(aes(x = variables, y = aic_value)) +
geom_point() +
geom_line(group = 1) +
scale_y_continuous(breaks = seq(3415, 3431, by = 2)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
labs(
title = "Variable selection process",
x = "Selection step",
y = "AIC"
)
```
The plot above displays the decreasing AIC value as variables get dropped and added. Each point represents a new iteration of the model after a variable is taken out (-) or added (+). The final iteration is the best model because its AIC value is the lowest.
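The selection path can also be read directly from the fitted object instead of being typed by hand. The sketch below assumes `mtcars` as stand-in data and uses the `anova` component that `MASS::stepAIC()` attaches to its result, which records each step and the AIC after it.

```r
library(MASS)

# Stand-in data: mtcars replaces the report's stepwise_data
full <- lm(mpg ~ ., data = mtcars)
step_fit <- stepAIC(full, direction = "both", trace = FALSE)

# One row per iteration: the term dropped (-) or added (+) and the AIC
path <- data.frame(
  step = as.character(step_fit$anova$Step),
  aic  = step_fit$anova$AIC
)
path$step[1] <- "All"  # the first row is the full model and has an empty label

path
```

Each step is accepted only if it lowers the AIC, so the last row always has the smallest value. Building the labels this way avoids a mismatch between the hand-typed labels and the steps the algorithm took.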
**Final Model**
    rank ~ REGION + CONTROL + CCBASIC + ACCREDAGENCY + ADM_RATE +
        UGDS + UGDS_ASIAN + UGDS_2MOR + UGDS_UNKN + COSTT4_A + AVGFACSAL +
        AGE_ENTRY + SAT_AVG + FIRST_GEN + UGDS_HISP
```{r}
#| label: r-squared-final-model
final_model_fit <- linear_reg() |>
fit(rank ~ REGION + CONTROL + CCBASIC + ACCREDAGENCY + ADM_RATE +
UGDS + UGDS_ASIAN + UGDS_2MOR + UGDS_UNKN + COSTT4_A + AVGFACSAL +
AGE_ENTRY + SAT_AVG + FIRST_GEN + UGDS_HISP, data = scaled_data)
```
Our final model has an R-squared value of `r round(glance(final_model_fit)$r.squared, 3)`, meaning that it accounts for this proportion of the variation in rank.
### Final Variable Analysis
**Categorical Variable Analysis**
All four categorical variables appeared in the final model, which suggests that each is associated with rank. (Note that for these graphs we dropped `NA`s and removed levels that had only one observation.)
```{r}
#| label: rank-vs-region
#| layout-nrow: 2
#| fig-height: 7
colleges_levels |>
drop_na(REGION) |>
mutate(
REGION = fct_relevel(REGION, "Plains", "Rocky Mountains", "Southwest", "Great Lakes", "Southeast", "Far West", "Mid East", "New England")
) |>
ggplot(aes(y = REGION, fill = fct_rev(level))) +
geom_bar(position = "fill") +
scale_fill_viridis_d(limits = c("1-100", "101-200", "201-300", "301-400", "401-500")) +
labs(
title = "Rank vs Geographic Region of the US",
x = "Proportion",
y = "Region",
fill = "Rank"
) +
theme(
legend.position = "left",
axis.text.y = element_text(size = 14)
)
colleges_levels |>
drop_na(ACCREDAGENCY) |>
filter(
ACCREDAGENCY != "EXEMPT" &
ACCREDAGENCY != "National Association of Schools of Theatre" &
ACCREDAGENCY != "Transnational Association of Christian Colleges and Schools"
) |>
mutate(
ACCREDAGENCY = fct_relevel(ACCREDAGENCY,"Accrediting Commission of Career Schools and Colleges", "Northwest Commission on Colleges and Universities", "Higher Learning Commission", "Southern Association of Colleges and Schools Commission on Colleges", "Middle States Commission on Higher Education", "Western Association of Schools and Colleges Senior Colleges and University Commission", "New England Commission on Higher Education")) |>
ggplot(aes(y = ACCREDAGENCY, fill = fct_rev(level))) +
geom_bar(position = "fill", show.legend = FALSE) +
scale_fill_viridis_d(limits = c("1-100", "101-200", "201-300", "301-400", "401-500")) +
labs(
title = "Rank vs Accrediting Agency",
x = "Proportion",
y = "Accrediting Agency",
fill = "Rank"
) +
scale_y_discrete(labels = label_wrap(30)) +
theme(axis.text.y = element_text(size = 14))
colleges_levels |>
drop_na(CONTROL) |>
mutate(
CONTROL = fct_relevel(CONTROL, "Private, For-profit", "Public", "Private, Non-profit")
) |>
ggplot(aes(y = CONTROL, fill = fct_rev(level))) +
geom_bar(position = "fill", show.legend = FALSE) +
scale_fill_viridis_d(limits = c("1-100", "101-200", "201-300", "301-400", "401-500")) +
labs(
title = "Rank vs Control",
x = "Proportion",
y = "Control",
fill = "Rank"
) +
theme(axis.text.y = element_text(size = 14))
colleges_levels |>
drop_na(CCBASIC) |>
filter(
CCBASIC != "Special Focus Four-Year: Business & Management Schools" &
CCBASIC != "Baccalaureate/Associate's Colleges: Mixed Baccalaureate/Associate's" &
CCBASIC != "Special Focus Four-Year: Other Special Focus Institutions" &
CCBASIC != "Special Focus Four-Year: Other Technology-Related Schools" &
CCBASIC != "Tribal Colleges"
) |>
mutate(
CCBASIC = fct_relevel(CCBASIC, "Special Focus Four-Year: Other Health Professions Schools", "Master's Colleges & Universities: Medium Programs", "Master's Colleges & Universities: Larger Programs", "Doctoral/Professional Universities", "Special Focus Four-Year: Arts, Music & Design Schools", "Special Focus Four-Year: Faith-Related Institutions", "Special Focus Four-Year: Engineering Schools", "Baccalaureate Colleges: Diverse Fields", "Master's Colleges & Universities: Small Programs", "Doctoral Universities: High Research Activity", "Baccalaureate Colleges: Arts & Sciences Focus", "Doctoral Universities: Very High Research Activity")
) |>
ggplot(aes(y = CCBASIC, fill = fct_rev(level))) +
geom_bar(position = "fill", show.legend = FALSE) +
scale_fill_viridis_d(limits = c("1-100", "101-200", "201-300", "301-400", "401-500")) +
labs(
title = "Rank vs Carnegie Classification",
x = "Proportion",
y = "Carnegie Classification (Basic)",
fill = "Rank"
) +
scale_y_discrete(labels = label_wrap(30)) +
theme(axis.text.y = element_text(size = 14))
```
The Geographic Region graph shows that New England has the highest proportion of top-100 schools, while the Plains has the lowest. Apart from the `New England` bar, the rank-group proportions do not differ dramatically between regions. The association between rank and region may therefore be driven largely by New England's high share of top-100 schools. Additionally, because accrediting agencies are largely regional, the Accrediting Agency graph shows a similar pattern.
There are only four `Private, For-profit` schools in the top 500, and all of them are ranked between 301 and 400. The rank proportions for `Private, Non-profit` and `Public` schools are similar, although the former appears to have a larger proportion of `1-100` schools and the latter a larger proportion of `401-500` schools.
The rank-group proportions differ most between bars in the Carnegie Classification graph, suggesting that this variable has the strongest association with rank. Within `Doctoral Universities: Very High Research Activity` and `Baccalaureate Colleges: Arts & Sciences Focus`, the better (numerically lower) rank groups account for larger shares of schools. The opposite appears to be true for all other classifications with three or more rank categories.
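The proportions read off these filled bar charts are within-category shares: each bar is rescaled so that its rank groups sum to 1. A minimal sketch of that calculation follows, using a made-up data frame `toy` whose rows are hypothetical and not drawn from the report's data.

```r
# Hypothetical data: seven made-up schools in two regions
toy <- data.frame(
  REGION = c("New England", "New England", "New England",
             "Plains", "Plains", "Plains", "Plains"),
  level  = c("1-100", "1-100", "101-200",
             "101-200", "401-500", "401-500", "401-500")
)

# Count schools per region and rank group
counts <- table(toy$REGION, toy$level)

# Divide each row by its total, which is what position = "fill" displays
props <- prop.table(counts, margin = 1)
round(props, 2)
```

Because every bar is normalized separately, a category with very few schools can look as prominent as one with hundreds, which is why levels with a single observation were removed from the graphs.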
**Numerical Variable Analysis**
All of the variables with the five highest R-squared values appeared in the final stepwise regression model except `ACTCMMID` (median ACT score) and `C150_4` (graduation rate). We excluded both before fitting because they are highly correlated with `SAT_AVG` (average SAT). Because `SAT_AVG` was retained in the final model, we can reasonably assume that `ACTCMMID` and `C150_4` also have a strong association with rank. Therefore, we conclude that `SAT_AVG`, `ACTCMMID`, `C150_4`, `AVGFACSAL`, and `ADM_RATE` have the strongest association with rank, and we characterize these relationships below.
```{r}