---
title: "Factors Most Important to University Ranking"
subtitle: "Report"
format: html
editor: visual
execute:
  echo: false
  message: false
  warning: false
editor_options:
  chunk_output_type: console
---

```{r}
#| label: load-pkgs
#| message: false

library(tidyverse)
library(corrplot)
library(tidymodels)
library(scales)
library(ggthemes)
library(gridExtra)
library(knitr)
```

## Introduction

### Research Background

Every year, millions of students apply to colleges across the United States, and many of them use college rankings lists from sources such as Niche.com to help them decide where to apply. In recent years, these lists have been heavily criticized as they encourage universities to prioritize certain metrics and manipulate the system to raise their ranks. It is important for students to know where these rankings come from and what they actually measure. In this project, we will explore how influential different metrics are in determining a college's ranking. For reference, when we say a "low rank," we refer to schools with a lower numerical rank, such as #1. When we say a "high rank," we refer to schools with a high numerical rank, such as #500.

### Data

```{r}
#| label: import

niche_data_500 <- read_csv("data/niche_data_500.csv")
us_dep_of_ed <- read_csv("data/us_dep_of_ed.csv", na = c("NULL", "PrivacySuppressed"))
```

For our analysis, we joined two datasets:

**Data Set #1 - Niche:** Niche's "2023 Best Colleges in America" list aggregates data from a variety of sources, including the US Department of Education and reviews from students and alumni, and is updated monthly. The Niche data was scraped by Maia on October 17-19, 2022. There are `r nrow(niche_data_500)` observations, representing the top 500 schools in the United States. Each observation has two variables: `college` (institution name) and `rank`.

**Data Set #2 - US Department of Education:** The second data set comes from the US Department of Education's College Scorecard, which is an exhaustive summary of characteristics and statistics for all colleges and universities in the United States. The College Scorecard is updated by the Education Department as it collects new data, but most of the data comes from the 2020-2021 school year. Scorecard data comes from institutional reports, federal financial aid records, tax records, and other federal agencies. There were 2,989 variables in the original data set, many of which we don't need to answer our question, so we selected `r ncol(us_dep_of_ed)` variables before importing into RStudio. We further narrowed to the 31 variables we thought could have the most influence on rank. There are `r nrow(us_dep_of_ed)` observations, representing all US colleges and universities.

**Variable Summary:**

-   `college`: Institution name (Categorical)
-   `rank`: Rank of school on Niche list (Quantitative)
-   `REGION`: US geographic region (C)
-   `ACCREDAGENCY`: Accrediting agency for institution (C)
-   `CONTROL`: Public, Private nonprofit, or Private for-profit (C)
-   `CCBASIC`: Carnegie Classification (basic) (C)
-   `ADM_RATE`: Admission rate (Q)
-   `UGDS`: Enrollment of undergraduate certificate/degree-seeking students (Q)
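-   `UGDS_WHITE`, `UGDS_BLACK`, `UGDS_HISP`, `UGDS_ASIAN`, `UGDS_AIAN`, `UGDS_NHPI`, `UGDS_2MOR`, `UGDS_NRA`, and `UGDS_UNKN`: Share of undergraduate enrollment in each racial/ethnic group, including non-resident students (`UGDS_NRA`) and students whose race is unknown (`UGDS_UNKN`) (Q)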
-   `NPT4_PUB`, `NPT4_PRIV`: Average net price for Title IV institutions (public and private) (Q)
-   `COSTT4_A`, `COSTT4_P`: Average cost of attendance (academic- and program-year institutions) (Q)
-   `AVGFACSAL`: Average faculty salary (Q)
-   `PCTPELL`: Percentage of undergraduates who receive a Pell Grant (Q)
-   `C150_4`: Completion rate for first-time, full-time students at four-year institutions (Q)
-   `AGE_ENTRY`: Average age of entry (Q)
-   `FEMALE`: Share of female students (Q)
-   `MARRIED`: Share of married students (Q)
-   `FIRST_GEN`: Share of first-generation students (Q)
-   `FAMINC`, `MD_FAMINC`: Average and median family income (Q)
-   `ENDOWBEGIN`: Value of school's endowment at the beginning of the fiscal year (Q)
-   `SAT_AVG`: Average SAT equivalent score of students admitted (Q)
-   `ACTCMMID`: Midpoint of the ACT cumulative score (Q)

#### Data Preparation

1.  To get the data, we scraped Niche.com and downloaded data from the US Department of Education. These steps were done in `niche-scrape.R`.
2.  Naming discrepancies between the two data sets were fixed to ensure a successful merge. `University of South Florida-Sarasota-Manatee` and `University of South Florida-St. Petersburg` were removed because they did not exist in both data sets.

```{r}
#| label: fix-college-names

us_dep_of_ed <- us_dep_of_ed |>
  mutate(
    INSTNM = case_when(
      INSTNM == "Columbia University in the City of New York" ~
        "Columbia University",

      INSTNM == "Washington University in St Louis" ~
        "Washington University in St. Louis",

      INSTNM == "University of California-Los Angeles" ~
        "University of California - Los Angeles",

      INSTNM == "University of Michigan-Ann Arbor" ~
        "University of Michigan - Ann Arbor",

      INSTNM == "Georgia Institute of Technology-Main Campus" ~
        "Georgia Institute of Technology",

      INSTNM == "University of Virginia-Main Campus" ~
        "University of Virginia",

      INSTNM == "United States Military Academy" ~
        "United States Military Academy at West Point",

      INSTNM == "The University of Texas at Austin" ~
        "University of Texas - Austin",

      INSTNM == "University of California-Berkeley" ~
        "University of California - Berkeley",

      INSTNM == "University of California-Irvine" ~
        "University of California - Irvine",

      INSTNM == "University of California-San Diego" ~
        "University of California - San Diego",

      INSTNM == "University of California-Davis" ~
        "University of California - Davis",

      INSTNM == "University of California-Santa Barbara" ~
        "University of California - Santa Barbara",

      INSTNM == "University of California-Santa Cruz" ~
        "University of California - Santa Cruz",

      INSTNM == "University of California-Riverside" ~
        "University of California - Riverside",

      INSTNM == "University of North Carolina Wilmington" ~
        "University of North Carolina - Wilmington",

      INSTNM == "University of North Carolina at Greensboro" ~
        "University of North Carolina - Greensboro",

      INSTNM == "Albany College of Pharmacy and Health Sciences" ~
        "Albany College of Pharmacy & Health Sciences",

      INSTNM == "Arizona State University Campus Immersion" ~
        "Arizona State University",

      INSTNM == "Arizona State University-Downtown Phoenix" ~
        "Arizona State University - Downtown Phoenix Campus",

      INSTNM == "Augustana College" ~
        "Augustana College - Illinois",

      UNITID == "173160" ~
        "Bethel University - Minnesota",

      INSTNM == "Binghamton University" ~
        "Binghamton University, SUNY",

      INSTNM == "Bowling Green State University-Main Campus" ~
        "Bowling Green State University",

      INSTNM == "California Polytechnic State University-San Luis Obispo" ~
        "California Polytechnic State University (Cal Poly) - San Luis Obispo",

      INSTNM == "California State University-Fullerton" ~
        "California State University - Fullerton",

      INSTNM == "California State University-Long Beach" ~
        "California State University - Long Beach",

      INSTNM == "The College of Wooster" ~
        "College of Wooster",

      INSTNM == "Colorado State University-Fort Collins" ~
        "Colorado State University",

      INSTNM == "Concordia University-Wisconsin" ~
        "Concordia University - Wisconsin",

      INSTNM == "CUNY Bernard M Baruch College" ~
        "CUNY Baruch College",

      INSTNM == "CUNY City College" ~
        "CUNY City College of New York",

      INSTNM == "D'Youville College" ~
        "D'Youville",

      INSTNM == "Eastern New Mexico University-Main Campus" ~
        "Eastern New Mexico University",

      INSTNM == "Embry-Riddle Aeronautical University-Daytona Beach" ~
        "Embry-Riddle Aeronautical University - Daytona Beach",

      INSTNM == "Embry-Riddle Aeronautical University-Worldwide" ~
        "Embry-Riddle Aeronautical University - Worldwide",

      INSTNM == "Florida Agricultural and Mechanical University" ~
        "Florida A&M University",

      INSTNM == "Franklin and Marshall College" ~
        "Franklin & Marshall College",

      INSTNM == "Hobart William Smith Colleges" ~
        "Hobart and William Smith Colleges",

      INSTNM == "Indiana University-Bloomington" ~
        "Indiana University - Bloomington",

      INSTNM == "Indiana University-East" ~
        "Indiana University - East",

      INSTNM == "Indiana University-Purdue University-Indianapolis" ~
        "Indiana University-Purdue University - Indianapolis (IUPUI)",

      INSTNM == "Indiana Wesleyan University-Marion" ~
        "Indiana Wesleyan University",

      INSTNM == "Keiser University-Ft Lauderdale" ~
        "Keiser University - Fort Lauderdale",

      INSTNM == "Kent State University at Kent" ~
        "Kent State University",

      INSTNM == "Louisiana State University and Agricultural & Mechanical College" ~
        "Louisiana State University",

      UNITID == "151786" ~
        "Marian University Indianapolis",

      INSTNM == "Maryville University of Saint Louis" ~
        "Maryville University",

      INSTNM == "Metropolitan State University" ~
        "Metropolitan State University - Minnesota",

      INSTNM == "Miami University-Oxford" ~
        "Miami University",

      INSTNM == "Minnesota State University-Mankato" ~
        "Minnesota State University, Mankato",

      INSTNM == "Molloy College" ~
        "Molloy University",

      INSTNM == "Monroe College" ~
        "Monroe College - Bronx/New Rochelle",

      INSTNM == "Mount Saint Mary's University" ~
        "Mount Saint Mary's University Los Angeles",

      INSTNM == "New Mexico State University-Main Campus" ~
        "New Mexico State University",

      INSTNM == "New Mexico Institute of Mining and Technology" ~
        "New Mexico Tech",

      INSTNM == "North Carolina State University at Raleigh" ~
        "North Carolina State University",

      INSTNM == "North Dakota State University-Main Campus" ~
        "North Dakota State University",

      UNITID ==  "154101" ~
        "Northwestern College - Iowa",

      INSTNM == "Ohio University-Main Campus" ~
        "Ohio University",

      INSTNM == "Oklahoma State University-Main Campus" ~
        "Oklahoma State University",

      INSTNM == "Pacific University" ~
        "Pacific University Oregon",

      INSTNM == "The Pennsylvania State University" ~
        "Penn State",

      INSTNM == "Pennsylvania State University-Penn State Fayette- Eberly" ~
        "Penn State Fayette, The Eberly Campus",

      INSTNM == "Pennsylvania State University-Penn State York" ~
        "Penn State York",

      INSTNM == "Purdue University-Main Campus" ~
        "Purdue University",

      INSTNM == "Rutgers University-New Brunswick" ~
        "Rutgers University - New Brunswick",

      INSTNM == "Rutgers University-Newark" ~
        "Rutgers University - Newark",

      INSTNM == "The University of the South" ~
        "Sewanee - The University of the South",

      INSTNM == "South Dakota School of Mines and Technology" ~
        "South Dakota School of Mines & Technology",

      INSTNM == "St Bonaventure University" ~
        "St. Bonaventure University",

      INSTNM == "Saint John Fisher College" ~
        "St. John Fisher University",

      INSTNM == "St. Joseph's University-New York" ~
        "St. Joseph's University - New York",

      INSTNM == "St Lawrence University" ~
        "St. Lawrence University",

      INSTNM == "St Olaf College" ~
        "St. Olaf College",

      INSTNM == "Stanbridge University" ~
        "Stanbridge University - Orange County",

      INSTNM == "Stephen F Austin State University" ~
        "Stephen F. Austin State University",

      INSTNM == "Stony Brook University" ~
        "Stony Brook University, SUNY",

      INSTNM == "SUNY College of Environmental Science and Forestry" ~
        "SUNY College of Environmental Science & Forestry",

      INSTNM == "Farmingdale State College" ~
        "SUNY Farmingdale State College",

      INSTNM == "State University of New York at New Paltz" ~
        "SUNY New Paltz",

      INSTNM == "Texas A & M University-College Station" ~
        "Texas A&M University",

      INSTNM == "American Musical and Dramatic Academy" ~
        "The American Musical and Dramatic Academy (AMDA) - New York",

      INSTNM == "The College of Saint Scholastica" ~
        "The College of St. Scholastica",

      INSTNM == "Cooper Union for the Advancement of Science and Art" ~
        "The Cooper Union for the Advancement of Science and Art",

      INSTNM == "The Master's University and Seminary" ~
        "The Master's University",

      INSTNM == "Ohio State University-Main Campus" ~
        "The Ohio State University",

      INSTNM == "University of Alabama in Huntsville" ~
        "The University of Alabama in Huntsville",

      INSTNM == "University of Baltimore" ~
        "The University of Baltimore",

      INSTNM == "University of Tulsa" ~
        "The University of Tulsa",

      INSTNM == "University of Virginia's College at Wise" ~
        "The University of Virginia's College at Wise",

      INSTNM == "Trinity College" ~
        "Trinity College - Connecticut",

      INSTNM == "Tulane University of Louisiana" ~
        "Tulane University",

      UNITID == "196866" ~
        "Union College - New York",

      INSTNM == "University at Buffalo" ~
        "University at Buffalo, SUNY",

      INSTNM == "University of Alabama at Birmingham" ~
        "University of Alabama - Birmingham",

      INSTNM == "University of Cincinnati-Main Campus" ~
        "University of Cincinnati",

      INSTNM == "University of Colorado Denver/Anschutz Medical Campus" ~
        "University of Colorado Denver",

      INSTNM == "University of Maryland-College Park" ~
        "University of Maryland - College Park",

      INSTNM == "University of Maryland-Baltimore County" ~
        "University of Maryland, Baltimore County",

      INSTNM == "University of Massachusetts-Amherst" ~
        "University of Massachusetts - Amherst",

      INSTNM == "University of Massachusetts-Lowell" ~
        "University of Massachusetts Lowell",

      INSTNM == "University of Michigan-Dearborn" ~
        "University of Michigan - Dearborn",

      INSTNM == "University of Minnesota-Crookston" ~
        "University of Minnesota Crookston",

      INSTNM == "University of Minnesota-Duluth" ~
        "University of Minnesota Duluth",

      INSTNM == "University of Minnesota-Twin Cities" ~
        "University of Minnesota Twin Cities",

      INSTNM == "University of Missouri-Columbia" ~
        "University of Missouri",

      INSTNM == "Embry-Riddle Aeronautical University-Prescott" ~
        "Embry-Riddle Aeronautical University - Prescott",

      INSTNM == "South Dakota State University" ~
        "South Dakota State University",

      INSTNM == "St Catherine University" ~
        "St. Catherine University",

      INSTNM == "University of Missouri-Kansas City" ~
        "University of Missouri - Kansas City",

      INSTNM == "University of Missouri-St Louis" ~
        "University of Missouri - St. Louis",

      INSTNM == "University of Nebraska-Lincoln" ~
        "University of Nebraska - Lincoln",

      INSTNM == "University of Nevada-Reno" ~
        "University of Nevada - Reno",

      INSTNM == "University of New Hampshire-Main Campus" ~
        "University of New Hampshire",

      INSTNM == "University of Oklahoma-Norman Campus" ~
        "University of Oklahoma",

      INSTNM == "University of Pittsburgh-Pittsburgh Campus" ~
        "University of Pittsburgh",

      INSTNM == "University of South Carolina-Columbia" ~
        "University of South Carolina",

      INSTNM == "University of St Francis" ~
        "University of St. Francis - Illinois",

      UNITID == "174914" ~
        "University of St. Thomas - Minnesota",

      UNITID == "227863" ~
        "University of St. Thomas - Texas",

      INSTNM == "The University of Tennessee-Knoxville" ~
        "University of Tennessee",

      INSTNM == "The University of Tennessee-Martin" ~
        "University of Tennessee at Martin",

      INSTNM == "The University of Texas at Arlington" ~
        "University of Texas - Arlington",

      INSTNM == "The University of Texas at Dallas" ~
        "University of Texas - Dallas",

      INSTNM == "The University of Texas Permian Basin" ~
        "University of Texas - Permian Basin",

      INSTNM == "The University of Texas Rio Grande Valley" ~
        "University of Texas - Rio Grande Valley",

      INSTNM == "The University of Texas at Tyler" ~
        "University of Texas - Tyler",

      INSTNM == "University of Washington-Seattle Campus" ~
        "University of Washington",

      INSTNM == "University of Washington-Bothell Campus" ~
        "University of Washington - Bothell",

      INSTNM == "University of Washington-Tacoma Campus" ~
        "University of Washington - Tacoma",

      INSTNM == "The University of West Florida" ~
        "University of West Florida",

      INSTNM == "University of Wisconsin-Madison" ~
        "University of Wisconsin",

      INSTNM == "University of Wisconsin-La Crosse" ~
        "University of Wisconsin - La Crosse",

      INSTNM == "Virginia Polytechnic Institute and State University" ~
        "Virginia Tech",

      INSTNM == "West Texas A & M University" ~
        "West Texas A&M University",

      UNITID == "230807" ~
        "Westminster College - Utah",

      INSTNM == "Wheaton College" ~
        "Wheaton College - Illinois",

      INSTNM == "Wheaton College (Massachusetts)" ~
        "Wheaton College - Massachusetts",

      TRUE ~ INSTNM
    )
  )
```

3.  The data sets were joined using the `left_join()` function. The resulting data set has 498 observations with 33 variables.

```{r}
#| label: join-datasets

colleges <- niche_data_500 |>
  left_join(us_dep_of_ed, by = c("college" = "INSTNM")) |>
  filter(rank != 141) |>
  filter(rank != 181)
```
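One way to check that name fixes like the ones above worked is to list the ranked schools that still have no match in the Scorecard data. The sketch below is illustrative only: `niche_toy` and `scorecard_toy` are hypothetical stand-ins with made-up values, not our data, and the check relies on `anti_join()` from `dplyr`.

```r
library(dplyr)

# Hypothetical stand-ins for the two data sets (illustrative values only)
niche_toy <- tibble(
  college = c("College A", "College B - Main", "College C"),
  rank = c(1, 2, 3)
)
scorecard_toy <- tibble(
  INSTNM = c("College A", "College B-Main", "College C"),
  ADM_RATE = c(0.10, 0.25, 0.40)
)

# Rows of niche_toy whose name has no exact match in scorecard_toy
unmatched <- niche_toy |>
  anti_join(scorecard_toy, by = c("college" = "INSTNM"))

unmatched$college
```

Any name returned here would need either a new entry in the renaming step or an explicit removal before the join.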

```{r}
#| label: select-variables

chosen_variables <- c(
  "college",
  "rank",
  "REGION",
  "ACCREDAGENCY",
  "CONTROL",
  "CCBASIC",
  "ADM_RATE",
  "UGDS",
  "UGDS_WHITE",
  "UGDS_BLACK",
  "UGDS_HISP",
  "UGDS_ASIAN",
  "UGDS_AIAN",
  "UGDS_NHPI",
  "UGDS_2MOR",
  "UGDS_NRA",
  "UGDS_UNKN",
  "NPT4_PUB",
  "NPT4_PRIV",
  "COSTT4_A",
  "COSTT4_P",
  "AVGFACSAL",
  "PCTPELL",
  "C150_4",
  "AGE_ENTRY",
  "FEMALE",
  "MARRIED",
  "FIRST_GEN",
  "FAMINC",
  "MD_FAMINC",
  "ENDOWBEGIN",
  "SAT_AVG",
  "ACTCMMID"
)

data <- colleges |>
  dplyr::select(all_of(chosen_variables))
```

4.  Categorical variables whose levels were listed as numbers were updated to reflect their interpretable levels.

```{r}
#| label: clean-categorical-variables

data <- data |>
  mutate(
    REGION = as.character(REGION),
    REGION = case_when(
      REGION == 1 ~ "New England",
      REGION == 2 ~ "Mid East",
      REGION == 3 ~ "Great Lakes",
      REGION == 4 ~ "Plains",
      REGION == 5 ~ "Southeast",
      REGION == 6 ~ "Southwest",
      REGION == 7 ~ "Rocky Mountains",
      REGION == 8 ~ "Far West",
      REGION == 9 ~ "Outlying Areas"
    ),
    CONTROL = as.character(CONTROL),
    CONTROL = case_when(
      CONTROL == 1 ~ "Public",
      CONTROL == 2 ~ "Private, Non-profit",
      CONTROL == 3 ~ "Private, For-profit"
    ),
    CCBASIC = as.character(CCBASIC),
    CCBASIC = case_when(
      CCBASIC == -2 ~ "Not applicable",
      CCBASIC == 0 ~ "Not classified",
      CCBASIC == 1 ~ "Associate's Colleges: High Transfer-High Traditional",
      CCBASIC == 2 ~ "Associate's Colleges: High Transfer-Mixed Traditional/Nontraditional",
      CCBASIC == 3 ~ "Associate's Colleges: High Transfer-High Nontraditional",
      CCBASIC == 4 ~ "Associate's Colleges: Mixed Transfer/Career & Technical-High Traditional",
      CCBASIC == 5 ~ "Associate's Colleges: Mixed Transfer/Career & Technical-Mixed Traditional/Nontraditional",
      CCBASIC == 6 ~ "Associate's Colleges: Mixed Transfer/Career & Technical-High Nontraditional",
      CCBASIC == 7 ~ "Associate's Colleges: High Career & Technical-High Traditional",
      CCBASIC == 8 ~ "Associate's Colleges: High Career & Technical-Mixed Traditional/Nontraditional",
      CCBASIC == 9 ~ "Associate's Colleges: High Career & Technical-High Nontraditional",
      CCBASIC == 10 ~ "Special Focus Two-Year: Health Professions",
      CCBASIC == 11 ~ "Special Focus Two-Year: Technical Professions",
      CCBASIC == 12 ~ "Special Focus Two-Year: Arts & Design",
      CCBASIC == 13 ~ "Special Focus Two-Year: Other Fields",
      CCBASIC == 14 ~ "Baccalaureate/Associate's Colleges: Associate's Dominant",
      CCBASIC == 15 ~ "Doctoral Universities: Very High Research Activity",
      CCBASIC == 16 ~ "Doctoral Universities: High Research Activity",
      CCBASIC == 17 ~ "Doctoral/Professional Universities",
      CCBASIC == 18 ~ "Master's Colleges & Universities: Larger Programs",
      CCBASIC == 19 ~ "Master's Colleges & Universities: Medium Programs",
      CCBASIC == 20 ~ "Master's Colleges & Universities: Small Programs",
      CCBASIC == 21 ~ "Baccalaureate Colleges: Arts & Sciences Focus",
      CCBASIC == 22 ~ "Baccalaureate Colleges: Diverse Fields",
      CCBASIC == 23 ~ "Baccalaureate/Associate's Colleges: Mixed Baccalaureate/Associate's",
      CCBASIC == 24 ~ "Special Focus Four-Year: Faith-Related Institutions",
      CCBASIC == 25 ~ "Special Focus Four-Year: Medical Schools & Centers",
      CCBASIC == 26 ~ "Special Focus Four-Year: Other Health Professions Schools",
      CCBASIC == 27 ~ "Special Focus Four-Year: Engineering Schools",
      CCBASIC == 28 ~ "Special Focus Four-Year: Other Technology-Related Schools",
      CCBASIC == 29 ~ "Special Focus Four-Year: Business & Management Schools",
      CCBASIC == 30 ~ "Special Focus Four-Year: Arts, Music & Design Schools",
      CCBASIC == 31 ~ "Special Focus Four-Year: Law Schools",
      CCBASIC == 32 ~ "Special Focus Four-Year: Other Special Focus Institutions",
      CCBASIC == 33 ~ "Tribal Colleges"
    )
  )
```

5.  All of the numerical variables are on different scales, so we created a scaled version of the numerical explanatory variables, with mean 0 and standard deviation 1.

```{r}
#| label: scale-numerical-variables

scaled_continuous_numeric_variables <- data |>
  dplyr::select(where(is.numeric)) |>
  dplyr::select(-rank) |>
  scale() |>
  data.frame()

other_variables <- data |>
  dplyr::select(rank | REGION | CONTROL | CCBASIC | !where(is.numeric))

scaled_data <- cbind(other_variables, scaled_continuous_numeric_variables) |>
  relocate(college)

```
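As a quick illustration of what the scaling step does, the sketch below applies `scale()` to a small made-up data frame (`toy` and its values are hypothetical, chosen only to mimic variables on very different scales) and confirms that each column ends up with mean 0 and standard deviation 1.

```r
# Toy columns on very different scales (illustrative values only)
toy <- data.frame(
  adm_rate = c(0.05, 0.20, 0.50, 0.80),
  sat_avg = c(1500, 1350, 1200, 1050)
)

# scale() centers each column at 0 and divides by its standard deviation
toy_scaled <- data.frame(scale(toy))

# Column means are 0 (up to rounding) and standard deviations are 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```

After scaling, a one-unit change in any variable means a change of one standard deviation, which is what makes slopes comparable across variables.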

#### Exploratory Data Analysis

**Means of Selected Numerical Variables by Rank Group**

```{r}
#| label: exploration-by-rank-group

colleges_levels <- data |>
  mutate(
    level = case_when(
      rank <= 100 ~ "1-100",
      rank > 100 & rank <= 200 ~ "101-200",
      rank > 200 & rank <= 300 ~ "201-300",
      rank > 300 & rank <= 400 ~ "301-400",
      rank > 400 & rank <= 500 ~ "401-500",
    ),
    .after = rank
  )

colleges_levels |>
  group_by(level) |>
  summarize(
    mean_ADM_Rate = mean(ADM_RATE, na.rm = TRUE),
    mean_SAT_AVG = mean(SAT_AVG, na.rm = TRUE),
    mean_ACTCMMID = mean(ACTCMMID, na.rm = TRUE),
    mean_UGDS_WHITE = mean(UGDS_WHITE, na.rm = TRUE),
    mean_UGDS_ASIAN = mean(UGDS_ASIAN, na.rm = TRUE),
    mean_FAMINC = mean(FAMINC, na.rm = TRUE)
  ) |>
  rename(
    "Interval" = level,
    "Mean Admission Rate" = mean_ADM_Rate,
    "Mean SAT Average" = mean_SAT_AVG,
    "Mean ACT Median" =  mean_ACTCMMID,
    "Mean % White Students" = mean_UGDS_WHITE,
    "Mean % Asian Students" = mean_UGDS_ASIAN,
    "Mean Family Income" = mean_FAMINC
  ) |>
  kable()
```

As the rank group gets higher, the mean admission rate increases, and the mean SAT average, ACT median, and family income decrease. Schools ranked 1-100 have a smaller share of White students and a larger share of Asian students than every other rank group. We also observed a relationship between the categorical variables and rank; that exploratory analysis is in the appendix.
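The rank groups used in this table can also be built with base R's `cut()`. The sketch below is one possible alternative, not the code used for the table; `toy_rank` is a hypothetical vector chosen to sit at the edges of each interval.

```r
# Example ranks at the edges of each interval (illustrative values only)
toy_rank <- c(1, 100, 101, 250, 400, 401, 500)

# Right-closed bins: (0,100], (100,200], ..., (400,500]
level <- cut(
  toy_rank,
  breaks = seq(0, 500, by = 100),
  labels = c("1-100", "101-200", "201-300", "301-400", "401-500")
)

as.character(level)
```

Because `cut()` closes intervals on the right by default, rank 100 falls in "1-100" and rank 101 in "101-200", matching the grouping described above.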

### Research Question and Hypothesis

**Question:** Which characteristics of a university are most associated with rankings on the Niche College Ranking list? Of these characteristics, how does each relate to a school having a high or low rank?

**Hypothesis:** We hypothesize that SAT/ACT scores, acceptance rate, and family income will have the strongest association with rank. Because Niche's audience is in large part students applying to college, we believe that Niche prioritizes variables important in the college admissions process. Of these variables, we predict that SAT/ACT scores and family income will have a strong negative relationship with rank, and acceptance rate will have a strong positive relationship with rank.

## Methodology

We split the first part of our analysis into two approaches. In the first, we look at the linear relationship between the numerical explanatory variables and college rank using R-squared values. In the second, we build a stepwise regression model between the variables and college rank. As the variables in the final model will be most important for determining rank, we will use the model results to corroborate our results from the first approach. As we cannot find an R-squared value or another numerical metric to measure a relationship involving a categorical variable, we will use the stepwise regression model to determine if there is a strong association between those variables and rank. In the second part of our analysis, we will combine the results of the two approaches and characterize the relationship between rank and the variables with the strongest association.

### Approach #1: Individual Numerical Variable Analysis

First, we will create linear regression models between each individual explanatory variable and rank. Then, we will calculate the R-squared value for each respective model, rank the values from highest to lowest, and select the variables with the highest R-squared values.
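A minimal sketch of this procedure, using the built-in `mtcars` data as a stand-in for our college data (here `mpg` plays the role of rank, and the three predictors are chosen arbitrarily for illustration):

```r
# mtcars stands in for the college data; mpg plays the role of rank
predictors <- c("wt", "hp", "qsec")

# Fit one simple regression per predictor and keep its R-squared
r2 <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "mpg"), data = mtcars)
  summary(fit)$r.squared
})

# Sort from strongest to weakest linear association
sort(r2, decreasing = TRUE)
```

For a regression with a single predictor, R-squared equals the squared correlation between the predictor and the response, so this ordering is the same as ordering by the absolute value of r.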

### Approach #2: Stepwise Regression Modeling

A stepwise regression model can handle a large number of potential predictor variables and chooses the best predictors from the available options. In our case, we have more than 25 variables to examine, so an automated workflow for model selection is crucial.

There are two main steps in this approach. First, we will create a correlation matrix to check the correlation coefficients between variables. If two variables have an absolute value of r greater than 0.8, meaning they are too similar in how they factor into rankings, we will keep only one of them in the model. Then, we will compute the stepwise regression model (both forward and backward selection) using the `stepAIC()` function from the `MASS` package, which selects models based on the Akaike Information Criterion (AIC). AIC is commonly used in statistical practice to compare candidate models and determine which one best fits the data. For the initial linear regression model, we will include all the remaining variables as predictors of rank.
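The stepwise step can be sketched on a small example. The code below uses the built-in `mtcars` data as a stand-in (with `mpg` in place of rank); `full_toy` and `step_toy` are hypothetical names, and the predictors are arbitrary. Calling `MASS::stepAIC()` with the namespace prefix avoids attaching `MASS`, which would otherwise mask `dplyr::select()`.

```r
# mtcars stands in for the college data; mpg plays the role of rank
full_toy <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Search in both directions, starting from the full model
step_toy <- MASS::stepAIC(full_toy, direction = "both", trace = FALSE)

# The selected formula and the AIC recorded at each step
formula(step_toy)
step_toy$anova$AIC
```

Each step is accepted only if it lowers AIC, so the recorded AIC values never increase from one step to the next.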

### Final Variable Analysis

We will examine the final variables selected by both approaches by first interpreting the R-squared values and graphs to characterize the linear association between each variable and rank. Then, we will calculate the linear regression slopes between each of the explanatory variables (scaled and non-scaled) and college rank. We will use the scaled slopes to determine which explanatory variable has the greatest influence on a school having a low rank, and we will interpret the relationships using the non-scaled slopes.
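To illustrate why we use both kinds of slope, the sketch below fits the same simple regression twice on the built-in `mtcars` data (a stand-in for our data, with `mpg` in place of rank): once with the raw predictor and once with the predictor standardized. The name `wt_z` is hypothetical.

```r
# mtcars stands in for the college data; mpg plays the role of rank
raw_slope <- coef(lm(mpg ~ wt, data = mtcars))[["wt"]]

# Same model with the predictor standardized to mean 0 and sd 1
mtcars_scaled <- transform(mtcars, wt_z = as.numeric(scale(wt)))
scaled_slope <- coef(lm(mpg ~ wt_z, data = mtcars_scaled))[["wt_z"]]

# The scaled slope equals the raw slope times the predictor's sd
c(
  raw = raw_slope,
  scaled = scaled_slope,
  raw_times_sd = raw_slope * sd(mtcars$wt)
)
```

The scaled slope gives the change in the response per one standard deviation of the predictor, which is comparable across predictors, while the raw slope is in the predictor's original units and is easier to interpret.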

## Results

### Approach #1: Individual Numerical Variable Analysis

The table below gives the highest R-squared values for relationships between each individual explanatory variable in our data set and college rank. The full list of values is in the appendix.

```{r}
#| label: find-r2

numerical_variables  <- c(
  "ADM_RATE",
  "UGDS",
  "UGDS_WHITE",
  "UGDS_BLACK",
  "UGDS_HISP",
  "UGDS_ASIAN",
  "UGDS_AIAN",
  "UGDS_NHPI",
  "UGDS_2MOR",
  "UGDS_NRA",
  "UGDS_UNKN",
  "NPT4_PUB",
  "NPT4_PRIV",
  "COSTT4_A",
  "COSTT4_P",
  "AVGFACSAL",
  "PCTPELL",
  "C150_4",
  "AGE_ENTRY",
  "FEMALE",
  "MARRIED",
  "FIRST_GEN",
  "FAMINC",
  "MD_FAMINC",
  "ENDOWBEGIN",
  "SAT_AVG",
  "ACTCMMID"
)

r2s <- numeric()

for(var in numerical_variables) {
  temp_df <- data |>
    dplyr::select(rank, all_of(var))

  model <- linear_reg() |>
    set_engine("lm") |>
    fit(rank ~ ., data = temp_df)

  r2 <- glance(model)$r.squared

  r2s <- c(r2s, r2)
}

r_squared_values <- tibble(
  variable = numerical_variables,
  r_squared = r2s
  ) |>
  filter(variable != "COSTT4_P")

r_squared_values |>
  arrange(desc(r_squared)) |>
  slice(1:6) |>
  kable()
```

`COSTT4_P` (Average cost of attendance for program-year institutions) has been removed because there are only two observations, resulting in an R-squared value of 1.

Average SAT (`SAT_AVG`), median ACT (`ACTCMMID`), graduation rate (`C150_4`), average faculty salary (`AVGFACSAL`), and admission rate (`ADM_RATE`) are the five variables with the strongest correlation to rank, based on their R-squared values. Therefore, they are the variables we will be examining later in our analysis. We chose five as a cutoff because there is a substantial difference between the R-squared value of these five and the next variable (`PCTPELL`).

### Approach #2: Stepwise Regression Model

**Remove Highly Correlated Variables**

```{r}
#| label: correlation-matrix
#| output: false

data_for_corr <- scaled_data |>
  dplyr::select(where(is.numeric))

correlations <- data_for_corr |>
  dplyr::select(-c("NPT4_PUB", "NPT4_PRIV", "COSTT4_P")) |>
  na.omit() |>
  cor()

correlations |>
  corrplot(method = "color")
```

```{r}
#| label: display-correlation-pair-to-be-removed

# Blank out weak correlations (|r| < 0.8) and the diagonal so only strong pairs remain
correlations[abs(correlations) < 0.8 | correlations == 1] <- ""
```

| Variable Pairs with r \> 0.8 | Correlation Coefficients |
|------------------------------|--------------------------|
| C150_4, SAT_AVG              | 0.8389                   |
| C150_4, ACTCMMID             | 0.8494                   |
| AGE_ENTRY, MARRIED           | 0.9059                   |
| FAMINC, MD_FAMINC            | 0.9538                   |
| SAT_AVG, ACTCMMID            | 0.9756                   |

This table displays the highly correlated variable pairs. We will drop `C150_4`, `MD_FAMINC`, `ACTCMMID`, and `MARRIED`, and keep `SAT_AVG`, `FAMINC`, and `AGE_ENTRY` to represent each correlated pair.
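A table of this kind can be produced directly from a correlation matrix. The sketch below is one way to do it, shown on the built-in `mtcars` data rather than our data; `cor_toy` and `high_pairs` are hypothetical names.

```r
# mtcars stands in for the scaled numeric college variables
cor_toy <- cor(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")])

# Keep each pair once (upper triangle) where |r| exceeds the 0.8 cutoff
high <- which(abs(cor_toy) > 0.8 & upper.tri(cor_toy), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_toy)[high[, "row"]],
  var2 = colnames(cor_toy)[high[, "col"]],
  r = round(cor_toy[high], 4)
)

high_pairs
```

Filtering on the absolute value catches strong negative correlations as well as positive ones, and restricting to the upper triangle drops the diagonal and avoids listing each pair twice.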

**Compute Stepwise Regression**

```{r}
#| label: stepwise-regression-model
#| fontsize: 2pt

stepwise_data <- scaled_data |>
  dplyr::select(c("college", "rank", "REGION", "CONTROL", "CCBASIC", "ACCREDAGENCY", "ADM_RATE", "UGDS", "UGDS_WHITE", "UGDS_BLACK", "UGDS_HISP", "UGDS_ASIAN", "UGDS_AIAN", "UGDS_NHPI", "UGDS_2MOR", "UGDS_NRA", "UGDS_UNKN", "COSTT4_A", "AVGFACSAL", "PCTPELL", "AGE_ENTRY", "FAMINC", "ENDOWBEGIN", "SAT_AVG", "FEMALE", "FIRST_GEN"))

# Factorize variable
stepwise_data <- stepwise_data |>
  mutate(
    REGION = as.factor(REGION),
    CONTROL = as.factor(CONTROL),
    CCBASIC = as.factor(CCBASIC),
    ACCREDAGENCY = as.factor(ACCREDAGENCY)
    )
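# MASS masks dplyr::select(), which is why select() calls are namespaced
library(MASS)

# Drop rows with missing values so every candidate model is fit to the same rows
stepwise_data <- na.omit(stepwise_data)

# Fit the full model
full_model <- lm(rank ~ . - college, data = stepwise_data)

# Stepwise regression model
stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)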

# stepwise_model$anova
```

```{r}
#| label: visualize-stepwise-model
#| fig-height: 2.5
#| fig-width: 6

aic_value <- stepwise_model$anova$AIC

# Step labels in the order stepAIC() applied them ("-" = dropped, "+" = added)
variables <- c("All", "-FAMINC", "-PCTPELL", "-ENDOWBEGIN", "-UGDS_NHPI", "-UGDS_AIAN", "-UGDS_NRA", "-UGDS_HISP", "-UGDS_BLACK", "-UGDS_WHITE", "-FEMALE", "+UGDS_HISP")

# fct_inorder() keeps the steps in selection order on the x-axis
aic_df <- data.frame(aic_value, variables) |>
  mutate(variables = fct_inorder(variables))

aic_df |>
  ggplot(aes(x = variables, y = aic_value)) +
  geom_point() +
  geom_line(group = 1) +
  scale_y_continuous(breaks = seq(3415, 3431, by = 2)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(
    title = "Variable selection process",
    x = "Selection step",
    y = "AIC"
  )

```

The plot above shows the AIC value decreasing as variables are dropped from or added back to the model. Each point represents a new iteration of the model after a variable is taken out (-) or added (+). Selection stops at the final iteration, which has the lowest AIC of the models considered, so that is the model we keep.
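
The step labels in the plot are typed by hand. A hedged sketch of reading them from the fitted object instead is shown below. It assumes the `anova` component returned by `MASS::stepAIC()` carries `Step` and `AIC` columns, and it uses the built-in `mtcars` data as a stand-in for the report's data. `demo_full`, `demo_model`, and `steps` are hypothetical names, and the block is not run as part of the report.

```r
# Stand-in model: mtcars replaces the report's stepwise_data
demo_full <- lm(mpg ~ ., data = mtcars)
demo_model <- MASS::stepAIC(demo_full, direction = "both", trace = FALSE)

# One row per selection step; the starting model has an empty label
steps <- data.frame(
  index = seq_along(demo_model$anova$AIC),
  step  = as.character(demo_model$anova$Step),
  aic   = demo_model$anova$AIC
)
steps$step[steps$step == ""] <- "All"

print(steps)
```

Plotting `aic` against `index` and using `step` for the axis labels would keep the figure in sync with the model if the selection path changes.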

**Final Model**

    rank ~ REGION + CONTROL + CCBASIC + ACCREDAGENCY + ADM_RATE +
        UGDS + UGDS_ASIAN + UGDS_2MOR + UGDS_UNKN + COSTT4_A + AVGFACSAL +
        AGE_ENTRY + SAT_AVG + FIRST_GEN + UGDS_HISP

```{r}
#| label: r-squared-final-model

final_model_fit <- linear_reg() |>
  fit(rank ~ REGION + CONTROL + CCBASIC + ACCREDAGENCY + ADM_RATE +
    UGDS + UGDS_ASIAN + UGDS_2MOR + UGDS_UNKN + COSTT4_A + AVGFACSAL +
    AGE_ENTRY + SAT_AVG + FIRST_GEN + UGDS_HISP, data = scaled_data)

```

Our final model has an R-squared value of `r round(glance(final_model_fit)$r.squared, 3)`, meaning it accounts for that proportion of the variation in rank.

### Final Variable Analysis

**Categorical Variable Analysis**

All four categorical variables appeared in the final model, which suggests that each of them is associated with rank. (Note that for these graphs we dropped `NA`s and removed levels that had only one observation.)

```{r}
#| label: rank-vs-region
#| layout-nrow: 2
#| fig-height: 7

colleges_levels |>
  drop_na(REGION) |>
  mutate(
    REGION = fct_relevel(REGION, "Plains", "Rocky Mountains", "Southwest", "Great Lakes", "Southeast",  "Far West", "Mid East", "New England")
  ) |>
  ggplot(aes(y = REGION, fill = fct_rev(level))) +
  geom_bar(position = "fill") +
  scale_fill_viridis_d(limits = c("1-100", "101-200", "201-300", "301-400", "401-500")) +
  labs(
    title = "Rank vs Geographic Region of the US",
    x = "Proportion",
    y = "Region",
    fill = "Rank"
  ) +
  theme(
    legend.position = "left",
    axis.text.y = element_text(size = 14)
  )

colleges_levels |>
  drop_na(ACCREDAGENCY) |>
   filter(
    ACCREDAGENCY != "EXEMPT" &
    ACCREDAGENCY != "National Association of Schools of Theatre" &
    ACCREDAGENCY != "Transnational Association of Christian Colleges and Schools"
  ) |>
  mutate(
  ACCREDAGENCY = fct_relevel(ACCREDAGENCY,"Accrediting Commission of Career Schools and Colleges", "Northwest Commission on Colleges and Universities", "Higher Learning Commission", "Southern Association of Colleges and Schools Commission on Colleges", "Middle States Commission on Higher Education", "Western Association of Schools and Colleges Senior Colleges and University Commission", "New England Commission on Higher Education")) |>
  ggplot(aes(y = ACCREDAGENCY, fill = fct_rev(level))) +
  geom_bar(position = "fill", show.legend = FALSE) +
  scale_fill_viridis_d(limits = c("1-100", "101-200", "201-300", "301-400", "401-500")) +
  labs(
    title = "Rank vs Accrediting Agency",
    x = "Proportion",
    y = "Accrediting Agency",
    fill = "Rank"
  ) +
  scale_y_discrete(labels = label_wrap(30)) +
  theme(axis.text.y = element_text(size = 14))

colleges_levels |>
  drop_na(CONTROL) |>
  mutate(
    CONTROL = fct_relevel(CONTROL, "Private, For-profit", "Public", "Private, Non-profit")
  ) |>
  ggplot(aes(y = CONTROL, fill = fct_rev(level))) +
  geom_bar(position = "fill", show.legend = FALSE) +
  scale_fill_viridis_d(limits = c("1-100", "101-200", "201-300", "301-400", "401-500")) +
  labs(
    title = "Rank vs Control",
    x = "Proportion",
    y = "Control",
    fill = "Rank"
  ) +
  theme(axis.text.y = element_text(size = 14))

colleges_levels |>
  drop_na(CCBASIC) |>
   filter(
    CCBASIC != "Special Focus Four-Year: Business & Management Schools" &
    CCBASIC != "Baccalaureate/Associate's Colleges: Mixed Baccalaureate/Associate's" &
    CCBASIC != "Special Focus Four-Year: Other Special Focus Institutions" &
    CCBASIC != "Special Focus Four-Year: Other Technology-Related Schools" &
    CCBASIC != "Tribal Colleges"
  ) |>
  mutate(
    CCBASIC = fct_relevel(CCBASIC, "Special Focus Four-Year: Other Health Professions Schools", "Master's Colleges & Universities: Medium Programs", "Master's Colleges & Universities: Larger Programs", "Doctoral/Professional Universities", "Special Focus Four-Year: Arts, Music & Design Schools", "Special Focus Four-Year: Faith-Related Institutions", "Special Focus Four-Year: Engineering Schools", "Baccalaureate Colleges: Diverse Fields", "Master's Colleges & Universities: Small Programs", "Doctoral Universities: High Research Activity", "Baccalaureate Colleges: Arts & Sciences Focus", "Doctoral Universities: Very High Research Activity")
  ) |>
  ggplot(aes(y = CCBASIC, fill = fct_rev(level))) +
  geom_bar(position = "fill", show.legend = FALSE) +
  scale_fill_viridis_d(limits = c("1-100", "101-200", "201-300", "301-400", "401-500")) +
  labs(
    title = "Rank vs Carnegie Classification",
    x = "Proportion",
    y = "Carnegie Classification (Basic)",
    fill = "Rank"
  ) +
  scale_y_discrete(labels = label_wrap(30)) +
  theme(axis.text.y = element_text(size = 14))
```

The Geographic Region graph shows that New England has the highest proportion of top-100 schools, while the Plains has the lowest. Apart from the `New England` bar, the proportions of rank groups do not vary dramatically between bars. It is possible that the association between rank and region is driven in large part by New England's concentration of schools ranked 1-100. Additionally, because accrediting agencies are often organized by location, the accrediting agency graph shows results similar to the region graph.

There are only 4 `Private, For-profit` schools in the top 500, and all of them are ranked between 301 and 400. The rank proportions for `Private, Non-profit` and `Public` schools are similar, although the former appears to have a larger proportion of `1-100` schools and the latter a larger proportion of `401-500` schools.

The Carnegie Classification graph shows the greatest differences in rank-group proportions between bars, suggesting that this variable has the strongest association with rank among the categorical variables. The lower the rank, the higher the proportion of schools in `Doctoral Universities: Very High Research Activity` and `Baccalaureate Colleges: Arts & Sciences Focus`. However, the opposite appears to be true for all other classifications with 3 or more rank categories.

**Numerical Variable Analysis**

All of the variables with the top 5 R-squared values appeared in the final stepwise regression model except for `ACTCMMID` (median ACT score) and `C150_4` (graduation rate). These two were excluded before fitting because both are highly correlated with `SAT_AVG` (average SAT). Because `SAT_AVG` remained in the final model, we can reasonably assume that the two variables it represents are also strongly associated with rank. Therefore, we conclude that `SAT_AVG`, `ACTCMMID`, `C150_4`, `AVGFACSAL`, and `ADM_RATE` have the strongest association with rank, and we characterize the relationships below.

```{r}
#| label: numerical-graphs

a1 <- colleges |>
  ggplot(aes(x = SAT_AVG, y = rank)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm") +
  scale_y_continuous(limits = c(0,500)) +
  labs(
    title = "Average SAT vs Rank",
    x = "Avg. SAT",
    y = "Rank"
  )

a2 <- colleges |>
  ggplot(aes(x = ACTCMMID, y = rank)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm") +
  scale_y_continuous(limits = c(0,500)) +
  labs(
    title = "Median ACT vs Rank",
    x = "Med. ACT",
    y = "Rank"
  )

a3 <- colleges |>
  ggplot(aes(x = C150_4, y = rank)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm") +
  scale_y_continuous(limits = c(0,500)) +
  labs(
    title = "Graduation Rate vs Rank",
    x = "Graduation Rate",
    y = "Rank"
  )

a4 <- colleges |>
  ggplot(aes(x = AVGFACSAL, y = rank)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm") +
  scale_y_continuous(limits = c(0,500)) +
  labs(
    title = "Average Faculty Salary vs Rank",
    x = "Avg. Faculty Salary",
    y = "Rank"
  )

a5 <- colleges |>
  ggplot(aes(x = ADM_RATE, y = rank)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm") +
  scale_y_continuous(limits = c(0,500)) +
  labs(
    title = "Admission Rate vs Rank",
    x = "Admission Rate",
    y = "Rank"
  )

grid.arrange(a1, a2, a3, a4, a5, ncol = 2)
```

```{r}
#| label: numerical-slope-r2
#| output: false

linear_reg() |> fit(rank ~ SAT_AVG, data = scaled_data)
linear_reg() |> fit(rank ~ SAT_AVG, data = colleges)

linear_reg() |> fit(rank ~ ACTCMMID, data = scaled_data)
linear_reg() |> fit(rank ~ ACTCMMID, data = colleges)

linear_reg() |> fit(rank ~ C150_4, data = scaled_data)
linear_reg() |> fit(rank ~ C150_4, data = colleges)

linear_reg() |> fit(rank ~ AVGFACSAL, data = scaled_data)
linear_reg() |> fit(rank ~ AVGFACSAL, data = colleges)

linear_reg() |> fit(rank ~ ADM_RATE, data = scaled_data)
linear_reg() |> fit(rank ~ ADM_RATE, data = colleges)
```

```{r}
#| label: r2-slopes-kable

final_variables_table <- r_squared_values |>
  arrange(desc(r_squared)) |>
  slice_head(n = 5) |>
  mutate(
    scaled_slope = c(-115.0, -112.9, -103.5, -98.89, 91.74),
    slope = c(-0.9277, -30.04, -735.2, -0.0347, 387.02)
  ) |>
  rename(
    "R-Squared" = r_squared,
    "Scaled Slope" = scaled_slope,
    "Non-scaled Slope" = slope
  ) |>
  mutate(
    variable = case_when(
      variable == "SAT_AVG" ~ "Avg. SAT",
      variable == "ACTCMMID" ~ "Med. ACT",
      variable == "C150_4" ~ "Graduation Rate",
      variable == "AVGFACSAL" ~ "Avg. Faculty Salary",
      variable == "ADM_RATE" ~ "Admission Rate",
      TRUE ~ variable
    )
  )

final_variables_table |> 
  kable()
```
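
The slope columns in this table are entered by hand from the model output. As a sketch only, the same three quantities could be collected in a loop. The block below uses the built-in `mtcars` data as a stand-in, with `mpg` playing the role of rank; `predictors`, `rows`, and `slope_table` are hypothetical names, and the block is not run as part of the report.

```r
# Stand-in predictors; in the report these would be the five final variables
predictors <- c("wt", "hp", "disp")

rows <- lapply(predictors, function(p) {
  # Model on the original scale of the predictor
  raw_fit <- lm(reformulate(p, response = "mpg"), data = mtcars)

  # Same model after standardizing the predictor (mean 0, sd 1)
  scaled <- mtcars
  scaled[[p]] <- as.numeric(scale(scaled[[p]]))
  scaled_fit <- lm(reformulate(p, response = "mpg"), data = scaled)

  data.frame(
    variable     = p,
    r_squared    = summary(raw_fit)$r.squared,
    scaled_slope = unname(coef(scaled_fit)[p]),
    slope        = unname(coef(raw_fit)[p])
  )
})

slope_table <- do.call(rbind, rows)
slope_table <- slope_table[order(-slope_table$r_squared), ]
print(slope_table, row.names = FALSE)
```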

Average SAT score explains 63% of the variation in college rank, and the R-squared values of the other variables can be interpreted in the same way. Looking further at the relationships, average SAT, median ACT, graduation rate, and average faculty salary each have a negative relationship with rank: as these variables increase, the numerical rank of the school decreases. Admission rate has a positive relationship with rank: as it increases, the numerical rank of a school increases.

The non-scaled slope shows that, on average, a 1-point increase in median ACT score is associated with a rank about 30 places lower. Because admission rate is recorded on a 0-1 scale, we divide its slope by 100 to interpret it per percentage point: a 1-percentage-point drop in admission rate is associated, on average, with a rank about 3.87 places lower. The other variables can be interpreted in the same manner. Scaled slopes put the variables on a common scale, so we can compare them and see which numerical variable is associated with the largest change in rank per standard deviation. Among the five most associated variables, the higher the R-squared, the higher the absolute value of the scaled slope, so the variables most strongly associated with rank are also the ones tied to the largest changes in rank. This is consistent with Niche tying its rankings to the variables that best differentiate higher and lower ranks. It is also concerning: if schools know which variables are most associated with rank and which are tied to the largest changes in it, it is fairly easy for them to know which variables to target if they wanted to manipulate the rankings.
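
One caveat on the trend between R-squared and scaled slope: in a simple linear regression on a standardized predictor, the slope equals the correlation coefficient times the standard deviation of the response, and R-squared is the squared correlation. A larger R-squared therefore implies a larger absolute scaled slope by construction. The sketch below checks this identity on simulated data; every name and number in it is hypothetical, and it is not run as part of the report.

```r
set.seed(1)
n <- 500

# Simulated predictor and response (not the report's data)
x <- rnorm(n, mean = 1200, sd = 150)
y <- 800 - 0.5 * x + rnorm(n, sd = 60)

# Regress the response on the standardized predictor
x_scaled <- as.numeric(scale(x))
fit <- lm(y ~ x_scaled)

scaled_slope <- unname(coef(fit)["x_scaled"])
r <- cor(x, y)

# Identity 1: scaled slope = r * sd(y)
identity_slope <- isTRUE(all.equal(scaled_slope, r * sd(y)))

# Identity 2: R-squared = r^2
identity_r2 <- isTRUE(all.equal(summary(fit)$r.squared, r^2))

print(c(slope_identity = identity_slope, r2_identity = identity_r2))
```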

## Discussion

Based on our analysis and approaches, `SAT_AVG`, `ACTCMMID`, `C150_4`, `AVGFACSAL`, and `ADM_RATE` are the numerical explanatory variables most associated with college ranking, which partially confirms our initial hypothesis. SAT, ACT, and admission rate were among the most correlated variables, but family income was not in the top five, possibly because financial aid allows students from various financial backgrounds to attend universities. These relationships indicate certain priorities in college rankings. The presence of the SAT, ACT, and admission rate variables in the top five highlights how rankings prioritize selectivity in college admissions. This makes us wonder whether colleges focus on improving admissions selectivity over their quality of education and student outcomes. The inclusion of graduation rate and faculty salary tells a slightly different story. While graduation rate indicates a focus on the ability of a university to meet the needs of its students, faculty salary may indicate the quality of the faculty in both teaching and research.

The stepwise regression model indicated that the categorical variables `REGION`, `ACCREDAGENCY`, `CONTROL`, and `CCBASIC` were also useful for predicting rank, but we did not include any of them in our hypothesis. Unlike the numerical variables, these variables cannot change easily from year to year, so colleges cannot easily use them to manipulate their rankings. Additionally, some variables that appear in the final model have a lower individual R-squared value than some that were removed from the model. We believe that this is because AIC evaluates the collective predictive power of the variables rather than their individual predictive ability.

The limitations of our analysis are as follows. We left the categorical variables out of the first approach because we did not know a way to numerically analyze them, and in doing so, we could not subject these variables to the same two-step confirmation process that we used for the numerical variables. Furthermore, we assumed that all variables had a linear relationship with rank so that we could use linear regression modeling to analyze them. Additionally, our linear models assume that rank is continuous and unbounded. We recognize that this is not the case, but since the rank values have meaning and we have not learned how to properly work with ranked data in this class, we decided that a linear regression model was our best approach. All of these issues could be addressed by learning and implementing more appropriate statistical methods.

Finally, we believe that future avenues for this project could include analyzing and comparing more ranking systems, such as those created by US News and Forbes. It would be useful if students could understand what each system values and use the one most in line with their priorities. Additionally, we would like to look at more than 1,000 colleges to see if our results stay consistent between colleges throughout the country, and perhaps even throughout the world.
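
On the first limitation, one possible approach (not used in this analysis) is that `lm()` accepts a factor as a predictor, so an R-squared value can be computed for a categorical variable in the same way as for a numerical one. The sketch below illustrates this on simulated data; `demo`, `region`, and the group means are all hypothetical, and the block is not run as part of the report.

```r
set.seed(42)

# Simulated data: three made-up groups with different typical ranks
demo <- data.frame(region = factor(rep(c("A", "B", "C"), each = 50)))
demo$rank <- c(100, 250, 300)[as.integer(demo$region)] + rnorm(150, sd = 80)

# lm() expands the factor into indicator variables automatically
fit <- lm(rank ~ region, data = demo)
r2 <- summary(fit)$r.squared

# Cross-check: R-squared equals between-group variation / total variation
group_means <- ave(demo$rank, demo$region)
ss_between <- sum((group_means - mean(demo$rank))^2)
ss_total <- sum((demo$rank - mean(demo$rank))^2)
r2_check <- ss_between / ss_total

print(c(r_squared = r2, between_over_total = r2_check))
```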

{{< pagebreak >}}

## Appendix

### Exploratory Analysis: Means of Rank by Categorical Variables

Below, we group the schools by the different categorical variables in our analysis and then take the mean rank for each of those groups. For Carnegie classification, any classification with only one school was removed from the analysis.

```{r}
#| label: exploration-control-region

group_by_control <- data |>
  group_by(CONTROL) |>
  summarize(
    mean_rank = mean(rank)
  ) |>
  arrange(mean_rank)


p1 <- group_by_control |>
  ggplot(aes(x = reorder(CONTROL, + mean_rank), y = mean_rank, fill = CONTROL)) +
  geom_col() +
  theme(legend.position = "none") +
  labs(
    x = "Control of School",
    y = "Mean Rank",
    title = "Mean Rank by Control"
  )


group_by_region <- data |>
  group_by(REGION) |>
  summarize(
    mean_rank = mean(rank)
  ) |>
  arrange(mean_rank)


p2 <- group_by_region |>
  drop_na(REGION) |>
  ggplot(aes(x = reorder(REGION, + mean_rank), y  = mean_rank, fill = REGION)) +
  geom_col() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(size = 7)
  ) +
  labs(
    x = "Region",
    y = "Mean Rank",
    title = "Mean Rank by Region"
  )

grid.arrange(p1, p2)
```

```{r}
#| label: exploration-CCBASIC

group_by_CCBASIC <- data |>
  group_by(CCBASIC) |>
  summarize(
    mean_rank = mean(rank)
  ) |>
  arrange(mean_rank)

group_by_CCBASIC |>
  drop_na(CCBASIC) |>
   filter(
    CCBASIC != "Special Focus Four-Year: Business & Management Schools" &
    CCBASIC != "Baccalaureate/Associate's Colleges: Mixed Baccalaureate/Associate's" &
    CCBASIC != "Special Focus Four-Year: Other Special Focus Institutions" &
    CCBASIC != "Special Focus Four-Year: Other Technology-Related Schools" &
    CCBASIC != "Tribal Colleges"
  ) |>
  ggplot(aes(y = reorder(CCBASIC, + mean_rank), x = mean_rank, fill = CCBASIC)) +
  geom_col() +
  theme(legend.position = "none") +
  labs(
    y = "Carnegie Classification",
    x = "Mean Rank",
    title = "Mean Rank by Carnegie \nClassification"
  )

```

Following the convention from the introduction, a lower mean rank is closer to #1. We observe that private non-profit colleges have a lower mean rank than public colleges or private for-profit colleges.

As far as region, schools from New England have the lowest mean rank, while schools from the Plains have the highest mean rank.

Looking at the Carnegie Classification, `Doctoral Universities: Very High Research Activity` have the lowest mean rank, followed by `Special Focus Four-Year: Engineering Schools`. `Master's Colleges & Universities: Medium Programs` have the highest mean rank.

### Full R-squared Value List

Here is the full list of R-squared values from the linear regression models between individual numerical explanatory variables and college rank.

```{r}
#| label: r-square-value-list
r_squared_values |>
  arrange(desc(r_squared)) |>
  kable()
```

{{< pagebreak >}}

## References

-   Learned how to use for loops from TA Eli Gnesin

-   https://www.statology.org/standardize-data-in-r/

-   https://www.niche.com/colleges/search/best-colleges/

-   https://collegescorecard.ed.gov/data/

-   https://www.youtube.com/watch?v=ejR8LnQziPY

-   https://stackoverflow.com/questions/57248708/stepwise-model-selection-in-an-r-tidyverse-workflow

-   https://stackoverflow.com/questions/53135404/filter-correlation-matrix-r

-   https://stackoverflow.com/questions/68093071/how-to-highlight-high-correlations-in-ggpairs-correlation-matrix

-   http://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram

-   https://www.tutorialspoint.com/how-to-deal-with-missing-values-to-calculate-correlation-matrix-in-r

-   https://www.displayr.com/how-to-create-a-correlation-matrix-in-r/

-   https://stats.stackexchange.com/questions/550537/how-to-get-r-squared-after-doing-stepwise-model-selection-in-regression-in-r

-   http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/154-stepwise-regression-essentials-in-r/

-   https://www.researchgate.net/figure/R-2-and-RMSE-of-forward-stepwise-regression-models-vs-WHO-algorithm_tbl1_354396022

-   https://www.r-bloggers.com/2016/05/visualizing-bootrapped-stepwise-regression-in-r-using-plotly/

-   https://www.tutorialspoint.com/how-to-increase-the-x-axis-labels-font-size-using-ggplot2-in-r