# Principal Component Analysis (PCA)
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE, message = FALSE, warning = FALSE)
```
____________________________________
### Outline Chapter 7 - Principal Component Analysis:
1. 10th grade student demographics & school safety (ELS, 2002 public-use data)
**NOTE:** The syntax is modeled after Allison Horst's PCA lab at UCSB (ESM-206; Horst, 2020), applied here to an education data example.
____________________________________
DATA SOURCE: This lab exercise utilizes the NCES public-use dataset: Education Longitudinal Study of 2002 (Lauff & Ingels, 2014) [$\color{blue}{\text{See website: nces.ed.gov}}$](https://nces.ed.gov/surveys/els2002/avail_data.asp)
____________________________________
### load packages
```{r, eval=TRUE}
library(FactoMineR)
library(factoextra)
library(skimr)
library(naniar)
library(ggfortify)
library(janitor)
library(tidyverse)
library(here)
```
### read in ELS-2002 lab data:
```{r}
lab_data <- read_csv("https://garberadamc.github.io/project-site/data/els_sub4.csv")
```
### make all column names "lower_snake_case" style
```{r}
lab_tidy <- lab_data %>%
  clean_names()
```
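To see what `clean_names()` does to column names, here is a minimal sketch on a made-up tibble. The `toy` object and its column names are hypothetical and are not taken from the ELS file.

```{r}
library(janitor)
library(tibble)

# Hypothetical column names, for illustration only (not from the ELS file)
toy <- tibble(`STU_ID` = 1:2, `BYSTEXP` = c(3, 5), `Free Lunch` = c(0, 1))

# Convert every column name to lower_snake_case
toy_clean <- clean_names(toy)
names(toy_clean)
```

Upper-case names and names with spaces should both come back in lower snake case, which makes the later `select()` and `rename()` calls easier to type.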
### Prepare data for PCA
```{r}
# remove variables that don't make sense in a PCA
lab_sub1 <- lab_tidy %>%
  select(-stu_id,   # these are random numbers
         -sch_id,
         -byrace,   # nominal (non-ordered variable)
         -byparace, # nominal (non-ordered variable)
         -byparlng, # nominal (non-ordered variable)
         -byfcomp,  # nominal (non-ordered variable)
         -bypared, -bymothed, -byfathed,
         -bysctrl, -byurban, -byregion)
# select columns and rename variables to have descriptive names
lab_sub2 <- lab_sub1 %>%
  select(1:9,
         bys20a, bys20h, bys20j, bys20k, bys20m, bys20n,
         bys21b, bys21d, bys22a, bys22b, bys22c, bys22d,
         bys22e, bys22g, bys22h, bys24a, bys24b) %>%
  rename("stu_exp"  = "bystexp",
         "par_asp"  = "byparasp",
         "mth_read" = "bytxcstd",
         "mth_test" = "bytxmstd",
         "rd_test"  = "bytxrstd",
         "freelnch" = "by10flp",
         "stu_tch"  = "bys20a",
         "putdownt" = "bys20h",
         "safe"     = "bys20j",
         "disrupt"  = "bys20k",
         "gangs"    = "bys20m",
         "rac_fght" = "bys20n",
         "fair"     = "bys21b",
         "strict"   = "bys21d",
         "stolen"   = "bys22a",
         "drugs"    = "bys22b",
         "t_hurt"   = "bys22c",
         "p_fight"  = "bys22d",
         "hit"      = "bys22e",
         "damaged"  = "bys22g",
         "bullied"  = "bys22h",
         "late"     = "bys24a",
         "skipped"  = "bys24b")
```
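Because `prcomp()` expects numeric input, it can help to confirm that every retained column is numeric before going further. The following is a minimal sketch with a toy data frame standing in for `lab_sub2`; the `toy` object and its values are invented for illustration.

```{r}
# Toy stand-in for lab_sub2; values are invented for illustration
toy <- data.frame(stu_exp = c(5, 6, 7),
                  safe    = c(1, 2, 4),
                  byrace  = c("a", "b", "c"))  # a nominal column left in by mistake

# One TRUE/FALSE per column: is the column numeric?
is_num <- vapply(toy, is.numeric, logical(1))
is_num

# Names of columns that would need to be dropped or recoded before PCA
names(toy)[!is_num]
```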
### Investigate missingness {`naniar`} & make data summary with {`skimr`}
```{r}
# Plot number of missings by variable
gg_miss_var(lab_sub2)
# Look at summary of data using skimr::skim()
skim(lab_sub2)
# Keep complete cases only, because prcomp() does not permit NA values
pca1 <- lab_sub2 %>%
  drop_na()
```
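`drop_na()` performs listwise deletion, so a row is removed if any one of its variables is missing. A minimal sketch on made-up data (the `toy` data frame is hypothetical) shows one way to check how many rows survive:

```{r}
library(tidyr)

# Made-up data with scattered missing values
toy <- data.frame(safe    = c(1, NA, 3, 4),
                  gangs   = c(2, 2, NA, 1),
                  bullied = c(0, 1, 1, 0))

# Listwise deletion: keep only rows with no missing values
toy_complete <- drop_na(toy)

nrow(toy)                  # rows before listwise deletion
nrow(toy_complete)         # rows with no missing values
mean(complete.cases(toy))  # proportion of rows retained
```

Running the same comparison on `lab_sub2` and `pca1` would show how much of the sample the PCA is actually based on.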
### run PCA with `prcomp()` (function does not permit NA values)
```{r, eval = FALSE}
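pca_out1 <- prcomp(pca1, scale. = TRUE) # standardize each variable to unit variance
plot(pca_out1)                          # scree plot of the component variances
# summary(pca_out1)                     # proportion of variance explained by each component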
```
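The standard deviations stored in a `prcomp` object can be squared to get the variance of each component, which is what `plot()` displays. The sketch below uses R's built-in `USArrests` data as a stand-in for `pca1` (it is not the ELS data), and the object names are illustrative.

```{r}
# Built-in data used as a stand-in for pca1
toy_pca <- prcomp(USArrests, scale. = TRUE)

# Variance of each principal component (eigenvalues of the correlation matrix)
eigenvalues <- toy_pca$sdev^2

# Proportion and cumulative proportion of variance explained
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

round(data.frame(eigenvalues, prop_var, cum_var), 3)

# Loadings (variable weights) on the first two components
toy_pca$rotation[, 1:2]
```

With standardized variables the eigenvalues sum to the number of variables, so a component with an eigenvalue above 1 accounts for more variance than any single original variable.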
### plot PCA biplot
```{r}
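jpeg(here("figures", "biplot_pca1.jpg"), res = 100) # open a JPEG device to save the biplot
my_biplot <- autoplot(pca_out1,
                      colour = NA,                     # hide the individual observations
                      loadings.label = TRUE,           # label the variable loadings
                      loadings.label.size = 3,
                      loadings.label.colour = "black",
                      loadings.label.repel = TRUE) +   # keep labels from overlapping
  theme_minimal()
my_biplot
dev.off() # close the device so the file is written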
```
```{r}
my_biplot
```
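Since `autoplot()` returns a ggplot object, `ggsave()` from {`ggplot2`} is another way to write the figure to disk without opening and closing a graphics device by hand. This is a sketch of that alternative, not the approach used above; the toy plot, the temporary output path, and the width, height, and dpi values are all assumptions chosen for illustration.

```{r}
library(ggplot2)

# Toy plot standing in for my_biplot
toy_plot <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme_minimal()

# Hypothetical output path; a temporary directory is used so nothing is overwritten
out_file <- file.path(tempdir(), "toy_biplot.jpg")

# The file type is taken from the extension; size is given in inches
ggsave(out_file, plot = toy_plot, width = 6, height = 4, dpi = 100)
file.exists(out_file)
```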
### alternative function to run & plot PCA biplot
```{r}
PCA(pca1, scale.unit = TRUE, ncp = 20, graph = TRUE)
```
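The object returned by `PCA()` can also be stored and inspected with helper functions from {`factoextra`}. The sketch below again uses the built-in `USArrests` data as a stand-in for `pca1`; the name `toy_res` and the choice of `ncp = 4` are illustrative assumptions.

```{r}
library(FactoMineR)
library(factoextra)

# Built-in data used as a stand-in for pca1; graph = FALSE suppresses the default plots
toy_res <- PCA(USArrests, scale.unit = TRUE, ncp = 4, graph = FALSE)

# Eigenvalues with percentage and cumulative percentage of variance
toy_eig <- get_eigenvalue(toy_res)
toy_eig

# Scree plot and variable (loadings) plot drawn by factoextra
fviz_eig(toy_res)
fviz_pca_var(toy_res, repel = TRUE)
```

Storing the result rather than only plotting it makes it possible to compare the eigenvalues with those obtained from `prcomp()`.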
## References
Hallquist, M. N., & Wiley, J. F. (2018). MplusAutomation: An R package for facilitating large-scale latent variable analyses in Mplus. Structural Equation Modeling: A Multidisciplinary Journal, 25(4), 621-638.
Horst, A. (2020). Course & workshop materials. GitHub repositories. https://allisonhorst.github.io/
Muthén, L. K., & Muthén, B. O. (1998-2017). Mplus user’s guide (8th ed.). Los Angeles, CA: Muthén & Muthén.
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686