Merge pull request #37 from fhdsl/updating-factors

[Factors] update datasets
fhdsl · Jul 2, 2024 · 1dd33fe
2 parents fcd2a56 + cfe1dd1
commit 1dd33fe
Showing 1 changed file with 67 additions and 90 deletions.
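
Review note: the new `AGE` labels begin with digits, so their character sort order differs from the numeric order a reader expects. Below is a minimal base R sketch of that behavior, using only the six labels listed in the diff. The variable names are illustrative and do not appear in the module, and `method = "radix"` is chosen here only to make the character ordering independent of locale.

```r
# Age labels as they appear in the updated slides
age_labels <- c("0-4 years old", "5-14 years old", "15-34 years old",
                "35-64 years old", "65+ years old", "All ages")

# Character sort compares strings one character at a time,
# so "15-34" lands before "5-14" because "1" sorts before "5"
chr_sorted <- sort(age_labels, method = "radix")

# A factor sorts by its declared level order instead
age_fct <- factor(age_labels, levels = age_labels)
fct_sorted <- as.character(sort(age_fct))

print(chr_sorted)
print(fct_sorted)
```

This is why the slides convert `AGE` to a factor with explicit `levels` before plotting, arranging, or summarizing.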
diff --git a/modules/Factors/Factors.Rmd b/modules/Factors/Factors.Rmd
@@ -76,125 +76,101 @@ x_fact
 
 ## A Factor Example{.smaller}
 
-We will use data on student dropouts from the State of California during the 2016-2017 school year. More on this data can be found here: https://www.cde.ca.gov/ds/ad/filesdropouts.asp
+We will use data on heat-related visits to the ER from the State of Colorado, separated by age category, for 2011-2022. More on this data can be found here: https://coepht.colorado.gov/heat-related-illness
 
-To preserve school anonymity, "CDS_CODE" is used in place of the individual school's name.
+You can download the data from the DaSEH website here: https://daseh.org/data/CO_ER_heat_visits_by_age_data.csv
 
-You can download the data from the DaSEH website here: https://daseh.org/data/dropouts.txt
+This dataset is also available in the `dasehr` package.
 
-```{r}
-dropouts <- read_delim("https://daseh.org/data/dropouts.txt", delim = "\t")
-dropouts
-```
-
-## Preparing the data
-
-Aggregate (sum) across ethnicity and gender:
+We will limit the data to a single `GENDER` category. We will use "Both genders" because of missing data.
 
 ```{r}
-dropouts <-
-  dropouts %>%
-  group_by(CDS_CODE) %>%
-  summarize(
-    Freshman = sum(D9),
-    Sophomore = sum(D10),
-    Junior = sum(D11),
-    Senior = sum(D12)
-  )
-dropouts
-```
+library(dasehr)
+er_visits_age <- CO_heat_ER_byage
 
-## Preparing the data
+#er_visits_age <- read_csv("https://daseh.org/data/CO_ER_heat_visits_by_age_data.csv")
 
-Pivot to long format:
-
-```{r}
-dropouts <-
-  dropouts %>%
-  pivot_longer(
-    !CDS_CODE,
-    names_to = "grade",
-    values_to = "n_dropouts"
-  )
-dropouts
+er_visits_age <- er_visits_age %>% 
+  filter(str_detect(GENDER, "Both genders")) 
 ```
 
+
 ## The data
 
 ```{r}
-head(dropouts)
+head(er_visits_age)
 ```
 
-Notice that `grade` is a `chr` variable. This indicates that the values are **character** strings.
+Notice that `AGE` is a `chr` variable. This indicates that the values are **character** strings.
 
-R does not realize that there is any order related to the `grade` values. It will assume that it is **alphabetical**.
+R does not know that the `AGE` values have a meaningful order. It will sort them **alphabetically**, comparing one character at a time, so "15-34 years old" comes before "5-14 years old".
 
-However, we know that the order is: **freshman**, **sophomore**, **junior**, **senior**.
+However, we know that the order is: **0-4 years old**, **5-14 years old**, **15-34 years old**, **35-64 years old**, **65+ years old**, and **All ages**.
 
 ## For the next steps, let's take a subset of data.
 
 Use `set.seed()` to take the same random sample each time.
 
 ```{r}
 set.seed(123)
-dropouts_subset <- slice_sample(dropouts, n = 32)
+er_visits_age_subset <- slice_sample(er_visits_age, n = 32)
 ```
 
 ## Plot the data
 
 Let's make a plot first.
 
-```{r, fig.height= 3}
-dropouts_subset %>%
-  ggplot(mapping = aes(x = grade, y = n_dropouts)) +
+```{r, fig.height= 3, warning = FALSE}
+er_visits_age_subset %>%
+  ggplot(mapping = aes(x = AGE, y = RATE)) +
   geom_boxplot() +
-  theme_bw(base_size = 16) # make all labels size 16
+  theme_bw(base_size = 8) # make all labels size 8
 ```
 
 OK this is very useful, but it is a bit difficult to read. We expect the values to be plotted by the order that we know, not by alphabetical order.
 
 ## Change to factor
 
-Currently `grade` is class `character` but let's change that to class `factor` which allows us to specify the levels or order of the values.
+Currently `AGE` is class `character` but let's change that to class `factor` which allows us to specify the levels or order of the values.
 
 ```{r}
-dropouts_fct <-
-  dropouts_subset %>%
-  mutate(grade = factor(grade,
-    levels = c("Freshman", "Sophomore", "Junior", "Senior")
+er_visits_age_fct <-
+  er_visits_age_subset %>%
+  mutate(AGE = factor(AGE,
+    levels = c("0-4 years old", "5-14 years old", "15-34 years old", "35-64 years old", "65+ years old", "All ages")
   ))
 
-dropouts_fct %>%
-  pull(grade) %>%
+er_visits_age_fct %>%
+  pull(AGE) %>%
   levels()
 ```
 
 ## Change to a factor
 
 ```{r}
-head(dropouts_fct)
+head(er_visits_age_fct)
 ```
 
 ## Plot again
 
 Now let's make our plot again:
 
-```{r, fig.height= 3}
-dropouts_fct %>%
-  ggplot(mapping = aes(x = grade, y = n_dropouts)) +
+```{r, fig.height= 3, warning = FALSE}
+er_visits_age_fct %>%
+  ggplot(mapping = aes(x = AGE, y = RATE)) +
   geom_boxplot() +
-  theme_bw(base_size = 16)
+  theme_bw(base_size = 8)
 ```
 
 Now that's more like it! Notice how the data is automatically plotted in the order we would like.
 
 ## What about if we `arrange()` the data by grade ?{.smaller}
 
-Character data is arranged alphabetically.
+Character data is arranged alphabetically. Labels that start with digits are compared one character at a time, so "15-34 years old" sorts before "5-14 years old".
 
 ```{r}
-dropouts_subset %>%
-  arrange(grade)
+er_visits_age_subset %>%
+  arrange(AGE)
 ```
 
 Notice that the order is not what we would hope for!
@@ -204,46 +180,46 @@ Notice that the order is not what we would hope for!
 Factor data is arranged by level.
 
 ```{r}
-dropouts_fct %>%
-  arrange(grade)
+er_visits_age_fct %>%
+  arrange(AGE)
 ```
 
 Nice! Now this is what we would want!
 
 ## Making tables with characters
 
-Tables grouped by a character are arranged alphabetically.
+Tables grouped by a character variable are arranged alphabetically, with labels that start with digits compared one character at a time.
 
 ```{r}
-dropouts_subset %>%
-  group_by(grade) %>%
-  summarize(total_dropouts = sum(n_dropouts))
+er_visits_age_subset %>%
+  group_by(AGE) %>%
+  summarize(total_visits = sum(VISITS, na.rm = TRUE))
 ```
 
 ## Making tables with factors
 
 Tables grouped by a factor are arranged by level.
 
 ```{r}
-dropouts_fct %>%
-  group_by(grade) %>%
-  summarize(total_dropouts = sum(n_dropouts))
+er_visits_age_fct %>%
+  group_by(AGE) %>%
+  summarize(total_visits = sum(VISITS, na.rm = TRUE))
 ```
 
 ## `forcats` for ordering{.smaller}
 
-What if we wanted to order `grade` by increasing `n_dropouts`?
+What if we wanted to order `AGE` by increasing `RATE`?
 
-```{r, fig.height= 3}
+```{r, fig.height= 3, warning=FALSE}
 library(forcats)
 
-dropouts_fct %>%
-  ggplot(mapping = aes(x = grade, y = n_dropouts)) +
+er_visits_age_fct %>%
+  ggplot(mapping = aes(x = AGE, y = RATE)) +
   geom_boxplot() +
-  theme_bw(base_size = 16)
+  theme_bw(base_size = 8)
 ```
 
-This would be useful for identifying easily which grade to focus on.
+This would be useful for identifying easily which age group to focus on.
 
 ## forcats for ordering{.smaller}
 
@@ -259,46 +235,47 @@ fct_reorder({column getting changed}, {guiding column}, {summarizing function})
 
 We can order a factor by another variable by using the `fct_reorder()` function of the `forcats` package.
 
-```{r, fig.height= 3}
+```{r, fig.height= 3, warning = FALSE}
 library(forcats)
 
-dropouts_fct %>%
-  ggplot(mapping = aes(x = fct_reorder(grade, n_dropouts, mean), y = n_dropouts)) +
+er_visits_age_fct %>%
+  ggplot(mapping = aes(x = fct_reorder(AGE, RATE, mean), y = RATE)) +
   geom_boxplot() +
-  labs(x = "Student Grade") +
-  theme_bw(base_size = 16)
+  labs(x = "Age Category") +
+  theme_bw(base_size = 8)
 ```
 
 ## forcats for ordering.. with `.desc = ` argument{.smaller}
 
-```{r, fig.height= 3}
+```{r, fig.height= 3, warning = FALSE}
 library(forcats)
 
-dropouts_fct %>%
-  ggplot(mapping = aes(x = fct_reorder(grade, n_dropouts, mean, .desc = TRUE), y = n_dropouts)) +
+er_visits_age_fct %>%
+  ggplot(mapping = aes(x = fct_reorder(AGE, RATE, mean, .desc = TRUE), y = RATE)) +
   geom_boxplot() +
-  labs(x = "Student Grade") +
-  theme_bw(base_size = 16)
+  labs(x = "Age Category") +
+  theme_bw(base_size = 8)
 ```
 
 ## forcats for ordering.. can be used to sort datasets
 
-```{r, fig.height= 3}
-dropouts_fct %>% pull(grade) %>% levels() # By year order
-dropouts_fct <- dropouts_fct %>%
+```{r, fig.height= 3, warning=FALSE}
+er_visits_age_fct %>% pull(AGE) %>% levels() # by age order
+er_visits_age_fct <- er_visits_age_fct %>%
   mutate(
-    grade = fct_reorder(grade, n_dropouts, mean)
+    AGE = fct_reorder(AGE, RATE, mean)
   )
-dropouts_fct %>% pull(grade) %>% levels() # by increasing mean dropouts
+
+er_visits_age_fct %>% pull(AGE) %>% levels() # by increasing mean rate
 ```
 
 ## Checking Proportions with `fct_count()`
 
 The `fct_count()` function of the `forcats` package is helpful for checking that the proportions of each level for a factor are similar. Need the `prop = TRUE` argument otherwise just counts are reported.
 
 ```{r}
-dropouts_fct %>%
-  pull(grade) %>%
+er_visits_age_fct %>%
+  pull(AGE) %>%
   fct_count(prop = TRUE)
 ```