Merge pull request #112 from fhdsl/updating-statistics

[Statistics] Changed the examples in the statistics lecture
fhdsl · Jul 17, 2024 · 49ef856 · 49ef856
2 parents 1d0fe2a + 50189f2
commit 49ef856
Show file tree

Hide file tree

Showing 6 changed files with 341 additions and 368 deletions.
diff --git a/modules/Data_Output/lab/Data_Output_Lab.Rmd b/modules/Data_Output/lab/Data_Output_Lab.Rmd
@@ -29,7 +29,7 @@ library(readr)
 
 ### 1.2
 
-Filter the dataset so that the "reporting_jurisdiction" column is equal to "Maryland" (aka the entire USA). Store the modified dataset as `covid_filtered`.
+Filter the dataset so that the "reporting_jurisdiction" column is equal to "Maryland". Store the modified dataset as `covid_filtered`.
 
 ```
 # General format

diff --git a/modules/Data_Output/lab/Data_Output_Lab_Key.Rmd b/modules/Data_Output/lab/Data_Output_Lab_Key.Rmd
@@ -30,7 +30,7 @@ covid <- read_csv(file = "https://daseh.org/data/SARS-CoV-2_Wastewater_Data.csv"
 
 ### 1.2
 
-Filter the dataset so that the "reporting_jurisdiction" column is equal to "Maryland" (aka the entire USA). Store the modified dataset as `covid_filtered`.
+Filter the dataset so that the "reporting_jurisdiction" column is equal to "Maryland". Store the modified dataset as `covid_filtered`.
 
 ```
 # General format

diff --git a/modules/Data_Output/lab/Data_Output_Lab_Key.html b/modules/Data_Output/lab/Data_Output_Lab_Key.html
@@ -366,8 +366,18 @@ <h3>1.1</h3>
 <pre><code># General format
 library(readr)
 # OBJECT &lt;- read_csv(FILE)</code></pre>
-<pre class="r"><code>library(tidyverse)
-covid &lt;- read_csv(file = &quot;https://daseh.org/data/SARS-CoV-2_Wastewater_Data.csv&quot;)</code></pre>
+<pre class="r"><code>library(tidyverse)</code></pre>
+<pre><code>## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
+## ✔ dplyr     1.1.4     ✔ readr     2.1.5
+## ✔ forcats   1.0.0     ✔ stringr   1.5.1
+## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
+## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
+## ✔ purrr     1.0.2     
+## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+## ✖ dplyr::filter() masks stats::filter()
+## ✖ dplyr::lag()    masks stats::lag()
+## ℹ Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors</code></pre>
+<pre class="r"><code>covid &lt;- read_csv(file = &quot;https://daseh.org/data/SARS-CoV-2_Wastewater_Data.csv&quot;)</code></pre>
 <pre><code>## Warning: One or more parsing issues, call `problems()` on your data frame for details,
 ## e.g.:
 ##   dat &lt;- vroom(...)
@@ -383,7 +393,7 @@ <h3>1.1</h3>
 <div id="section-1" class="section level3">
 <h3>1.2</h3>
 <p>Filter the dataset so that the “reporting_jurisdiction” column is
-equal to “Maryland” (aka the entire USA). Store the modified dataset as
+equal to “Maryland”. Store the modified dataset as
 <code>covid_filtered</code>.</p>
 <pre><code># General format
 NEW_OBJECT &lt;- OBJECT %&gt;% filter(COLUMNNAME == CRITERIA)</code></pre>

diff --git a/modules/Statistics/Statistics.Rmd b/modules/Statistics/Statistics.Rmd
@@ -20,7 +20,7 @@ library(dplyr)
 options(scipen = 999)
 library(readr)
 library(ggplot2)
-library(emo)
+library(tidyverse)
 ```
 
 ## Summary
@@ -59,10 +59,10 @@ knitr::include_graphics("https://c.tenor.com/O3x8ywLm380AAAAd/chevy-chase.gif")
 ## Overview
 
 ::: {style="color: red;"}
-`r emo::ji("rotating_light")` We will focus on how to use R software to do these. We will be glossing over the statistical **theory** and "formulas" for these tests. Moreover, we do not claim the data we use for demonstration meet **assumptions** of the methods. `r emo::ji("rotating_light")`
+We will focus on how to use R software to do these. We will be glossing over the statistical **theory** and "formulas" for these tests. Moreover, we do not claim the data we use for demonstration meet **assumptions** of the methods. 
 :::
 
-There are plenty of resources online for learning more about these methods, as well as dedicated Biostatistics series (at different advancement levels) at the JHU School of Public Health.
+There are plenty of resources online for learning more about these methods.
 
 Check out [www.opencasestudies.org](https://www.opencasestudies.org/) for deeper dives on some of the concepts covered here and the [resource page](https://daseh.org/resources.html) for more resources.
 
@@ -129,44 +129,47 @@ Like other functions, if there are `NA`s, you get `NA` as the result.  But if yo
 
 ```{r}
 # x and y must be numeric vectors
-x <- pull(yearly_co2_emissions, `1989`)
 
-y <- pull(yearly_co2_emissions, `2014`)
+co2long <- yearly_co2_emissions %>% pivot_longer(-country)
+co2wide <-co2long %>% pivot_wider(names_from = country, values_from = value)
+
+US <- co2wide %>% pull(`United States`)
+UK <- co2wide %>% pull(`United Kingdom`)
 ```
 
 ```{r,fig.width=3, fig.height=3}
 # have to specify which data on each axis
 # can accomodate missing data
-plot(x, y)
+plot(US, UK)
 ```
 
 ## Correlation coefficient calculation and test
 
 ```{r}
 library(broom)
-cor(x, y)
-cor(x, y, use = "complete.obs")
-cor.test(x, y)
+cor(US, UK)
+cor(US, UK, use = "complete.obs")
+cor.test(US, UK)
 ```
 
 ## Broom package
 
 The `broom` package helps make stats results look tidy
 
 ```{r}
-cor_result <- tidy(cor.test(x, y))
+cor_result <- tidy(cor.test(US, UK))
 glimpse(cor_result)
 ```
 
 ## Correlation for two vectors with plot
 
 In plot form... `geom_smooth()` and `annotate()` can help.
 
-```{r}
+```{r, warning = F}
 corr_value <- pull(cor_result, estimate) %>% round(digits = 4)
 cor_label <- paste0("R = ", corr_value)
-yearly_co2_emissions %>%
-  ggplot(aes(x = `1989`, y = `2014`)) +
+co2wide %>%
+  ggplot(aes(x = US, y = UK)) +
   geom_point(size = 0.3) +
   geom_smooth() +
   annotate("text", x = 2000, y = 7500, label = cor_label)
@@ -195,7 +198,8 @@ We can compute correlation for all pairs of columns of a data frame / matrix. Th
 Columns must be all numeric!
 
 ```{r}
-co2_subset <- yearly_co2_emissions %>% select(c(`1909`, `1929`, `1949`, `1969`, `1989`, `2009`))
+co2_subset <- co2wide %>% select(c(`United States`, `United Kingdom`, Canada))
+
 head(co2_subset)
 ```
 
@@ -253,14 +257,13 @@ It tests the mean of a variable in one group. By default (i.e., without us expli
 - returns result assuming confidence level 0.95 (`conf.level = 0.95`)
 - omits `NA` values in data
 
-Let's look at the dataset of haloacetic acid levels in public water sources in Washington, saved as `haa5` in the `dasehr` package.
+Let's look at the UK and US CO2 emissions data again. 
 
 ```{r}
 
-x <- haa5 %>% pull(perc_pop_exposed_to_exceedances)
-sum(is.na(x)) # count NAs in x
+sum(is.na(US)) # count NAs in x
 
-t.test(x)
+t.test(US)
 ```
 
 ## Running two-sample t-test {.small}
@@ -279,28 +282,25 @@ Check out this this [case study](https://www.opencasestudies.org/ocs-bp-rural-an
 ## Running two-sample t-test in R
 
 ```{r}
-x <- haa5 %>% pull(pop_on_sampled_PWS)
-y <- haa5 %>% pull(pop_exposed_to_exceedances)
+sum(is.na(US))
+sum(is.na(UK)) # count NAs in x and y
 
-sum(is.na(x))
-sum(is.na(y)) # count NAs in x and y
-
-t.test(x, y)
+t.test(US, UK)
 ```
 
 ## T-test: retrieving information from the result with `broom` package
 
 The `broom` package has a `tidy()` function that can organize results into a data frame so that they are easily manipulated (or nicely printed)
 
 ```{r broom, comment=""}
-result <- t.test(x, y)
+result <- t.test(US, UK)
 result_tidy <- tidy(result)
 glimpse(result_tidy)
 ```
 
 ## P-value adjustment {.smaller}
 
-`r emo::ji("rotating_light")` You run an increased risk of Type I errors (a "false positive") when multiple hypotheses are tested simultaneously. `r emo::ji("rotating_light")`
+You run an increased risk of Type I errors (a "false positive") when multiple hypotheses are tested simultaneously. 
 
 Use the `p.adjust()` function on a vector of p values. Use `method = ` to specify the adjustment method:
 
@@ -467,22 +467,17 @@ For example, if we want to fit a regression model where outcome is `income` and
 
 ## Linear regression
 
-We will use our dataset about nitrate levels by quarter in public water sources in Washington. We'll load a slightly different version of this dataset, which can be found at
-https://daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_v2_data.csv.
-
-
-```{r}
-nitrate <- read_csv(file = "https://daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_v2_data.csv", show_col_types = F)
+We will use our the calenviroscreen dataset from the `dasehr` package to examine how traffic estimates predict diesel particulate emissions.
 
-head(nitrate)
-```
 
 ## Linear regression: model fitting
 
-The upper limit of acceptable nitrates in water is 10 ug/L. We fit linear regression model with the number of people on public water systems with more than the limit of nitrates (`pop_exposed_to_exceedances`) as an outcome and the number of people exposed to 3-5 ug/L of nitrates (`pop_>3-5ug/L`) as a predictor. In other words, we are evaluating if the number of people exposed to low levels of nitrates in their water is predictive of the number of people exposed to excess nitrates in their water.
+For this model, we will use two variables:
+**DieselPM** - estimated diesel particulate emissions from on-road and non-road sources
+**TrafficPctl** - percentile ranking of traffic density
 
 ```{r}
-fit <- glm(pop_exposed_to_exceedances ~ `pop_>3-5ug/L`, data = nitrate)
+fit <- glm(DieselPM ~ TrafficPctl, data = calenviroscreen)
 fit
 ```
 
@@ -496,18 +491,18 @@ summary(fit)
 
 ## tidy results
 The broom package can help us here too!
-The estimate is the coefficient or slope - for one change in hours worked (1 hour increase), we see 1.58 more visits. The error for this estimate is relatively small at 0.167. This relationship appears to be significant with a small p value <0.001.
+The estimate is the coefficient or slope - for one change in the traffic percentile, we see 0.003637 more Diesel particulate emissions. The error for this estimate is relatively small at 0.00009. This relationship appears to be significant with a small p value < 2e-16.
 
 ```{r}
 tidy(fit) %>% glimpse()
 ```
 
 ## Linear regression: multiple predictors {.smaller}
 
-Let's try adding another explanatory variable to our model, year (`year`). The tidy function will not work with this unfortunately. The meaning of coefficients is more complicated here.
+Let's try adding another explanatory variable to our model, amount of daily Ozone concentration (`Ozone`). Ozone is usually inversely related to particulate measures. The tidy function will not work with this unfortunately. The meaning of coefficients is more complicated here.
 
 ```{r}
-fit2 <- glm(pop_exposed_to_exceedances ~ `pop_>3-5ug/L` + year, data = nitrate)
+fit2 <- glm(DieselPM ~ TrafficPctl + Ozone, data = calenviroscreen)
 summary(fit2)
 ```
 
@@ -525,10 +520,15 @@ fit2 %>%
 
 Factors get special treatment in regression models - lowest level of the factor is the comparison group, and all other factors are **relative** to its values.
 
-`quarter` takes values Q1, Q2, Q3, and Q4 to indicate the quarter of the year for each observation.
+Let's create a variable that tells us whether a census tract has a high, middle, or low percentage of the population below the poverty line.
 
 ```{r}
-nitrate %>% count(quarter)
+calenviroscreen <- calenviroscreen %>% mutate(PovertyPctl_level = case_when(
+                    PovertyPctl > 0.75 ~ "high",
+                    PovertyPctl >0.25 & PovertyPctl<=0.75 ~ "middle",
+                    PovertyPctl <=0.25 ~ "low",
+                    TRUE ~ "unknown"
+))
 ```
 
 
@@ -539,7 +539,7 @@ The comparison group that is not listed is treated as intercept. All other estim
 
 ```{r regressbaseline, comment="", fig.height=4,fig.width=8}
 
-fit3 <- glm(pop_exposed_to_exceedances ~ `pop_>3-5ug/L` + year + factor(quarter), data = nitrate)
+fit3 <- glm(DieselPM ~ TrafficPctl + Ozone + factor(PovertyPctl_level), data = calenviroscreen)
 summary(fit3)
 ```
 
@@ -548,13 +548,13 @@ summary(fit3)
 Relative to the level is not listed.
 
 ```{r}
-nitrate <- nitrate %>% mutate(quarter = factor(quarter,
+calenviroscreen <- calenviroscreen %>% mutate(PovertyPctl_level = factor(PovertyPctl_level,
   levels =
     c(
-      "Q1", "Q2", "Q3", "Q4"
+      "low", "middle", "high", "unknown"
     )
 ))
-fit4 <- glm(pop_exposed_to_exceedances ~ `pop_>3-5ug/L` + year + quarter, data = nitrate)
+fit4 <- glm(DieselPM ~ TrafficPctl + Ozone + PovertyPctl_level, data = calenviroscreen)
 summary(fit4)
 ```
 
@@ -563,7 +563,7 @@ summary(fit4)
 You can view estimates for the comparison group by removing the intercept in the GLM formula `y ~ x - 1`. *Caveat* is that the p-values change.
 
 ```{r regress9, comment="", fig.height=4, fig.width=8}
-fit5 <- glm(pop_exposed_to_exceedances ~ `pop_>3-5ug/L` + year + quarter - 1, data = nitrate)
+fit5 <- glm(DieselPM ~ TrafficPctl + Ozone + PovertyPctl_level - 1, data = calenviroscreen)
 summary(fit5)
 ```
 
@@ -572,7 +572,7 @@ summary(fit5)
 You can also specify interactions between variables in a formula `y ~ x1 + x2 + x1 * x2`. This allows for not only the intercepts between factors to differ, but also the slopes with regard to the interacting variable.
 
 ```{r fig.height=4, fig.width=8}
-fit6 <- glm(pop_exposed_to_exceedances ~ `pop_>3-5ug/L` + year + quarter + year * quarter, data = nitrate)
+fit6 <- glm(DieselPM ~ TrafficPctl + Ozone + PovertyPctl_level + TrafficPctl*PovertyPctl_level, data = calenviroscreen) 
 tidy(fit6)
 ```
 
@@ -581,11 +581,13 @@ tidy(fit6)
 By default, `ggplot` with a factor added as a color will look include the interaction term. Notice the different intercept and slope of the lines.
 
 ```{r fig.height=3.5, fig.width=7, warning=FALSE}
-ggplot(nitrate, aes(x = `pop_>3-5ug/L`, y = pop_exposed_to_exceedances, color = quarter)) +
-  geom_point(size = 1, alpha = 0.8) +
+ggplot(calenviroscreen, aes(x = DieselPM, y = TrafficPctl, color = PovertyPctl_level)) +
+  geom_point(size = 1, alpha = 0.1) +
   geom_smooth(method = "glm", se = FALSE) +
   scale_color_manual(values = c("black", "grey45", "grey65", "grey85")) +
-  theme_classic()
+  theme_classic() +
+  ylim(0,100) +
+  xlim(0, 3)
 ```
 
 ## Generalized linear models (GLMs)
@@ -606,31 +608,31 @@ See `?family` documentation for details of family functions.
 
 ## Logistic regression {.smaller}
 
-Let's look at a logistic regression example. We'll use the `nitrate` dataset again, but first we'll need to create a binary variable that tells us whether the percentage of the population exposed to exceedances `perc_pop_exposed_to_exceedances` is greater than 0.1%.
+Let's look at a logistic regression example. We'll use the `calenviroscreen` dataset again. We will create a new binary variable based on the DieselPM percentile variable, so we can tell whether a census tract has high or low DieselPM emissions compared to the others.
 
 ```{r}
-nitrate <-
-  nitrate %>%
+calenviroscreen <-
+  calenviroscreen %>%
     mutate(
-      PercExpMore0.1 = case_when
-      (perc_pop_exposed_to_exceedances > 0.1 ~ 1,
-        perc_pop_exposed_to_exceedances <= 0.1 ~ 0))
+      DieselPM_level = case_when
+      (DieselPMPctl > 0.75 ~ 1,
+        DieselPMPctl <= 0.75 ~ 0))
 ```
 
 
 ## Logistic regression {.smaller}
 
-Now that we've created the `PercExpMore0.1` variable (where a `1` indicates the percentage of the population exposed to excessive nitrates is greater than 0.1%), we can run a logistic regression.
+Now that we've created the `DieselPM_level` variable (where a `1` indicates the census tract is one of the top 75% when it comes to dieselPM emissions), we can run a logistic regression.
 
-Let's explore how `quarter` and `year` might predict `PercExpMore0.1`.
+Let's explore how `PovertyPctl_level` might predict `DieselPM_level`.
 
 ```
 # General format
 glm(y ~ x, data = DATASET_NAME, family = binomial(link = "logit"))
 ```
 
 ```{r regress7, comment="", fig.height=4,fig.width=8}
-binom_fit <- glm(PercExpMore0.1 ~ year + quarter, data = nitrate, family = binomial(link = "logit"))
+binom_fit <- glm(DieselPM_level ~ PovertyPctl_level, data = calenviroscreen, family = binomial(link = "logit"))
 summary(binom_fit)
 ```
 
@@ -645,12 +647,20 @@ Check out [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/).
 
 Use `oddsratio(x, y)` from the `epitools()` package to calculate odds ratios.
 
-In this case, we're calculating the odds ratio for whether the the first 6 months or last 6 months predicts whether the percentage of population exposed to excess nitrates is more than 0.1%
+In this case, we're calculating the odds ratio for whether living in a high traffic area predicts high levels of DieselPM.
 
 ```{r}
 library(epitools)
-response <- nitrate %>% pull(PercExpMore0.1)
-predictor <- nitrate %>% pull(half_of_year)
+
+calenviroscreen <-
+  calenviroscreen %>%
+    mutate(
+      Traffic_level = case_when
+      (TrafficPctl > 0.75 ~ 1,
+        TrafficPctl <= 0.75 ~ 0))
+
+response <- calenviroscreen %>% pull(DieselPM_level)
+predictor <- calenviroscreen %>% pull(Traffic_level)
 oddsratio(predictor, response)
 ```
-Original file line number
+Diff line change
@@ Expand Up / @@ -29,7 +29,7 @@ library(readr) @@
     ### 1.2
-    Filter the dataset so that the "reporting_jurisdiction" column is equal to "Maryland" (aka the entire USA). Store the modified dataset as `covid_filtered`.
+    Filter the dataset so that the "reporting_jurisdiction" column is equal to "Maryland". Store the modified dataset as `covid_filtered`.
     ```
     # General format
@@ Expand Down @@