Skip to content

Commit

Permalink
Merge pull request #112 from fhdsl/updating-statistics
Browse files Browse the repository at this point in the history
[Statistics] Changed the examples in the statistics lecture
  • Loading branch information
avahoffman authored Jul 17, 2024
2 parents 1d0fe2a + 50189f2 commit 49ef856
Show file tree
Hide file tree
Showing 6 changed files with 341 additions and 368 deletions.
2 changes: 1 addition & 1 deletion modules/Data_Output/lab/Data_Output_Lab.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ library(readr)

### 1.2

Filter the dataset so that the "reporting_jurisdiction" column is equal to "Maryland" (aka the entire USA). Store the modified dataset as `covid_filtered`.
Filter the dataset so that the "reporting_jurisdiction" column is equal to "Maryland". Store the modified dataset as `covid_filtered`.

```
# General format
Expand Down
2 changes: 1 addition & 1 deletion modules/Data_Output/lab/Data_Output_Lab_Key.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ covid <- read_csv(file = "https://daseh.org/data/SARS-CoV-2_Wastewater_Data.csv"

### 1.2

Filter the dataset so that the "reporting_jurisdiction" column is equal to "Maryland" (aka the entire USA). Store the modified dataset as `covid_filtered`.
Filter the dataset so that the "reporting_jurisdiction" column is equal to "Maryland". Store the modified dataset as `covid_filtered`.

```
# General format
Expand Down
16 changes: 13 additions & 3 deletions modules/Data_Output/lab/Data_Output_Lab_Key.html
Original file line number Diff line number Diff line change
Expand Up @@ -366,8 +366,18 @@ <h3>1.1</h3>
<pre><code># General format
library(readr)
# OBJECT &lt;- read_csv(FILE)</code></pre>
<pre class="r"><code>library(tidyverse)
covid &lt;- read_csv(file = &quot;https://daseh.org/data/SARS-CoV-2_Wastewater_Data.csv&quot;)</code></pre>
<pre class="r"><code>library(tidyverse)</code></pre>
<pre><code>## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors</code></pre>
<pre class="r"><code>covid &lt;- read_csv(file = &quot;https://daseh.org/data/SARS-CoV-2_Wastewater_Data.csv&quot;)</code></pre>
<pre><code>## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat &lt;- vroom(...)
Expand All @@ -383,7 +393,7 @@ <h3>1.1</h3>
<div id="section-1" class="section level3">
<h3>1.2</h3>
<p>Filter the dataset so that the “reporting_jurisdiction” column is
equal to “Maryland” (aka the entire USA). Store the modified dataset as
equal to “Maryland”. Store the modified dataset as
<code>covid_filtered</code>.</p>
<pre><code># General format
NEW_OBJECT &lt;- OBJECT %&gt;% filter(COLUMNNAME == CRITERIA)</code></pre>
Expand Down
134 changes: 72 additions & 62 deletions modules/Statistics/Statistics.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ library(dplyr)
options(scipen = 999)
library(readr)
library(ggplot2)
library(emo)
library(tidyverse)
```

## Summary
Expand Down Expand Up @@ -59,10 +59,10 @@ knitr::include_graphics("https://c.tenor.com/O3x8ywLm380AAAAd/chevy-chase.gif")
## Overview

::: {style="color: red;"}
`r emo::ji("rotating_light")` We will focus on how to use R software to do these. We will be glossing over the statistical **theory** and "formulas" for these tests. Moreover, we do not claim the data we use for demonstration meet **assumptions** of the methods. `r emo::ji("rotating_light")`
We will focus on how to use R software to do these. We will be glossing over the statistical **theory** and "formulas" for these tests. Moreover, we do not claim the data we use for demonstration meet **assumptions** of the methods.
:::

There are plenty of resources online for learning more about these methods, as well as dedicated Biostatistics series (at different advancement levels) at the JHU School of Public Health.
There are plenty of resources online for learning more about these methods.

Check out [www.opencasestudies.org](https://www.opencasestudies.org/) for deeper dives on some of the concepts covered here and the [resource page](https://daseh.org/resources.html) for more resources.

Expand Down Expand Up @@ -129,44 +129,47 @@ Like other functions, if there are `NA`s, you get `NA` as the result. But if yo

```{r}
# x and y must be numeric vectors
x <- pull(yearly_co2_emissions, `1989`)
y <- pull(yearly_co2_emissions, `2014`)
co2long <- yearly_co2_emissions %>% pivot_longer(-country)
co2wide <-co2long %>% pivot_wider(names_from = country, values_from = value)
US <- co2wide %>% pull(`United States`)
UK <- co2wide %>% pull(`United Kingdom`)
```

```{r,fig.width=3, fig.height=3}
# have to specify which data on each axis
# can accomodate missing data
plot(x, y)
plot(US, UK)
```

## Correlation coefficient calculation and test

```{r}
library(broom)
cor(x, y)
cor(x, y, use = "complete.obs")
cor.test(x, y)
cor(US, UK)
cor(US, UK, use = "complete.obs")
cor.test(US, UK)
```

## Broom package

The `broom` package helps make stats results look tidy

```{r}
cor_result <- tidy(cor.test(x, y))
cor_result <- tidy(cor.test(US, UK))
glimpse(cor_result)
```

## Correlation for two vectors with plot

In plot form... `geom_smooth()` and `annotate()` can help.

```{r}
```{r, warning = F}
corr_value <- pull(cor_result, estimate) %>% round(digits = 4)
cor_label <- paste0("R = ", corr_value)
yearly_co2_emissions %>%
ggplot(aes(x = `1989`, y = `2014`)) +
co2wide %>%
ggplot(aes(x = US, y = UK)) +
geom_point(size = 0.3) +
geom_smooth() +
annotate("text", x = 2000, y = 7500, label = cor_label)
Expand Down Expand Up @@ -195,7 +198,8 @@ We can compute correlation for all pairs of columns of a data frame / matrix. Th
Columns must be all numeric!

```{r}
co2_subset <- yearly_co2_emissions %>% select(c(`1909`, `1929`, `1949`, `1969`, `1989`, `2009`))
co2_subset <- co2wide %>% select(c(`United States`, `United Kingdom`, Canada))
head(co2_subset)
```

Expand Down Expand Up @@ -253,14 +257,13 @@ It tests the mean of a variable in one group. By default (i.e., without us expli
- returns result assuming confidence level 0.95 (`conf.level = 0.95`)
- omits `NA` values in data

Let's look at the dataset of haloacetic acid levels in public water sources in Washington, saved as `haa5` in the `dasehr` package.
Let's look at the UK and US CO2 emissions data again.

```{r}
x <- haa5 %>% pull(perc_pop_exposed_to_exceedances)
sum(is.na(x)) # count NAs in x
sum(is.na(US)) # count NAs in x
t.test(x)
t.test(US)
```

## Running two-sample t-test {.small}
Expand All @@ -279,28 +282,25 @@ Check out this this [case study](https://www.opencasestudies.org/ocs-bp-rural-an
## Running two-sample t-test in R

```{r}
x <- haa5 %>% pull(pop_on_sampled_PWS)
y <- haa5 %>% pull(pop_exposed_to_exceedances)
sum(is.na(US))
sum(is.na(UK)) # count NAs in x and y
sum(is.na(x))
sum(is.na(y)) # count NAs in x and y
t.test(x, y)
t.test(US, UK)
```

## T-test: retrieving information from the result with `broom` package

The `broom` package has a `tidy()` function that can organize results into a data frame so that they are easily manipulated (or nicely printed)

```{r broom, comment=""}
result <- t.test(x, y)
result <- t.test(US, UK)
result_tidy <- tidy(result)
glimpse(result_tidy)
```

## P-value adjustment {.smaller}

`r emo::ji("rotating_light")` You run an increased risk of Type I errors (a "false positive") when multiple hypotheses are tested simultaneously. `r emo::ji("rotating_light")`
You run an increased risk of Type I errors (a "false positive") when multiple hypotheses are tested simultaneously.

Use the `p.adjust()` function on a vector of p values. Use `method = ` to specify the adjustment method:

Expand Down Expand Up @@ -467,22 +467,17 @@ For example, if we want to fit a regression model where outcome is `income` and

## Linear regression

We will use our dataset about nitrate levels by quarter in public water sources in Washington. We'll load a slightly different version of this dataset, which can be found at
https://daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_v2_data.csv.


```{r}
nitrate <- read_csv(file = "https://daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_v2_data.csv", show_col_types = F)
We will use our the calenviroscreen dataset from the `dasehr` package to examine how traffic estimates predict diesel particulate emissions.

head(nitrate)
```

## Linear regression: model fitting

The upper limit of acceptable nitrates in water is 10 ug/L. We fit linear regression model with the number of people on public water systems with more than the limit of nitrates (`pop_exposed_to_exceedances`) as an outcome and the number of people exposed to 3-5 ug/L of nitrates (`pop_>3-5ug/L`) as a predictor. In other words, we are evaluating if the number of people exposed to low levels of nitrates in their water is predictive of the number of people exposed to excess nitrates in their water.
For this model, we will use two variables:
**DieselPM** - estimated diesel particulate emissions from on-road and non-road sources
**TrafficPctl** - percentile ranking of traffic density

```{r}
fit <- glm(pop_exposed_to_exceedances ~ `pop_>3-5ug/L`, data = nitrate)
fit <- glm(DieselPM ~ TrafficPctl, data = calenviroscreen)
fit
```

Expand All @@ -496,18 +491,18 @@ summary(fit)

## tidy results
The broom package can help us here too!
The estimate is the coefficient or slope - for one change in hours worked (1 hour increase), we see 1.58 more visits. The error for this estimate is relatively small at 0.167. This relationship appears to be significant with a small p value <0.001.
The estimate is the coefficient or slope - for one change in the traffic percentile, we see 0.003637 more Diesel particulate emissions. The error for this estimate is relatively small at 0.00009. This relationship appears to be significant with a small p value < 2e-16.

```{r}
tidy(fit) %>% glimpse()
```

## Linear regression: multiple predictors {.smaller}

Let's try adding another explanatory variable to our model, year (`year`). The tidy function will not work with this unfortunately. The meaning of coefficients is more complicated here.
Let's try adding another explanatory variable to our model, amount of daily Ozone concentration (`Ozone`). Ozone is usually inversely related to particulate measures. The tidy function will not work with this unfortunately. The meaning of coefficients is more complicated here.

```{r}
fit2 <- glm(pop_exposed_to_exceedances ~ `pop_>3-5ug/L` + year, data = nitrate)
fit2 <- glm(DieselPM ~ TrafficPctl + Ozone, data = calenviroscreen)
summary(fit2)
```

Expand All @@ -525,10 +520,15 @@ fit2 %>%

Factors get special treatment in regression models - lowest level of the factor is the comparison group, and all other factors are **relative** to its values.

`quarter` takes values Q1, Q2, Q3, and Q4 to indicate the quarter of the year for each observation.
Let's create a variable that tells us whether a census tract has a high, middle, or low percentage of the population below the poverty line.

```{r}
nitrate %>% count(quarter)
calenviroscreen <- calenviroscreen %>% mutate(PovertyPctl_level = case_when(
PovertyPctl > 0.75 ~ "high",
PovertyPctl >0.25 & PovertyPctl<=0.75 ~ "middle",
PovertyPctl <=0.25 ~ "low",
TRUE ~ "unknown"
))
```


Expand All @@ -539,7 +539,7 @@ The comparison group that is not listed is treated as intercept. All other estim

```{r regressbaseline, comment="", fig.height=4,fig.width=8}
fit3 <- glm(pop_exposed_to_exceedances ~ `pop_>3-5ug/L` + year + factor(quarter), data = nitrate)
fit3 <- glm(DieselPM ~ TrafficPctl + Ozone + factor(PovertyPctl_level), data = calenviroscreen)
summary(fit3)
```

Expand All @@ -548,13 +548,13 @@ summary(fit3)
Relative to the level is not listed.

```{r}
nitrate <- nitrate %>% mutate(quarter = factor(quarter,
calenviroscreen <- calenviroscreen %>% mutate(PovertyPctl_level = factor(PovertyPctl_level,
levels =
c(
"Q1", "Q2", "Q3", "Q4"
"low", "middle", "high", "unknown"
)
))
fit4 <- glm(pop_exposed_to_exceedances ~ `pop_>3-5ug/L` + year + quarter, data = nitrate)
fit4 <- glm(DieselPM ~ TrafficPctl + Ozone + PovertyPctl_level, data = calenviroscreen)
summary(fit4)
```

Expand All @@ -563,7 +563,7 @@ summary(fit4)
You can view estimates for the comparison group by removing the intercept in the GLM formula `y ~ x - 1`. *Caveat* is that the p-values change.

```{r regress9, comment="", fig.height=4, fig.width=8}
fit5 <- glm(pop_exposed_to_exceedances ~ `pop_>3-5ug/L` + year + quarter - 1, data = nitrate)
fit5 <- glm(DieselPM ~ TrafficPctl + Ozone + PovertyPctl_level - 1, data = calenviroscreen)
summary(fit5)
```

Expand All @@ -572,7 +572,7 @@ summary(fit5)
You can also specify interactions between variables in a formula `y ~ x1 + x2 + x1 * x2`. This allows for not only the intercepts between factors to differ, but also the slopes with regard to the interacting variable.

```{r fig.height=4, fig.width=8}
fit6 <- glm(pop_exposed_to_exceedances ~ `pop_>3-5ug/L` + year + quarter + year * quarter, data = nitrate)
fit6 <- glm(DieselPM ~ TrafficPctl + Ozone + PovertyPctl_level + TrafficPctl*PovertyPctl_level, data = calenviroscreen)
tidy(fit6)
```

Expand All @@ -581,11 +581,13 @@ tidy(fit6)
By default, `ggplot` with a factor added as a color will look include the interaction term. Notice the different intercept and slope of the lines.

```{r fig.height=3.5, fig.width=7, warning=FALSE}
ggplot(nitrate, aes(x = `pop_>3-5ug/L`, y = pop_exposed_to_exceedances, color = quarter)) +
geom_point(size = 1, alpha = 0.8) +
ggplot(calenviroscreen, aes(x = DieselPM, y = TrafficPctl, color = PovertyPctl_level)) +
geom_point(size = 1, alpha = 0.1) +
geom_smooth(method = "glm", se = FALSE) +
scale_color_manual(values = c("black", "grey45", "grey65", "grey85")) +
theme_classic()
theme_classic() +
ylim(0,100) +
xlim(0, 3)
```

## Generalized linear models (GLMs)
Expand All @@ -606,31 +608,31 @@ See `?family` documentation for details of family functions.

## Logistic regression {.smaller}

Let's look at a logistic regression example. We'll use the `nitrate` dataset again, but first we'll need to create a binary variable that tells us whether the percentage of the population exposed to exceedances `perc_pop_exposed_to_exceedances` is greater than 0.1%.
Let's look at a logistic regression example. We'll use the `calenviroscreen` dataset again. We will create a new binary variable based on the DieselPM percentile variable, so we can tell whether a census tract has high or low DieselPM emissions compared to the others.

```{r}
nitrate <-
nitrate %>%
calenviroscreen <-
calenviroscreen %>%
mutate(
PercExpMore0.1 = case_when
(perc_pop_exposed_to_exceedances > 0.1 ~ 1,
perc_pop_exposed_to_exceedances <= 0.1 ~ 0))
DieselPM_level = case_when
(DieselPMPctl > 0.75 ~ 1,
DieselPMPctl <= 0.75 ~ 0))
```


## Logistic regression {.smaller}

Now that we've created the `PercExpMore0.1` variable (where a `1` indicates the percentage of the population exposed to excessive nitrates is greater than 0.1%), we can run a logistic regression.
Now that we've created the `DieselPM_level` variable (where a `1` indicates the census tract is one of the top 75% when it comes to dieselPM emissions), we can run a logistic regression.

Let's explore how `quarter` and `year` might predict `PercExpMore0.1`.
Let's explore how `PovertyPctl_level` might predict `DieselPM_level`.

```
# General format
glm(y ~ x, data = DATASET_NAME, family = binomial(link = "logit"))
```

```{r regress7, comment="", fig.height=4,fig.width=8}
binom_fit <- glm(PercExpMore0.1 ~ year + quarter, data = nitrate, family = binomial(link = "logit"))
binom_fit <- glm(DieselPM_level ~ PovertyPctl_level, data = calenviroscreen, family = binomial(link = "logit"))
summary(binom_fit)
```

Expand All @@ -645,12 +647,20 @@ Check out [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/).

Use `oddsratio(x, y)` from the `epitools()` package to calculate odds ratios.

In this case, we're calculating the odds ratio for whether the the first 6 months or last 6 months predicts whether the percentage of population exposed to excess nitrates is more than 0.1%
In this case, we're calculating the odds ratio for whether living in a high traffic area predicts high levels of DieselPM.

```{r}
library(epitools)
response <- nitrate %>% pull(PercExpMore0.1)
predictor <- nitrate %>% pull(half_of_year)
calenviroscreen <-
calenviroscreen %>%
mutate(
Traffic_level = case_when
(TrafficPctl > 0.75 ~ 1,
TrafficPctl <= 0.75 ~ 0))
response <- calenviroscreen %>% pull(DieselPM_level)
predictor <- calenviroscreen %>% pull(Traffic_level)
oddsratio(predictor, response)
```

Expand Down
Loading

0 comments on commit 49ef856

Please sign in to comment.