Merge pull request #173 from fhdsl/adding-variable-explanations-to-statistics-lab

[Statistics] Updating confusing language in lab
avahoffman authored Oct 9, 2024
2 parents 63daea1 + 45fcc5e commit a156ce6
Showing 3 changed files with 175 additions and 103 deletions.
139 changes: 75 additions & 64 deletions modules/Statistics/Statistics.Rmd
Function `cor()` computes correlation in R.

```
cor(x, y = NULL, use = c("everything", "complete.obs"),
method = c("pearson", "kendall", "spearman"))
```
<br>

- provide two numeric vectors of the same length (arguments `x`, `y`), or
- provide a data.frame / tibble with numeric columns only
- by default, Pearson correlation coefficient is computed
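As a quick sketch with made-up vectors (the numbers here are illustrative only, not from the CO2 data):

```r
# Toy numeric vectors of the same length (hypothetical values)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

cor(x, y)                       # Pearson by default
cor(x, y, method = "spearman")  # rank-based alternative
```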

## Correlation test

```
cor.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
         method = c("pearson", "kendall", "spearman"), ...)
```
- greater means true correlation coefficient is > 0 (positive relationship)
- less means true correlation coefficient is < 0 (negative relationship)
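A toy sketch of `cor.test()` and the fields of its result (the vectors are made up for illustration):

```r
# Hypothetical numeric vectors
x <- c(10, 12, 14, 15, 18, 20)
y <- c(11, 14, 13, 17, 19, 23)

res <- cor.test(x, y, alternative = "two.sided", method = "pearson")
res$estimate   # sample correlation coefficient
res$p.value    # p-value for H0: true correlation = 0
res$conf.int   # confidence interval for the true correlation
```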

## GUT CHECK!

What class of data do you need to calculate a correlation?

A. Character data

B. Factor data

C. Numeric data

## Correlation {.codesmall}

Let's look at the dataset of yearly CO2 emissions by country.

```{r cor1, comment="", message = FALSE}
yearly_co2 <-
read_csv(file = "https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv")
```

## Correlation for two vectors

First, we create two vectors.

```{r}
# x and y must be numeric vectors
y1980 <- yearly_co2 %>% pull(`1980`)
y1985 <- yearly_co2 %>% pull(`1985`)
```

<br>

Like other functions, if there are `NA`s, you get `NA` as the result. But if you specify `use = "complete.obs"`, then it will give you correlation using the non-missing data.

```{r}
cor(y1980, y1985, use = "complete.obs")
```


## Correlation coefficient calculation and test

```{r}
glimpse(cor_result)
```

## Correlation for two vectors with plot {.codesmall}

In plot form... `geom_smooth()` and `annotate()` can look very nice!

```{r, warning = F}
corr_value <- pull(cor_result, estimate) %>% round(digits = 4)
cor_label <- paste0("R = ", corr_value)
yearly_co2 %>%
ggplot(aes(x = `1980`, y = `1985`)) + geom_point(size = 1) + geom_smooth() +
annotate("text", x = 2000000, y = 4000000, label = cor_label)
```

<!-- ## Plotting with `ggpubr` -->

<!-- In plot form... `geom_smooth()` of `ggplot2` can help, as can `stat_cor()` of `ggpubr`. -->
<!-- ```{r, fig.width=3, fig.height=3} -->
<!-- install.packages("ggpubr") -->

<!-- library(ggpubr) -->
<!-- yearly_co2_emissions %>% -->
<!-- ggplot(aes(x = `1989`, y = `2014`)) + -->
<!-- geom_point(size = 0.3) + -->
<!-- geom_smooth() + -->
<!-- stat_cor(p.accuracy = 0.001) -->
<!-- ``` -->



## Correlation for data frame columns

We can compute correlation for all pairs of columns of a data frame / matrix. This is often called, *"computing a correlation matrix"*.

Columns must be all numeric!

```{r}
co2_subset <- yearly_co2 %>%
select(c(`1950`, `1980`, `1985`, `2010`))
head(co2_subset)
```
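As a sketch of the idea on a small made-up all-numeric data frame, `cor()` on a data frame returns a symmetric matrix of pairwise correlations:

```r
# Hypothetical all-numeric data frame
df <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),  # perfectly correlated with a
  c = c(5, 3, 4, 1, 2)
)

m <- cor(df)
m  # symmetric matrix; the diagonal is 1
```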

## Correlation for data frame columns

We can compute correlation for all pairs of columns of a data frame / matrix. This is often called, *"computing a correlation matrix"*.

```{r}
knitr::include_graphics(here::here("images/lyme_and_fried_chicken.png"))
```

## T-test

The commonly used t-tests are:

- **one-sample t-test** -- used to test mean of a variable in one group
- **two-sample t-test** -- used to test difference in means of a variable between two groups
  - if the "two groups" are data on the *same* individuals collected at 2 time points, we call it a two-sample paired t-test

The `t.test()` function does both.

```
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
```

See [here](https://www.nature.com/articles/nbt1209-1135) for more about multiple comparisons corrections.
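A toy sketch of both usages of `t.test()` (all numbers made up):

```r
# Hypothetical measurements for two groups
group1 <- c(5.1, 4.9, 5.6, 5.2, 5.0, 5.4)
group2 <- c(5.9, 6.1, 5.8, 6.3, 6.0, 5.7)

# One-sample: is the mean of group1 different from 5?
t.test(group1, mu = 5)

# Two-sample: do the two group means differ?
res <- t.test(group1, group2)
res$p.value
```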
## Some other statistical tests

- `wilcox.test()` -- Wilcoxon signed rank test, Wilcoxon rank sum test
- `shapiro.test()` -- Test normality assumptions
- `ks.test()` -- Kolmogorov-Smirnov test
- `var.test()` -- Fisher's F-test
- `chisq.test()` -- Chi-squared test
We typically provide two arguments:
- `formula` -- model formula written using names of columns in our data
- `data` -- our data frame

<!-- Note that regression coefficients are unstandardized, meaning they are in the original units of the variables, while standardized coefficients are unitless and based on standard deviations. -->
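A minimal sketch of those two arguments on a made-up data frame (the variable names are hypothetical):

```r
# Hypothetical data: outcome y, predictor x
dat <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)
)

# formula = y ~ x, data = dat
fit <- glm(y ~ x, data = dat)  # equivalent to lm() for a gaussian outcome
coef(fit)                      # intercept and slope
```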

## Linear regression fit in R: model formula

Model formula
For example, if we want to fit a regression model where the outcome is `income` and the predictors are `years_of_education`, `age`, and `location`, we write:
`income ~ years_of_education + age + location`
</p>

## Linear regression example

Let's look at variables that might be able to predict the number of heat-related ER visits in Colorado.

We'll use the dataset that has ER visits separated out by age category.

We'll combine this with a new dataset that has some weather information about summer temperatures in Denver, downloaded from [https://www.weather.gov/bou/DenverSummerHeat](https://www.weather.gov/bou/DenverSummerHeat).

We will use this as a proxy for temperatures for the state of CO as a whole for this example.

## Linear regression example {.codesmall}

```{r}
er <- read_csv(file = "https://daseh.org/data/CO_ER_heat_visits_by_age.csv")
er_temps <- er_temps %>% mutate(age = factor(age))
```



## Linear regression: factors {.smaller}

The comparison group that is not listed is treated as intercept. All other estimates are relative to the intercept.
```{r}
summary(fit3)
```
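The reference-level behavior is easiest to see on a tiny made-up dataset (values chosen so the group means are obvious):

```r
# Hypothetical data: outcome y measured in three groups
dat <- data.frame(
  group = factor(c("A", "A", "B", "B", "C", "C")),
  y     = c(1, 2, 4, 5, 7, 8)
)

fit <- glm(y ~ group, data = dat)
coef(fit)
# (Intercept) is the mean of the reference level "A" (1.5);
# groupB and groupC are differences relative to "A"
```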

Maybe we want to use the age group "65+ years" as our reference. We can relevel the factor.

The ages are relative to the level that is not listed.

```{r}
er_temps <-
er_temps %>%
mutate(age = factor(age,
levels = c("65+ years", "35-64 years", "15-34 years", "5-14 years", "0-4 years")
))
fit4 <- glm(visits ~ highest_temp + year + age, data = er_temps)
```

You can view estimates for the comparison group by removing the intercept in the model formula:

`y ~ x - 1`

*Caveat* is that the p-values change, and interpretation is often confusing.

```{r regress9, comment="", fig.height=4, fig.width=8}
fit5 <- glm(visits ~ highest_temp + year + age - 1, data = er_temps)
summary(fit5)
```
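On the same kind of toy data, dropping the intercept with `- 1` turns each factor coefficient into that group's own mean rather than a difference from the reference:

```r
# Hypothetical data: outcome y measured in three groups
dat <- data.frame(
  group = factor(c("A", "A", "B", "B", "C", "C")),
  y     = c(1, 2, 4, 5, 7, 8)
)

fit_no_int <- glm(y ~ group - 1, data = dat)
coef(fit_no_int)  # each coefficient is now that group's mean
```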

## Linear regression: interactions

```{r, fig.alt="Statistical interaction showing the relationship between cookie yield, temperature, and cooking duration.", out.width = "70%", echo = FALSE, fig.align='center'}
knitr::include_graphics("images/interaction.png")
```

[source](https://en.wikipedia.org/wiki/Interaction_(statistics)#/media/File:Interaction_plot_cookie_baking.svg)

## Linear regression: interactions {.smaller}

You can also specify interactions between variables in a formula `y ~ x1 + x2 + x1 * x2`. This allows for not only the intercepts between factors to differ, but also the slopes with regard to the interacting variable.
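A sketch of an interaction on simulated data, where group B's slope is deliberately steeper (`y ~ x * group` expands to `y ~ x + group + x:group`):

```r
# Simulated data: slope of x is 1 in group A and 3 in group B
set.seed(1)
dat <- data.frame(
  x     = rep(1:10, 2),
  group = factor(rep(c("A", "B"), each = 10))
)
dat$y <- ifelse(dat$group == "A", 1 * dat$x, 3 * dat$x) + rnorm(20, sd = 0.1)

fit_int <- glm(y ~ x * group, data = dat)
coef(fit_int)
# "x:groupB" estimates how much steeper the slope is in group B (about 2)
```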
We will create a new binary variable `high_rate`. We will say a visit rate of more than 8 visits counts as a high rate.

```{r}
er_temps <-
  er_temps %>% mutate(high_rate = rate > 8)
glimpse(er_temps)
```


## Logistic regression {.smaller}

Now that we've created the `high_rate` variable (where `TRUE` indicates that the visit rate was greater than 8), we can run a logistic regression.

Let's explore how `highest_temp`, `year`, and `age` might predict `high_rate`.

```
# General format
glm(y ~ x, data = DATASET_NAME, family = binomial(link = "logit"))
```

See this [case study](https://www.opencasestudies.org/ocs-bp-vaping-case-study/) for more examples.
Check out [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/).


## Odds ratios {.smaller}

Use `oddsratio(x, y)` from the `epitools` package to calculate odds ratios.

During the years 2012, 2018, 2021, and 2022, there were multiple consecutive days with temperatures above 100 degrees. We will code this as `heatwave`.


```{r}
library(epitools)
er_temps <-
  er_temps %>%
  mutate(heatwave = year %in% c(2012, 2018, 2021, 2022))
glimpse(er_temps)
```

## Odds ratios {.smaller}

In this case, we're calculating the odds ratio for whether a heatwave is associated with having a visit rate greater than 8.

```{r}
response <- er_temps %>% pull(high_rate)
predictor <- er_temps %>% pull(heatwave)
oddsratio(predictor, response)
```
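As a cross-check on what `oddsratio()` reports, the sample odds ratio can be computed by hand from a 2x2 table of counts (the counts below are made up, and `oddsratio()`'s default median-unbiased estimate will differ slightly from this simple ratio):

```r
# Hypothetical 2x2 table of counts (column-major fill):
#                 response FALSE  response TRUE
# exposure FALSE              40             10
# exposure TRUE               20             30
tab <- matrix(c(40, 20, 10, 30), nrow = 2)

# Sample odds ratio: (a * d) / (b * c)
or_hand <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
or_hand
```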

## Odds ratios {.smaller}

The Odds Ratio is 3.86.

When the predictor is TRUE (aka it was a heatwave year), the odds of the response (high hospital visitation) are 3.86 times greater than when it is FALSE (not a heatwave year).

```{r echo = FALSE}
oddsratio(predictor, response)
```
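Odds ratios also come straight out of a logistic regression: exponentiating a coefficient gives the odds ratio for that predictor. A sketch on simulated data (the exposure effect is invented for illustration):

```r
# Simulated binary outcome with a higher event probability when exposed
set.seed(2)
exposure <- rep(c(0, 1), each = 50)
outcome  <- rbinom(100, 1, prob = ifelse(exposure == 1, 0.6, 0.3))

fit <- glm(outcome ~ exposure, family = binomial(link = "logit"))
exp(coef(fit)[["exposure"]])  # odds ratio for exposure
```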

Content for similar topics as this course can also be found on Leanpub.

💻 [Lab](https://daseh.org/modules/Statistics/lab/Statistics_Lab.Rmd)

📃 [Day 8 Cheatsheet](https://daseh.org/modules/cheatsheets/Day-8.pdf)

```{r, fig.alt="The End", out.width = "30%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```

Binary file added modules/Statistics/images/interaction.png
