Merge pull request #173 from fhdsl/adding-variable-explanations-to-statistics-lab

[Statistics] Updating confusing language in lab
avahoffman authored Oct 9, 2024
2 parents 63daea1 + 45fcc5e commit a156ce6
Showing 3 changed files with 175 additions and 103 deletions.
139 changes: 75 additions & 64 deletions modules/Statistics/Statistics.Rmd
Function `cor()` computes correlation in R.

```
cor(x, y = NULL, use = c("everything", "complete.obs"),
method = c("pearson", "kendall", "spearman"))
```
<br>

- provide two numeric vectors of the same length (arguments `x`, `y`), or
- provide a data.frame / tibble with numeric columns only
- by default, Pearson correlation coefficient is computed
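As a quick sketch with made-up vectors (the numbers here are illustrative only, not from the CO2 data):

```r
# Toy numeric vectors of the same length (hypothetical values)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

cor(x, y)                       # Pearson by default
cor(x, y, method = "spearman")  # rank-based alternative
```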

## Correlation test

```
cor.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
         method = c("pearson", "kendall", "spearman"), ...)
```
- greater means true correlation coefficient is > 0 (positive relationship)
- less means true correlation coefficient is < 0 (negative relationship)
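A toy sketch of `cor.test()` and the fields of its result (the vectors are made up for illustration):

```r
# Hypothetical numeric vectors
x <- c(10, 12, 14, 15, 18, 20)
y <- c(11, 14, 13, 17, 19, 23)

res <- cor.test(x, y, alternative = "two.sided", method = "pearson")
res$estimate   # sample correlation coefficient
res$p.value    # p-value for H0: true correlation = 0
res$conf.int   # confidence interval for the true correlation
```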

## GUT CHECK!

What class of data do you need to calculate a correlation?

A. Character data

B. Factor data

C. Numeric data

## Correlation {.codesmall}

Let's look at the dataset of yearly CO2 emissions by country.

```{r cor1, comment="", message = FALSE}
yearly_co2 <-
read_csv(file = "https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv")
```

## Correlation for two vectors

First, we create two vectors.

```{r}
# x and y must be numeric vectors
y1980 <- yearly_co2 %>% pull(`1980`)
y1985 <- yearly_co2 %>% pull(`1985`)
```

<br>

Like other functions, if there are `NA`s, you get `NA` as the result. But if you specify `use = "complete.obs"`, then it will give you correlation using the non-missing data.

```{r}
cor(y1980, y1985, use = "complete.obs")
```


## Correlation coefficient calculation and test

```{r}
glimpse(cor_result)
```

## Correlation for two vectors with plot {.codesmall}

In plot form... `geom_smooth()` and `annotate()` can look very nice!

```{r, warning = F}
corr_value <- pull(cor_result, estimate) %>% round(digits = 4)
cor_label <- paste0("R = ", corr_value)
yearly_co2 %>%
ggplot(aes(x = `1980`, y = `1985`)) + geom_point(size = 1) + geom_smooth() +
annotate("text", x = 2000000, y = 4000000, label = cor_label)
```

<!-- ## Plotting with `ggpubr` -->

<!-- In plot form... `geom_smooth()` of `ggplot2` can help, as can `stat_cor()` of `ggpubr`. -->
<!-- ```{r, fig.width=3, fig.height=3} -->
<!-- install.packages("ggpubr") -->

<!-- library(ggpubr) -->
<!-- yearly_co2_emissions %>% -->
<!-- ggplot(aes(x = `1989`, y = `2014`)) + -->
<!-- geom_point(size = 0.3) + -->
<!-- geom_smooth() + -->
<!-- stat_cor(p.accuracy = 0.001) -->
<!-- ``` -->



## Correlation for data frame columns

We can compute correlation for all pairs of columns of a data frame / matrix. This is often called, *"computing a correlation matrix"*.

Columns must be all numeric!

```{r}
co2_subset <- yearly_co2 %>%
select(c(`1950`, `1980`, `1985`, `2010`))
head(co2_subset)
```
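As a sketch of the idea on a small made-up all-numeric data frame, `cor()` on a data frame returns a symmetric matrix of pairwise correlations:

```r
# Hypothetical all-numeric data frame
df <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),  # perfectly correlated with a
  c = c(5, 3, 4, 1, 2)
)

m <- cor(df)
m  # symmetric matrix; the diagonal is 1
```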

## Correlation for data frame columns

We can compute correlation for all pairs of columns of a data frame / matrix. This is often called, *"computing a correlation matrix"*.

```{r}
knitr::include_graphics(here::here("images/lyme_and_fried_chicken.png"))
```

## T-test

The commonly used t-tests are:

- **one-sample t-test** -- used to test mean of a variable in one group
- **two-sample t-test** -- used to test difference in means of a variable between two groups
  - if the "two groups" are data on the *same* individuals collected at 2 time points, we call it a two-sample paired t-test

The `t.test()` function does both.

```
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
```

See [here](https://www.nature.com/articles/nbt1209-1135) for more about multiple comparisons corrections.
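A toy sketch of both usages of `t.test()` (all numbers made up):

```r
# Hypothetical measurements for two groups
group1 <- c(5.1, 4.9, 5.6, 5.2, 5.0, 5.4)
group2 <- c(5.9, 6.1, 5.8, 6.3, 6.0, 5.7)

# One-sample: is the mean of group1 different from 5?
t.test(group1, mu = 5)

# Two-sample: do the two group means differ?
res <- t.test(group1, group2)
res$p.value
```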
## Some other statistical tests

- `wilcox.test()` -- Wilcoxon signed rank test, Wilcoxon rank sum test
- `shapiro.test()` -- Test normality assumptions
- `ks.test()` -- Kolmogorov-Smirnov test
- `var.test()` -- Fisher's F-test
- `chisq.test()` -- Chi-squared test
We typically provide two arguments:
- `formula` -- model formula written using names of columns in our data
- `data` -- our data frame

<!-- Note that regression coefficients are unstandardized, meaning they are in the original units of the variables, while standardized coefficients are unitless and based on standard deviations. -->
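A minimal sketch of those two arguments on a made-up data frame (the variable names are hypothetical):

```r
# Hypothetical data: outcome y, predictor x
dat <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)
)

# formula = y ~ x, data = dat
fit <- glm(y ~ x, data = dat)  # equivalent to lm() for a gaussian outcome
coef(fit)                      # intercept and slope
```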

## Linear regression fit in R: model formula

Model formula
For example, if we want to fit a regression model where the outcome is `income` and the predictors are `years_of_education`, `age`, and `location`, we write:
`income ~ years_of_education + age + location`
</p>

## Linear regression example

Let's look at variables that might be able to predict the number of heat-related ER visits in Colorado.

We'll use the dataset that has ER visits separated out by age category.

We'll combine this with a new dataset that has some weather information about summer temperatures in Denver, downloaded from [https://www.weather.gov/bou/DenverSummerHeat](https://www.weather.gov/bou/DenverSummerHeat).

We will use this as a proxy for temperatures for the state of CO as a whole for this example.

## Linear regression example {.codesmall}

```{r}
er <- read_csv(file = "https://daseh.org/data/CO_ER_heat_visits_by_age.csv")
er_temps <- er_temps %>% mutate(age = factor(age))
```



## Linear regression: factors {.smaller}

The comparison group that is not listed is treated as intercept. All other estimates are relative to the intercept.
```{r}
summary(fit3)
```
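The reference-level behavior is easiest to see on a tiny made-up dataset (values chosen so the group means are obvious):

```r
# Hypothetical data: outcome y measured in three groups
dat <- data.frame(
  group = factor(c("A", "A", "B", "B", "C", "C")),
  y     = c(1, 2, 4, 5, 7, 8)
)

fit <- glm(y ~ group, data = dat)
coef(fit)
# (Intercept) is the mean of the reference level "A" (1.5);
# groupB and groupC are differences relative to "A"
```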

Maybe we want to use the age group "65+ years" as our reference. We can relevel the factor.

The ages are relative to the level that is not listed.

```{r}
er_temps <-
er_temps %>%
mutate(age = factor(age,
levels = c("65+ years", "35-64 years", "15-34 years", "5-14 years", "0-4 years")
))
fit4 <- glm(visits ~ highest_temp + year + age, data = er_temps)
```

You can view estimates for the comparison group by removing the intercept in the model formula:

`y ~ x - 1`

*Caveat* is that the p-values change, and interpretation is often confusing.

```{r regress9, comment="", fig.height=4, fig.width=8}
fit5 <- glm(visits ~ highest_temp + year + age - 1, data = er_temps)
summary(fit5)
```
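On the same kind of toy data, dropping the intercept with `- 1` turns each factor coefficient into that group's own mean rather than a difference from the reference:

```r
# Hypothetical data: outcome y measured in three groups
dat <- data.frame(
  group = factor(c("A", "A", "B", "B", "C", "C")),
  y     = c(1, 2, 4, 5, 7, 8)
)

fit_no_int <- glm(y ~ group - 1, data = dat)
coef(fit_no_int)  # each coefficient is now that group's mean
```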

## Linear regression: interactions

```{r, fig.alt="Statistical interaction showing the relationship between cookie yield, temperature, and cooking duration.", out.width = "70%", echo = FALSE, fig.align='center'}
knitr::include_graphics("images/interaction.png")
```

[source](https://en.wikipedia.org/wiki/Interaction_(statistics)#/media/File:Interaction_plot_cookie_baking.svg)

## Linear regression: interactions {.smaller}

You can also specify interactions between variables in a formula `y ~ x1 + x2 + x1 * x2`. This allows for not only the intercepts between factors to differ, but also the slopes with regard to the interacting variable.
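A sketch of an interaction on simulated data, where group B's slope is deliberately steeper (`y ~ x * group` expands to `y ~ x + group + x:group`):

```r
# Simulated data: slope of x is 1 in group A and 3 in group B
set.seed(1)
dat <- data.frame(
  x     = rep(1:10, 2),
  group = factor(rep(c("A", "B"), each = 10))
)
dat$y <- ifelse(dat$group == "A", 1 * dat$x, 3 * dat$x) + rnorm(20, sd = 0.1)

fit_int <- glm(y ~ x * group, data = dat)
coef(fit_int)
# "x:groupB" estimates how much steeper the slope is in group B (about 2)
```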
We will create a new binary variable `high_rate`. We will say a visit rate of more than 8 visits counts as a high rate.

```{r}
er_temps <-
  er_temps %>% mutate(high_rate = rate > 8)
glimpse(er_temps)
```


## Logistic regression {.smaller}

Now that we've created the `high_rate` variable (where `TRUE` indicates that the visit rate was greater than 8), we can run a logistic regression.

Let's explore how `highest_temp`, `year`, and `age` might predict `high_rate`.

```
# General format
glm(y ~ x, data = DATASET_NAME, family = binomial(link = "logit"))
```

See this [case study](https://www.opencasestudies.org/ocs-bp-vaping-case-study/) for more examples.
Check out [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/).


## Odds ratios {.smaller}

Use `oddsratio(x, y)` from the `epitools` package to calculate odds ratios.

During the years 2012, 2018, 2021, and 2022, there were multiple consecutive days with temperatures above 100 degrees. We will code this as `heatwave`.


```{r}
library(epitools)
er_temps <-
  er_temps %>%
  mutate(heatwave = year %in% c(2012, 2018, 2021, 2022))
glimpse(er_temps)
```

## Odds ratios {.smaller}

In this case, we're calculating the odds ratio for whether a heatwave is associated with having a visit rate greater than 8.

```{r}
response <- er_temps %>% pull(high_rate)
predictor <- er_temps %>% pull(heatwave)
oddsratio(predictor, response)
```
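As a cross-check on what `oddsratio()` reports, the sample odds ratio can be computed by hand from a 2x2 table of counts (the counts below are made up, and `oddsratio()`'s default median-unbiased estimate will differ slightly from this simple ratio):

```r
# Hypothetical 2x2 table of counts (column-major fill):
#                 response FALSE  response TRUE
# exposure FALSE              40             10
# exposure TRUE               20             30
tab <- matrix(c(40, 20, 10, 30), nrow = 2)

# Sample odds ratio: (a * d) / (b * c)
or_hand <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
or_hand
```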

## Odds ratios {.smaller}

The Odds Ratio is 3.86.

When the predictor is TRUE (aka it was a heatwave year), the odds of the response (high hospital visitation) are 3.86 times greater than when it is FALSE (not a heatwave year).

```{r echo = FALSE}
oddsratio(predictor, response)
```
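Odds ratios also come straight out of a logistic regression: exponentiating a coefficient gives the odds ratio for that predictor. A sketch on simulated data (the exposure effect is invented for illustration):

```r
# Simulated binary outcome with a higher event probability when exposed
set.seed(2)
exposure <- rep(c(0, 1), each = 50)
outcome  <- rbinom(100, 1, prob = ifelse(exposure == 1, 0.6, 0.3))

fit <- glm(outcome ~ exposure, family = binomial(link = "logit"))
exp(coef(fit)[["exposure"]])  # odds ratio for exposure
```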

Content for similar topics as this course can also be found on Leanpub.

💻 [Lab](https://daseh.org/modules/Statistics/lab/Statistics_Lab.Rmd)

📃 [Day 8 Cheatsheet](https://daseh.org/modules/cheatsheets/Day-8.pdf)

```{r, fig.alt="The End", out.width = "30%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```

Binary file added modules/Statistics/images/interaction.png
