More minor tweaks to stats module

lter · May 16, 2024 · 39d4a2c · 39d4a2c
1 parent ee13752
commit 39d4a2c
Show file tree

Hide file tree

Showing 3 changed files with 66 additions and 8 deletions.
diff --git a/_freeze/mod_stats/execute-results/html.json b/_freeze/mod_stats/execute-results/html.json
@@ -1,9 +1,11 @@
 {
-  "hash": "1f5e37b15693ec1e162ad5b7626629f9",
+  "hash": "5cb52db8a62220efc6366db0556875be",
   "result": {
     "engine": "knitr",
-    "markdown": "---\ntitle: \"Analysis & Modeling\"\ncode-annotations: hover\n---\n\n\n## Overview\n\nGiven the wide range in statistical training in graduate curricula (and corresponding breadth of experience among early career researchers), we'll be approaching this module in a different way than the others. One half of the module will use a \"flipped approach\" where project teams will share their proposed analyses with one another. The other half of the module will be dedicated to analyses that are more common in--or exclusive to--synthesis research. Content produced by project teams during the flipped half may be linked in the [Additional Resources](https://lter.github.io/ssecr/mod_stats.html#additional-resources) section at the bottom of this module at the discretion of each team. Otherwise the content of this module will focus only on the non-flipped content.\n\n## Learning Objectives\n\nAfter completing this module you will be able to: \n\n- <u>Describe</u> proposed analytical methods to an interested audience of mixed prior experience\n- <u>Explain</u> nuance in interpretation of results of proposed analyses\n- <u>Compare</u> and contrast interpretation of results in synthesis work versus primary research\n- <u>Identify</u> statistical tests common in synthesis research\n- <u>Perform</u> some synthesis-specific analyses\n\n## Needed Packages\n\nIf you'd like to follow along with the code chunks included throughout this module, you'll need to install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Note that these lines only need to be run once per computer\n## So you can skip this step if you've installed these before\ninstall.packages(\"tidyverse\")\n```\n:::\n\n\n## Mixed-Effects Models\n\nIn any statistical test there is at least one response variable (a.k.a. \"dependent\" variable) and some number of explanatory variables (a.k.a. \"independent\" variables). However, in biology our experiments often involve repeated sampling over time or at the same locations in order to gather enough replicates to do robust analysis. These variables are neither response nor explanatory variables but we might reasonably conclude that they affect our response and/or explanatory variables.\n\nIn essence we want to use a statistical tool that asks <u>what is the effect of the explanatory variable(s) on the response _when the variation due to these non-variable considerations are accounted for_</u>?' Such tests are called \"mixed-effects models.\" This name derives from considering explanatory variables \"fixed effects\" and these non-explanatory/response variables as \"random effects\". Including both fixed and random effects thus creates a model with \"mixed effects.\"\n\n### Mixed-Effects Case Study\n\nLet's imagine we are researching tarantula populations for several years in the Chihuahuan Desert. Our hypothesis is that the number of tarantulas (response variable) will be greater in sites further from the nearest road (explanatory variable). We select ten study sites of varying distances from the nearest road and intensively count our furry friends for several months at three plots within each site. We return to our sites--and their associated plots--and repeat this process each year for three years. After entering our data in MS Excel here's what we walk away with.\n\n\n\nWith our data in hand we now want to run some statistical tests and--hopefully--get some endorphine-inducing small _p_ values.\n\nIf we choose to simply ignore our possible random effects, our model would look something like this:\n\n```\ntarantula_count ~ road_dist_km\n```\n\n\n\nHowever, from our extensive time in the field we have developed a strong intuition that \n\n\n\n## Multi-Model Inference\n\n\n\n## Meta-Analysis\n\n\n\n## Additional Resources\n\n### Papers & Documents\n\n- [Understanding ‘It Depends’ in Ecology: A Guide to Hypothesising, Visualising and Interpreting Statistical Interactions](https://onlinelibrary.wiley.com/doi/10.1111/brv.12939). Spake _et al._, 2023. **Biological Reviews** \n- [Improving Quantitative Synthesis to Achieve Generality in Ecology](https://www.nature.com/articles/s41559-022-01891-z). Spake _et al._, 2022.**Nature Ecology and Evolution**\n- [Doing Meta-Analysis with R: A Hands-On Guide](https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/). Harrier _et al._ 2023.\n- [Mixed Effects Models and Extensions in Ecology with R](https://link.springer.com/book/10.1007/978-0-387-87458-6). Zuur _et al._, 2009.\n\n### Workshops & Courses\n\n- Matt Vuorre's [Bayesian Meta-Analysis with R, Stan, and brms](https://mvuorre.github.io/posts/2016-09-29-bayesian-meta-analysis/)\n\n### Websites\n\n- \n",
-    "supporting": [],
+    "markdown": "---\ntitle: \"Analysis & Modeling\"\ncode-annotations: hover\n---\n\n\n## Overview\n\nGiven the wide range in statistical training in graduate curricula (and corresponding breadth of experience among early career researchers), we'll be approaching this module in a different way than the others. One half of the module will use a \"flipped approach\" where project teams will share their proposed analyses with one another. The other half of the module will be dedicated to analyses that are more common in--or exclusive to--synthesis research. Content produced by project teams during the flipped half may be linked in the [Additional Resources](https://lter.github.io/ssecr/mod_stats.html#additional-resources) section at the bottom of this module at the discretion of each team. Otherwise the content of this module will focus only on the non-flipped content.\n\n## Learning Objectives\n\nAfter completing this module you will be able to: \n\n- <u>Describe</u> proposed analytical methods to an interested audience of mixed prior experience\n- <u>Explain</u> nuance in interpretation of results of proposed analyses\n- <u>Compare</u> and contrast interpretation of results in synthesis work versus primary research\n- <u>Identify</u> statistical tests common in synthesis research\n- <u>Perform</u> some synthesis-specific analyses\n\n## Needed Packages\n\nIf you'd like to follow along with the code chunks included throughout this module, you'll need to install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Note that these lines only need to be run once per computer\n## So you can skip this step if you've installed these before\ninstall.packages(\"tidyverse\")\ninstall.packages(\"nlme\")\n```\n:::\n\n\nWe'll go ahead and load some of these libraries as well to be able to better demonstrate these concepts.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load needed libraries\nlibrary(tidyverse)\nlibrary(nlme)\n```\n:::\n\n\n## Mixed-Effects Models\n\nIn any statistical test there is at least one response variable (a.k.a. \"dependent\" variable) and some number of explanatory variables (a.k.a. \"independent\" variables). However, in biology our experiments often involve repeated sampling over time or at the same locations in order to gather enough replicates to do robust analysis. These variables are neither response nor explanatory variables but we might reasonably conclude that they affect our response and/or explanatory variables.\n\nIn essence we want to use a statistical tool that asks <u>what is the effect of the explanatory variable(s) on the response _when the variation due to these non-variable considerations are accounted for_</u>?' Such tests are called \"mixed-effects models.\" This name derives from considering explanatory variables \"fixed effects\" and these non-explanatory/response variables as \"random effects\". Including both fixed and random effects thus creates a model with \"mixed effects.\"\n\n### Mixed-Effects Case Study\n\nLet's imagine we are researching tarantula populations for several years in the Chihuahuan Desert. Our hypothesis is that the number of tarantulas (response variable) will be greater in sites further from the nearest road (explanatory variable). We select ten study sites of varying distances from the nearest road and intensively count our furry friends for several months at three plots within each site. We return to our sites--and their associated plots--and repeat this process each year for three years. We enter our data and perform careful quality control to get it into a tidy format ready for analyis\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read in data\ntarantula_df <- read.csv(file = file.path(\"data\", \"tarantulas.csv\"))\n\n# Check structure\nstr(tarantula_df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t450 obs. of  5 variables:\n $ year           : int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...\n $ site           : chr  \"site_A\" \"site_A\" \"site_A\" \"site_A\" ...\n $ road_dist_km   : num  44 44 44 44 44 ...\n $ plot           : chr  \"plot_1\" \"plot_1\" \"plot_1\" \"plot_1\" ...\n $ tarantula_count: int  88 66 84 97 84 22 27 9 27 18 ...\n```\n\n\n:::\n:::\n\n\nOkay, the data look reasonable, but now we want to run some statistical tests and--hopefully--get some endorphine-inducing small _p_ values. If we choose to simply ignore our possible random effects, we could fit a simple linear regression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit model\ntarantula_lm <- lm(tarantula_count ~ road_dist_km, data = tarantula_df) # <1>\n\n# Extract summary\nsummary(tarantula_lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = tarantula_count ~ road_dist_km, data = tarantula_df)\n\nResiduals:\n     Min       1Q   Median       3Q      Max \n-108.648  -51.341    9.083   42.645  100.352 \n\nCoefficients:\n             Estimate Std. Error t value Pr(>|t|)    \n(Intercept)    0.3509    11.5675   0.030    0.976    \nroad_dist_km   1.6604     0.2244   7.398 6.89e-13 ***\n---\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 52.9 on 448 degrees of freedom\nMultiple R-squared:  0.1089,\tAdjusted R-squared:  0.1069 \nF-statistic: 54.73 on 1 and 448 DF,  p-value: 6.889e-13\n```\n\n\n:::\n:::\n\n1. R syntax for statistical tests is `response ~ explanatory` a.k.a. `Y ~ X`\n\nLooks like our hypothesis is supported and there does seem to be an effect of human proximity on tarantula populations in our study site. However, from our extensive time in the field we have developed a strong intuition that there might be substantial variation among plots within a given site. We can make a quick exploratory graph to facilitate an 'eyeball test' of whether the data show what our intuition suggest.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(tarantula_df, aes(y = tarantula_count, x = plot, fill = plot)) +\n  geom_violin(alpha = 0.5) + # <1>\n  geom_jitter(width = 0.25, size = 0.5) +\n  facet_wrap(site ~ .) +\n  theme_bw() +\n  theme(axis.text.x = element_text(angle = 35, hjust = 1)) # <2>\n```\n\n::: {.cell-output-display}\n![](mod_stats_files/figure-html/mem-explore-graph-1.png){fig-align='center' width=672}\n:::\n:::\n\n1. Violin plots are a nice alternative to boxplots because they allow visualizing data distributions directly rather than requiring an intutive grasp of the distribution metrics described by each bit of a boxplot\n2. This is allowing us to 'tilt' the X axis tick labels so they don't overlap with one another\n\nThis plot clearly supports our intuition that among-plot variation is dramatic! We _could_ account for this by including plot as a fixed effect but we'll need to sacrifice a lot of degrees of freedom (think 'statistical power') for a variable that we don't actually care about! Instead, we could include plot as a random effect.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Do some minor needed wrangling\ntarantula_v2 <- tarantula_df %>% \n  dplyr::mutate(site_plot = paste(gsub(pattern = \"site_\", replacement = \"\", x = site), \n                                  gsub(pattern = \"plot_\", replacement = \"\", x = plot),\n                                  sep = \"_\"),\n                .before = site)\n\n# Re-check structure\nstr(tarantula_v2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t450 obs. of  6 variables:\n $ year           : int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...\n $ site_plot      : chr  \"A_1\" \"A_1\" \"A_1\" \"A_1\" ...\n $ site           : chr  \"site_A\" \"site_A\" \"site_A\" \"site_A\" ...\n $ road_dist_km   : num  44 44 44 44 44 ...\n $ plot           : chr  \"plot_1\" \"plot_1\" \"plot_1\" \"plot_1\" ...\n $ tarantula_count: int  88 66 84 97 84 22 27 9 27 18 ...\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit the new model\ntarantula_mem1 <- nlme::lme(tarantula_count ~ road_dist_km,\n                            random = ~ 1|plot,\n                            data = tarantula_v2)\n\n# Extract summary\nsummary(tarantula_mem1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLinear mixed-effects model fit by REML\n  Data: tarantula_v2 \n       AIC      BIC    logLik\n  4808.433 4824.852 -2400.217\n\nRandom effects:\n Formula: ~1 | plot\n        (Intercept) Residual\nStdDev:     21.0756 50.01222\n\nFixed effects:  tarantula_count ~ road_dist_km \n                 Value Std.Error  DF  t-value p-value\n(Intercept)  0.3508894 16.359889 446 0.021448  0.9829\nroad_dist_km 1.6603788  0.212184 446 7.825189  0.0000\n Correlation: \n             (Intr)\nroad_dist_km -0.653\n\nStandardized Within-Group Residuals:\n        Min          Q1         Med          Q3         Max \n-1.71902526 -1.04550811  0.05769425  0.89874793  1.74986079 \n\nNumber of Observations: 450\nNumber of Groups: 3 \n```\n\n\n:::\n:::\n\n\n\n## Multi-Model Inference\n\n\n\n## Meta-Analysis\n\n\n\n## Additional Resources\n\n### Papers & Documents\n\n- [Understanding ‘It Depends’ in Ecology: A Guide to Hypothesising, Visualising and Interpreting Statistical Interactions](https://onlinelibrary.wiley.com/doi/10.1111/brv.12939). Spake _et al._, 2023. **Biological Reviews** \n- [Improving Quantitative Synthesis to Achieve Generality in Ecology](https://www.nature.com/articles/s41559-022-01891-z). Spake _et al._, 2022.**Nature Ecology and Evolution**\n- [Doing Meta-Analysis with R: A Hands-On Guide](https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/). Harrier _et al._ 2023.\n- [Mixed Effects Models and Extensions in Ecology with R](https://link.springer.com/book/10.1007/978-0-387-87458-6). Zuur _et al._, 2009.\n\n### Workshops & Courses\n\n- Matt Vuorre's [Bayesian Meta-Analysis with R, Stan, and brms](https://mvuorre.github.io/posts/2016-09-29-bayesian-meta-analysis/)\n\n### Websites\n\n- \n",
+    "supporting": [
+      "mod_stats_files"
+    ],
     "filters": [
       "rmarkdown/pagebreak.lua"
     ],

diff --git a/_freeze/mod_stats/figure-html/mem-explore-graph-1.png b/_freeze/mod_stats/figure-html/mem-explore-graph-1.png
diff --git a/mod_stats.qmd b/mod_stats.qmd
@@ -27,6 +27,17 @@ If you'd like to follow along with the code chunks included throughout this modu
 # Note that these lines only need to be run once per computer
 ## So you can skip this step if you've installed these before
 install.packages("tidyverse")
+install.packages("nlme")
+```
+
+We'll go ahead and load some of these libraries as well to be able to better demonstrate these concepts.
+
+```{r libs}
+#| message: false
+
+# Load needed libraries
+library(tidyverse)
+library(nlme)
 ```
 
 ## Mixed-Effects Models
@@ -37,22 +48,67 @@ In essence we want to use a statistical tool that asks <u>what is the effect of
 
 ### Mixed-Effects Case Study
 
-Let's imagine we are researching tarantula populations for several years in the Chihuahuan Desert. Our hypothesis is that the number of tarantulas (response variable) will be greater in sites further from the nearest road (explanatory variable). We select ten study sites of varying distances from the nearest road and intensively count our furry friends for several months at three plots within each site. We return to our sites--and their associated plots--and repeat this process each year for three years. After entering our data in MS Excel here's what we walk away with.
+Let's imagine we are researching tarantula populations for several years in the Chihuahuan Desert. Our hypothesis is that the number of tarantulas (response variable) will be greater in sites further from the nearest road (explanatory variable). We select ten study sites of varying distances from the nearest road and intensively count our furry friends for several months at three plots within each site. We return to our sites--and their associated plots--and repeat this process each year for three years. We enter our data and perform careful quality control to get it into a tidy format ready for analyis
 
+```{r mem-data}
+# Read in data
+tarantula_df <- read.csv(file = file.path("data", "tarantulas.csv"))
 
+# Check structure
+str(tarantula_df)
+```
 
-With our data in hand we now want to run some statistical tests and--hopefully--get some endorphine-inducing small _p_ values.
+Okay, the data look reasonable, but now we want to run some statistical tests and--hopefully--get some endorphine-inducing small _p_ values. If we choose to simply ignore our possible random effects, we could fit a simple linear regression.
 
-If we choose to simply ignore our possible random effects, our model would look something like this:
+```{r mem-simple-lm}
+# Fit model
+tarantula_lm <- lm(tarantula_count ~ road_dist_km, data = tarantula_df) # <1>
 
+# Extract summary
+summary(tarantula_lm)
 ```
-tarantula_count ~ road_dist_km
+1. R syntax for statistical tests is `response ~ explanatory` a.k.a. `Y ~ X`
+
+Looks like our hypothesis is supported and there does seem to be an effect of human proximity on tarantula populations in our study site. However, from our extensive time in the field we have developed a strong intuition that there might be substantial variation among plots within a given site. We can make a quick exploratory graph to facilitate an 'eyeball test' of whether the data show what our intuition suggest.
+
+```{r mem-explore-graph}
+#| fig-align: center
+#| fig-width: 7
+#| fig-height: 6
+
+ggplot(tarantula_df, aes(y = tarantula_count, x = plot, fill = plot)) +
+  geom_violin(alpha = 0.5) + # <1>
+  geom_jitter(width = 0.25, size = 0.5) +
+  facet_wrap(site ~ .) +
+  theme_bw() +
+  theme(axis.text.x = element_text(angle = 35, hjust = 1)) # <2>
 ```
+1. Violin plots are a nice alternative to boxplots because they allow visualizing data distributions directly rather than requiring an intutive grasp of the distribution metrics described by each bit of a boxplot
+2. This is allowing us to 'tilt' the X axis tick labels so they don't overlap with one another
 
+This plot clearly supports our intuition that among-plot variation is dramatic! We _could_ account for this by including plot as a fixed effect but we'll need to sacrifice a lot of degrees of freedom (think 'statistical power') for a variable that we don't actually care about! Instead, we could include plot as a random effect.
 
+```{r mem-mix1-prep}
+# Do some minor needed wrangling
+tarantula_v2 <- tarantula_df %>% 
+  dplyr::mutate(site_plot = paste(gsub(pattern = "site_", replacement = "", x = site), 
+                                  gsub(pattern = "plot_", replacement = "", x = plot),
+                                  sep = "_"),
+                .before = site)
 
-However, from our extensive time in the field we have developed a strong intuition that 
+# Re-check structure
+str(tarantula_v2)
+```
 
+```{r mem-mix1}
+# Fit the new model
+tarantula_mem1 <- nlme::lme(tarantula_count ~ road_dist_km,
+                            random = ~ 1|plot,
+                            data = tarantula_v2)
+
+# Extract summary
+summary(tarantula_mem1)
+```
 
 
 ## Multi-Model Inference