diff --git a/_freeze/mod_stats/execute-results/html.json b/_freeze/mod_stats/execute-results/html.json index 4ee5a14..949b690 100644 --- a/_freeze/mod_stats/execute-results/html.json +++ b/_freeze/mod_stats/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "13456feb8e13c6ea620b5dce556996c4", + "hash": "6bbae5bc523685a6909cfe196fbfa7cd", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Analysis & Modeling\"\ncode-annotations: hover\n---\n\n\n## Overview\n\nGiven the wide range in statistical training in graduate curricula (and corresponding breadth of experience among early career researchers), we'll be approaching this module in a different way than the others. One half of the module will use a \"flipped approach\" where project teams will share their proposed analyses with one another. The other half of the module will be dedicated to analyses that are more common in--or exclusive to--synthesis research. Content produced by project teams during the flipped half may be linked in the '[Additional Resources](https://lter.github.io/ssecr/mod_stats.html#additional-resources)' section at the bottom of this module at the discretion of each team. 
Otherwise the content of this module will focus only on the non-flipped content.\n\n## Learning Objectives\n\nAfter completing this module you will be able to: \n\n- Describe proposed analytical methods to an interested audience of mixed prior experience\n- Explain nuance in interpretation of results of proposed analyses\n- Identify some statistical tests common in synthesis research\n- Perform some synthesis-specific analyses\n\n## Needed Packages\n\nIf you'd like to follow along with the code chunks included throughout this module, you'll need to install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Note that these lines only need to be run once per computer\n## So you can skip this step if you've installed these before\ninstall.packages(\"tidyverse\")\ninstall.packages(\"lmerTest\")\n```\n:::\n\n\nWe'll go ahead and load some of these libraries as well to be able to better demonstrate these concepts.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load needed libraries\nlibrary(tidyverse)\nlibrary(lmerTest)\n```\n:::\n\n\n## Hypothesis Framing Aside\n\nBefore we dive in, we should discuss two of the ways in which you can frame your hypothesis and the differences in interpretation and appropriate statistical tool(s) that follow from that choice. We'll restrict our conversation here to **two alternate modes of thinking about your hypothesis: frequentist statistics versus multi-model inference.**\n\nNote that this is something of a false dichotomy as tools from both worlds can be/are frequently used to complement one another. However, many graduate students are trained by instructors with strong feelings about one method _in opposition to_ the other so it is worthwhile to consider these two paths separately even if you wind up using components of both in your own work.\n\n::::{.panel-tabset}\n### Frequentist Inference\n\nHypotheses here are a question of whether a variable has a \"significant\" effect on another. 
\"Significant\" has a very precise meaning in this context that has to do with _p_-values. Fundamentally, these methods focus on whether the observed relationship in the data is likely to be observed by chance alone or not. Strong effects are less likely--though not impossible--to be observed due to random chance.\n\nIf your hypothesis can be summarized as something along the lines of 'we hypothesize that X affects Y' then frequentist inference may be a more appropriate methodology.\n\nFor the purposes of SSECR, **our discussion of frequentist inference will focus on mixed-effect models**.\n\n### Multi-Model Inference\n\nHyoptheses here are a question of which variables explain the _most_ variation in the data. Methods in this framing are unconcerned--or at least less concerned than in frequentist inference--with the probability associated with a particular variable. Intead, these methods focus on which of a set of user-defined candidate models explains most of the noise in the data _even when that best model does not necessarily explain much of that variation in absolute terms_.\n\nIf your hypothesis can be summarized as something along the lines of 'we hypothesize that models including X explain more of the variation in Y than those that do not' then multi-model inference may be a more appropriate methodology.\n\nFor the purposes of SSECR, **our discussion of multi-model inference will focus on comparing model strengths with AIC**.\n\n::::\n\n## Mixed-Effects Models\n\nIn any statistical test there is at least one response variable (a.k.a. \"dependent\" variable) and some number of explanatory variables (a.k.a. \"independent\" variables). However, in biology our experiments often involve repeated sampling over time or at the same locations. 
These variables (time or site) are neither response nor explanatory variables but we might reasonably conclude that they affect our response and/or explanatory variables.\n\nIn essence, we want to use a statistical tool that asks 'what is the effect of the explanatory variable(s) on the response _when the variation due to these non-variable considerations is accounted for_?' Such tests are called **mixed-effects models**. This name derives from considering explanatory variables \"fixed effects\" and variables that are neither explanatory nor response as \"random effects\". Including both fixed and random effects thus creates a model with \"mixed effects.\"\n\n### Types of Random Effect\n\nThere are a few types of random effects but we can limit our conversation here to just two: random intercepts and random slopes.\n\n:::{.panel-tabset}\n#### Random Intercepts\n\nRandom intercepts should be used when you expect that the average response differs among levels of that variable but not in a way that changes the relationship between each level of this variable and the other variables (either fixed or random). In statistical terms, you want to allow the intercept to change with levels of this variable.\n\nFor example, let's imagine that we are studying the effect of different organic farming practices on beneficial insect populations. We build relationships with several organic farmers willing to let us conduct this research on their properties and sample the insect communities at each farm over the course of a summer. However, we know that each farm is surrounded by a different habitat type that affects the composition of the local insect community. 
It is reasonable to expect that even farms where 'the same' management method is used are likely to differ because of this difference in landscape context.\n\nIn cases like this, we don't want to include a term for 'site' as a fixed effect but we do want to account for those differences so that our assessment of the significance of our explanatory variables isn't limited by the variation due to site.\n\n#### Random Slopes\n\nRandom slopes should be used when you expect that the average response differs among levels of that variable in a way that does change with other variables.\n\nFor example, let's imagine that we are studying the effect of temperature on avian malaria rates in songbirds. We identify several sites--along a gradient of daytime temperature ranges--where our species of interest can be found, capture them, and measure malaria infection rates. However, unbeknownst to us at the start of our study, our study sites have varying populations of dragonflies, which affect local mosquito populations and malaria transmission/infection rates. If we revisit our sites repeatedly for several years, it is reasonable to expect that this difference among sites likely affects the _relationship between_ daytime temperatures and songbird malaria rates.\n\nBy including site as a random slope in this context, we can account for this effect and still analyze our explanatory variables of interest. Note that random slopes are _very_ \"data hungry\" so you may not be able to use them without very high replication in your study design.\n:::\n\n### Nested Random Effects\n\nTo further complicate matters, we can use nested random effects as well. These can be either random intercepts or random slopes though they are more commonly seen with random intercepts. 
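\n\nAs a quick syntax preview, each of these random effect structures has its own notation in the `lme4`-style formulas used by `lmerTest::lmer()` later in this module. Here is a minimal sketch (variable names are placeholders, not from a real dataset):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Hypothetical formulas contrasting the random effect structures discussed here\nri <- response ~ predictor + (1 | site) # random intercept for site\nrs <- response ~ predictor + (predictor | site) # random slope (and intercept) for site\nnested <- response ~ predictor + (1 | site/plot) # random intercepts with plot nested in site\n```\n:::\n\n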
A nested random effect accounts for the effect of one random variable that _is itself affected by another variable!_ A classic example of this is a study that uses two (or more) levels of spatial nestedness in its experimental design.\n\nFor instance, let's imagine we were conducting a global study of marine plankton biodiversity. To gather these data we took several cruises (scientific not--exclusively--pleasure) at different places around the world and during each cruise we followed a set of transects. In each transect we did several plankton tows and quantified the diversity of each tow. We can reasonably assume the following:\n\n1. Each cruise differs from each other cruise (due to any number of climatic/ecological factors)\n - But cruises within the same part of the world are still likely to have similar planktonic communities\n2. Within each cruise, each transect differs from the others (again, due to unpreventable factors)\n - But transects within the same cruise are still likely to be more similar to one another than to transects in different cruises (even other ones in the same region!)\n3. Within each transect, each plankton tow differs from the others!\n - But again, more similar to other tows in the same transect than other tows in different transects/cruises\n\nIf we put these assumptions together, we realize we want to account for the variation of cruise, transect, and tow while still retaining the nestedness of the similarity among samples. A nested random effect where transect is nested inside of cruise and tow is nested inside of transect would capture this effectively!\n\n### Philosophical Note: Random vs. Fixed\n\nDeciding whether a given variable should be a fixed or random effect can be tough. You'll likely need to rely on your scientific intuition about which feels more appropriate and then be prepared to defend that decision to your committee and/or \"reviewer \\#2\". 
It may prove helpful though to consider whether you 'care' about the effect of that variable. \n\nIf your hypothesis includes that variable then it should likely be a fixed effect. If the variable is just a facet of your experimental design but isn't something you're necessarily interested in testing, then it should likely be a random effect. And, once you've made your decision, it is totally okay to change your mind and tweak the structure of your model!\n\n:::{.callout-warning icon=\"false\"}\n#### Discussion: Random versus Fixed Effects\n\nWith a small group, decide whether you think the terms in the examples below should be fixed effects or random effects:\n\n- You study small mammal populations in urban settings\n - Should 'proportion green space' be a fixed effect or a random effect?\n- You are part of a team studying leopard seal feeding behavior\n - What type of effect should 'observer' be?\n- You study the gut microbiota of a particular beetle species\n - Should 'beetle sex' be a fixed or a random effect?\n - What about beetle life stage (e.g., larva versus adult)?\n - What about the region of the gut from which the samples were taken?\n- You study vascular plant chemical defenses against herbivory\n - Should phylogeny (i.e., evolutionary relatedness) be a fixed or random effect?\n - What about feeding guild of the herbivore?\n:::\n\n### Mixed-Effects Case Study\n\nLet's imagine we are researching tarantula populations for several years in the Chihuahuan Desert. Our hypothesis is that the number of tarantulas will be greater in sites further from the nearest road. We select ten study sites of varying distances from the nearest road and intensively count our furry friends at three plots within each site for several months. We return to our sites--and their associated plots--and repeat this process each year for three years. 
In the second year we have help from a new member of our lab but in the third year we're back to working alone (they had their own project to handle by then). We enter our data and perform careful quality control to get it into a tidy format ready for analysis.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read in data\ntarantula_df <- read.csv(file = file.path(\"data\", \"tarantulas.csv\"))\n\n# Check structure\nstr(tarantula_df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t450 obs. of 6 variables:\n $ year : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...\n $ road.dist_km : num 26.9 26.9 26.9 26.9 26.9 ...\n $ site : chr \"site_A\" \"site_A\" \"site_A\" \"site_A\" ...\n $ plot : chr \"plot_a\" \"plot_a\" \"plot_a\" \"plot_a\" ...\n $ site.plot : chr \"A_a\" \"A_a\" \"A_a\" \"A_a\" ...\n $ tarantula_count: int 12 18 18 9 18 51 45 60 45 63 ...\n```\n\n\n:::\n:::\n\n\nWith our data in hand, we now want to run some statistical tests and--hopefully--get some endorphin-inducingly small _p_-values. If we choose to simply ignore our possible random effects, we could fit a linear regression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit model\ntarantula_lm <- lm(tarantula_count ~ road.dist_km, data = tarantula_df) # <1>\n\n# Extract summary\nsummary(tarantula_lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = tarantula_count ~ road.dist_km, data = tarantula_df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-180.360 -67.508 3.415 60.415 273.640 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 4.3262 13.1676 0.329 0.743 \nroad.dist_km 2.6306 0.2745 9.582 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 90.55 on 448 degrees of freedom\nMultiple R-squared: 0.1701,\tAdjusted R-squared: 0.1682 \nF-statistic: 91.82 on 1 and 448 DF, p-value: < 2.2e-16\n```\n\n\n:::\n:::\n\n1. R syntax for statistical tests is `response ~ explanatory` a.k.a. 
`Y ~ X`\n\nThis naive test seems to support our hypothesis. However, sampling effort differed between the three study years. Not only was there a second person in the second year but we can also reasonably expect that by the third year in this system we had greatly improved our tarantula-finding skills. So, a random effect of year is definitely justified. We are not concerned that the different study years will affect the relationship between tarantula populations and road distance though so a random intercept is fine.\n\nThere could be an argument for including year as a fixed effect in its own right but some preliminary investigations reveal no significant climatic differences across the region we worked in those three years. So, while we think that years may differ from one another, that difference is not something we care to analyze.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit the new model\ntarantula_mem1 <- lmerTest::lmer(tarantula_count ~ road.dist_km + \n (1|year), # <1>\n data = tarantula_df)\n\n# Extract summary\nsummary(tarantula_mem1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLinear mixed model fit by REML. t-tests use Satterthwaite's method [\nlmerModLmerTest]\nFormula: tarantula_count ~ road.dist_km + (1 | year)\n Data: tarantula_df\n\nREML criterion at convergence: 5272.9\n\nScaled residuals: \n Min 1Q Median 3Q Max \n-2.5660 -0.8039 0.1522 0.6220 2.7981 \n\nRandom effects:\n Groups Name Variance Std.Dev.\n year (Intercept) 1548 39.34 \n Residual 7163 84.64 \nNumber of obs: 450, groups: year, 3\n\nFixed effects:\n Estimate Std. Error df t value Pr(>|t|) \n(Intercept) 4.3262 25.8335 3.1485 0.167 0.877 \nroad.dist_km 2.6306 0.2566 446.0000 10.252 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCorrelation of Fixed Effects:\n (Intr)\nroad.dst_km -0.451\n```\n\n\n:::\n:::\n\n1. 
This is the syntax for specifying a random intercept (a random slope variable would go before the `|`, where the `1` goes for a random intercept)\n\nBy including that random effect we actually get a slightly stronger effect of road distance (t value of ~9.6 without versus ~10.3 with). This is because our new random effect accounts for some of the 'noise' between study years. That actually gives us a better picture of the relationship between our response and explanatory variables.\n\nNow that we're already using a mixed-effects model, we have little excuse not to account for the other potential random effect: plot! Remember that there were three plots within each site and from our extensive time in the field we have developed a strong intuition that there might be substantial among-plot variation at each site. We can make a quick exploratory graph to facilitate an 'eyeball test' of whether the data show what our intuition suggests.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(tarantula_df, aes(y = tarantula_count, x = plot, fill = plot)) +\n geom_violin(alpha = 0.5) + # <1>\n geom_jitter(width = 0.25, size = 0.5) +\n facet_wrap(site ~ .) +\n theme_bw() +\n theme(axis.text.x = element_text(angle = 35, hjust = 1)) # <2>\n```\n\n::: {.cell-output-display}\n![](mod_stats_files/figure-html/mem-explore-graph-1.png){fig-align='center' width=672}\n:::\n:::\n\n1. Violin plots are a nice alternative to boxplots because they allow visualizing data distributions directly rather than requiring an intuitive grasp of the distribution metrics described by each bit of a boxplot\n2. This allows us to 'tilt' the X axis tick labels so they don't overlap with one another\n\nThis graph clearly supports our intuition that among-plot variation is dramatic! We _could_ account for this by including plot as a fixed effect but we'll need to sacrifice a lot of degrees of freedom (can be thought of as \"statistical power\") for a variable that we don't actually care about. 
Instead, we could include plot as another random effect.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit the new model\ntarantula_mem2 <- lmerTest::lmer(tarantula_count ~ road.dist_km + \n (1|year) + (1|site.plot), # <1>\n data = tarantula_df)\n\n# Extract summary\nsummary(tarantula_mem2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLinear mixed model fit by REML. t-tests use Satterthwaite's method [\nlmerModLmerTest]\nFormula: tarantula_count ~ road.dist_km + (1 | year) + (1 | site.plot)\n Data: tarantula_df\n\nREML criterion at convergence: 4471.1\n\nScaled residuals: \n Min 1Q Median 3Q Max \n-3.2684 -0.5698 -0.0689 0.5857 3.3902 \n\nRandom effects:\n Groups Name Variance Std.Dev.\n site.plot (Intercept) 6671.0 81.68 \n year (Intercept) 1589.6 39.87 \n Residual 881.2 29.69 \nNumber of obs: 450, groups: site.plot, 30; year, 3\n\nFixed effects:\n Estimate Std. Error df t value Pr(>|t|) \n(Intercept) 4.3262 51.6178 23.3405 0.084 0.9339 \nroad.dist_km 2.6306 0.9633 28.0001 2.731 0.0108 *\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCorrelation of Fixed Effects:\n (Intr)\nroad.dst_km -0.847\n```\n\n\n:::\n:::\n\n1. Note that we need to use this column as the random effect because plots are not uniquely named across sites (i.e., all sites have plots \"a\", \"b\", and \"c\"). Making the random effect just the 'plot' column would fail to reflect how plots are nested within each site\n\nThis test reveals that while there is still a significant relationship between road distance and tarantula populations, the effect is not nearly as strong as it was when we ignored plot-level variation. This is likely due to high (or low) average populations in a single plot skewing the site-level average. 
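\n\nOne way to quantify how much variation each random effect is absorbing is to inspect the variance components of the fitted model. Here is a minimal sketch using the `tarantula_mem2` object fitted above:\n\n::: {.cell}\n\n```{.r .cell-code}\n# Extract variance components from the plot-aware model\nvc <- as.data.frame(lme4::VarCorr(tarantula_mem2))\n\n# Rough proportion of total variance attributable to each component\nvc$prop <- vc$vcov / sum(vc$vcov)\nvc[, c(\"grp\", \"vcov\", \"prop\")]\n```\n:::\n\n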
Still, this is a result we can be more confident in because we've now accounted for all known sources of variation in our data--either by including them as fixed effects or as random effects.\n\nWe can create one more graph of our tidy data and use some aesthetic settings to make sure the nested structure of the data is clear to those looking at our work. Note that you could also use predicted values from the model itself though that choice is--arguably--a matter of personal preference.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(tarantula_df, aes(y = tarantula_count, x = road.dist_km)) +\n geom_point(aes(color = plot, shape = as.factor(year)), size = 2, alpha = 0.5) +\n geom_smooth(method = \"lm\", formula = \"y ~ x\", se = T, color = \"black\") +\n labs(y = \"Tarantula Abundance\", x = \"Distance to Nearest Road (km)\") +\n theme_bw()\n```\n\n::: {.cell-output-display}\n![](mod_stats_files/figure-html/mem-final-graph-1.png){fig-align='center' width=672}\n:::\n:::\n\n\n## Multi-Model Inference\n\n\n\n## Meta-Analysis\n\n\n\n## Additional Resources\n\n### Papers & Documents\n\n- [Understanding ‘It Depends’ in Ecology: A Guide to Hypothesising, Visualising and Interpreting Statistical Interactions](https://onlinelibrary.wiley.com/doi/10.1111/brv.12939). Spake _et al._, 2023. **Biological Reviews** \n- [Improving Quantitative Synthesis to Achieve Generality in Ecology](https://www.nature.com/articles/s41559-022-01891-z). Spake _et al._, 2022. **Nature Ecology and Evolution**\n- [Doing Meta-Analysis with R: A Hands-On Guide](https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/). Harrer _et al._, 2023.\n- [Mixed Effects Models and Extensions in Ecology with R](https://link.springer.com/book/10.1007/978-0-387-87458-6). 
Zuur _et al._, 2009.\n\n### Workshops & Courses\n\n- Matt Vuorre's [Bayesian Meta-Analysis with R, Stan, and brms](https://mvuorre.github.io/posts/2016-09-29-bayesian-meta-analysis/)\n\n### Websites\n\n- \n", + "markdown": "---\ntitle: \"Analysis & Modeling\"\ncode-annotations: hover\n---\n\n\n## Overview\n\nGiven the wide range in statistical training in graduate curricula (and corresponding breadth of experience among early career researchers), we'll be approaching this module in a different way than the others. One half of the module will use a \"flipped approach\" where project teams will share their proposed analyses with one another. The other half of the module will be dedicated to analyses that are more common in--or exclusive to--synthesis research. Content produced by project teams during the flipped half may be linked in the '[Additional Resources](https://lter.github.io/ssecr/mod_stats.html#additional-resources)' section at the bottom of this module at the discretion of each team. 
Otherwise the content of this module will focus only on the non-flipped content.\n\n## Learning Objectives\n\nAfter completing this module you will be able to: \n\n- Describe proposed analytical methods to an interested audience of mixed prior experience\n- Explain nuance in interpretation of results of proposed analyses\n- Identify some statistical tests common in synthesis research\n- Perform some synthesis-specific analyses\n\n## Needed Packages\n\nIf you'd like to follow along with the code chunks included throughout this module, you'll need to install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Note that these lines only need to be run once per computer\n## So you can skip this step if you've installed these before\ninstall.packages(\"tidyverse\")\ninstall.packages(\"lmerTest\")\n```\n:::\n\n\nWe'll go ahead and load some of these libraries as well to be able to better demonstrate these concepts.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load needed libraries\nlibrary(tidyverse)\nlibrary(lmerTest)\n```\n:::\n\n\n## Hypothesis Framing Aside\n\nBefore we dive in, we should discuss two of the ways in which you can frame your hypothesis and the differences in interpretation and appropriate statistical tool(s) that follow from that choice. We'll restrict our conversation here to **two alternate modes of thinking about your hypothesis: frequentist statistics versus multi-model inference.**\n\nNote that this is something of a false dichotomy as tools from both worlds can be/are frequently used to complement one another. However, many graduate students are trained by instructors with strong feelings about one method _in opposition to_ the other so it is worthwhile to consider these two paths separately even if you wind up using components of both in your own work.\n\n::::{.panel-tabset}\n### Frequentist Inference\n\nHypotheses here are a question of whether a variable has a \"significant\" effect on another. 
\"Significant\" has a very precise meaning in this context that has to do with _p_-values. Fundamentally, these methods focus on whether the observed relationship in the data is likely to be observed by chance alone or not. Strong effects are less likely--though not impossible--to be observed due to random chance.\n\nIf your hypothesis can be summarized as something along the lines of 'we hypothesize that X affects Y' then frequentist inference may be a more appropriate methodology.\n\nFor the purposes of SSECR, **our discussion of frequentist inference will focus on mixed-effect models**.\n\n### Multi-Model Inference\n\nHyoptheses here are a question of which variables explain the _most_ variation in the data. Methods in this framing are unconcerned--or at least less concerned than in frequentist inference--with the probability associated with a particular variable. Intead, these methods focus on which of a set of user-defined candidate models explains most of the noise in the data _even when that best model does not necessarily explain much of that variation in absolute terms_.\n\nIf your hypothesis can be summarized as something along the lines of 'we hypothesize that models including X explain more of the variation in Y than those that do not' then multi-model inference may be a more appropriate methodology.\n\nFor the purposes of SSECR, **our discussion of multi-model inference will focus on comparing model strengths with AIC**.\n\n::::\n\n## Mixed-Effects Models\n\nIn any statistical test there is at least one response variable (a.k.a. \"dependent\" variable) and some number of explanatory variables (a.k.a. \"independent\" variables). However, in biology our experiments often involve repeated sampling over time or at the same locations. 
These variables (time or site) are neither response nor explanatory variables but we might reasonably conclude that they affect our response and/or explanatory variables.\n\nIn essence, we want to use a statistical tool that asks 'what is the effect of the explanatory variable(s) on the response _when the variation due to these non-variable considerations is accounted for_?' Such tests are called **mixed-effects models**. This name derives from considering explanatory variables \"fixed effects\" and variables that are neither explanatory nor response as \"random effects\". Including both fixed and random effects thus creates a model with \"mixed effects.\"\n\n### Types of Random Effect\n\nThere are a few types of random effects but we can limit our conversation here to just two: random intercepts and random slopes.\n\n:::{.panel-tabset}\n#### Random Intercepts\n\nRandom intercepts should be used when you expect that the average response differs among levels of that variable but not in a way that changes the relationship between each level of this variable and the other variables (either fixed or random). In statistical terms, you want to allow the intercept to change with levels of this variable.\n\nFor example, let's imagine that we are studying the effect of different organic farming practices on beneficial insect populations. We build relationships with several organic farmers willing to let us conduct this research on their properties and sample the insect communities at each farm over the course of a summer. However, we know that each farm is surrounded by a different habitat type that affects the composition of the local insect community. 
It is reasonable to expect that even farms where 'the same' management method is used are likely to differ because of this difference in landscape context.\n\nIn cases like this, we don't want to include a term for 'site' as a fixed effect but we do want to account for those differences so that our assessment of the significance of our explanatory variables isn't limited by the variation due to site.\n\n#### Random Slopes\n\nRandom slopes should be used when you expect that the average response differs among levels of that variable in a way that does change with other variables.\n\nFor example, let's imagine that we are studying the effect of temperature on avian malaria rates in songbirds. We identify several sites--along a gradient of daytime temperature ranges--where our species of interest can be found, capture them, and measure malaria infection rates. However, unbeknownst to us at the start of our study, our study sites have varying populations of dragonflies, which affect local mosquito populations and malaria transmission/infection rates. If we revisit our sites repeatedly for several years, it is reasonable to expect that this difference among sites likely affects the _relationship between_ daytime temperatures and songbird malaria rates.\n\nBy including site as a random slope in this context, we can account for this effect and still analyze our explanatory variables of interest. Note that random slopes are _very_ \"data hungry\" so you may not be able to use them without very high replication in your study design.\n:::\n\n### Nested Random Effects\n\nTo further complicate matters, we can use nested random effects as well. These can be either random intercepts or random slopes though they are more commonly seen with random intercepts. 
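\n\nAs a quick syntax preview, each of these random effect structures has its own notation in the `lme4`-style formulas used by `lmerTest::lmer()` later in this module. Here is a minimal sketch (variable names are placeholders, not from a real dataset):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Hypothetical formulas contrasting the random effect structures discussed here\nri <- response ~ predictor + (1 | site) # random intercept for site\nrs <- response ~ predictor + (predictor | site) # random slope (and intercept) for site\nnested <- response ~ predictor + (1 | site/plot) # random intercepts with plot nested in site\n```\n:::\n\n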
A nested random effect accounts for the effect of one random variable that _is itself affected by another variable!_ A classic example of this is a study that uses two (or more) levels of spatial nestedness in its experimental design.\n\nFor instance, let's imagine we were conducting a global study of marine plankton biodiversity. To gather these data we took several cruises (scientific not--exclusively--pleasure) at different places around the world and during each cruise we followed a set of transects. In each transect we did several plankton tows and quantified the diversity of each tow. We can reasonably assume the following:\n\n1. Each cruise differs from each other cruise (due to any number of climatic/ecological factors)\n - But cruises within the same part of the world are still likely to have similar planktonic communities\n2. Within each cruise, each transect differs from the others (again, due to unpreventable factors)\n - But transects within the same cruise are still likely to be more similar to one another than to transects in different cruises (even other ones in the same region!)\n3. Within each transect, each plankton tow differs from the others!\n - But again, more similar to other tows in the same transect than other tows in different transects/cruises\n\nIf we put these assumptions together, we realize we want to account for the variation of cruise, transect, and tow while still retaining the nestedness of the similarity among samples. A nested random effect where transect is nested inside of cruise and tow is nested inside of transect would capture this effectively!\n\n### Philosophical Note: Random vs. Fixed\n\nDeciding whether a given variable should be a fixed or random effect can be tough. You'll likely need to rely on your scientific intuition about which feels more appropriate and then be prepared to defend that decision to your committee and/or \"reviewer \\#2\". 
It may prove helpful though to consider whether you 'care' about the effect of that variable. \n\nIf your hypothesis includes that variable then it should likely be a fixed effect. If the variable is just a facet of your experimental design but isn't something you're necessarily interested in testing, then it should likely be a random effect. And, once you've made your decision, it is totally okay to change your mind and tweak the structure of your model!\n\n:::{.callout-warning icon=\"false\"}\n#### Discussion: Random versus Fixed Effects\n\nWith a small group, decide whether you think the terms in the examples below should be fixed effects or random effects:\n\n- You study small mammal populations in urban settings\n - Should 'proportion green space' be a fixed effect or a random effect?\n- You are part of a team studying leopard seal feeding behavior\n - What type of effect should 'observer' be?\n- You study the gut microbiota of a particular beetle species\n - Should 'beetle sex' be a fixed or a random effect?\n - What about beetle life stage (e.g., larva versus adult)?\n - What about the region of the gut from which the samples were taken?\n- You study vascular plant chemical defenses against herbivory\n - Should phylogeny (i.e., evolutionary relatedness) be a fixed or random effect?\n - What about feeding guild of the herbivore?\n:::\n\n### Mixed-Effects Case Study\n\nLet's imagine we are researching tarantula populations for several years in the Chihuahuan Desert. Our hypothesis is that the number of tarantulas will be greater in sites further from the nearest road. We select ten study sites of varying distances from the nearest road and intensively count our furry friends at three plots within each site for several months. We return to our sites--and their associated plots--and repeat this process each year for three years. 
In the second year we have help from a new member of our lab but in the third year we're back to working alone (they had their own project to handle by then). We enter our data and perform careful quality control to get it into a tidy format ready for analysis.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read in data\ntarantula_df <- read.csv(file = file.path(\"data\", \"tarantulas.csv\"))\n\n# Check structure\nstr(tarantula_df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t450 obs. of 6 variables:\n $ year : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...\n $ road.dist_km : num 26.9 26.9 26.9 26.9 26.9 ...\n $ site : chr \"site_A\" \"site_A\" \"site_A\" \"site_A\" ...\n $ plot : chr \"plot_a\" \"plot_a\" \"plot_a\" \"plot_a\" ...\n $ site.plot : chr \"A_a\" \"A_a\" \"A_a\" \"A_a\" ...\n $ tarantula_count: int 12 18 18 9 18 51 45 60 45 63 ...\n```\n\n\n:::\n:::\n\n\nWith our data in hand, we now want to run some statistical tests and--hopefully--get some endorphin-inducingly small _p_-values. If we choose to simply ignore our possible random effects, we could fit a linear regression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit model\ntarantula_lm <- lm(tarantula_count ~ road.dist_km, data = tarantula_df) # <1>\n\n# Extract summary\nsummary(tarantula_lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = tarantula_count ~ road.dist_km, data = tarantula_df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-180.360 -67.508 3.415 60.415 273.640 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 4.3262 13.1676 0.329 0.743 \nroad.dist_km 2.6306 0.2745 9.582 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 90.55 on 448 degrees of freedom\nMultiple R-squared: 0.1701,\tAdjusted R-squared: 0.1682 \nF-statistic: 91.82 on 1 and 448 DF, p-value: < 2.2e-16\n```\n\n\n:::\n:::\n\n1. R syntax for statistical tests is `response ~ explanatory` a.k.a. 
`Y ~ X`\n\nThis naive test seems to support our hypothesis. However, sampling effort differed among the three study years. Not only was there a second person in the second year but we can also reasonably expect that by the third year in this system we had greatly improved our tarantula-finding skills. So, a random effect of year is definitely justified. We are not concerned that the different study years will affect the relationship between tarantula populations and road distance, though, so a random intercept (rather than a random slope) is fine.\n\nThere could be an argument for including year as a fixed effect in its own right but some preliminary investigations reveal no significant climatic differences across the region we worked in those three years. So, while we think that years may differ from one another, that difference is not something we care to analyze.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit the new model\ntarantula_mem1 <- lmerTest::lmer(tarantula_count ~ road.dist_km + \n (1|year), # <1>\n data = tarantula_df)\n\n# Extract summary\nsummary(tarantula_mem1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLinear mixed model fit by REML. t-tests use Satterthwaite's method [\nlmerModLmerTest]\nFormula: tarantula_count ~ road.dist_km + (1 | year)\n Data: tarantula_df\n\nREML criterion at convergence: 5272.9\n\nScaled residuals: \n Min 1Q Median 3Q Max \n-2.5660 -0.8039 0.1522 0.6220 2.7981 \n\nRandom effects:\n Groups Name Variance Std.Dev.\n year (Intercept) 1548 39.34 \n Residual 7163 84.64 \nNumber of obs: 450, groups: year, 3\n\nFixed effects:\n Estimate Std. Error df t value Pr(>|t|) \n(Intercept) 4.3262 25.8335 3.1485 0.167 0.877 \nroad.dist_km 2.6306 0.2566 446.0000 10.252 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCorrelation of Fixed Effects:\n (Intr)\nroad.dst_km -0.451\n```\n\n\n:::\n:::\n\n1. 
This is the syntax for specifying a random intercept (for a random slope, the slope variable goes before the `|`, in place of the `1` used for a random intercept)\n\nBy including that random effect we actually get a slightly stronger effect of road distance (t value of ~9.6 without versus ~10.3 with). This is because our new random effect accounts for some of the 'noise' between study years. That actually gives us a better picture of the relationship between our response and explanatory variables.\n\nNow that we're already using a mixed-effects model, we have little excuse not to account for the other potential random effect: plot! Remember that there were three plots within each site and from our extensive time in the field we have developed a strong intuition that there might be substantial among-plot variation at each site. We can make a quick exploratory graph to facilitate an 'eyeball test' of whether the data show what our intuition suggests.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(tarantula_df, aes(y = tarantula_count, x = plot, fill = plot)) +\n geom_violin(alpha = 0.5) + # <1>\n geom_jitter(width = 0.25, size = 0.5) +\n facet_wrap(site ~ .) +\n theme_bw() +\n theme(axis.text.x = element_text(angle = 35, hjust = 1)) # <2>\n```\n\n::: {.cell-output-display}\n![](mod_stats_files/figure-html/mem-explore-graph-1.png){fig-align='center' width=672}\n:::\n:::\n\n1. Violin plots are a nice alternative to boxplots because they allow visualizing data distributions directly rather than requiring an intuitive grasp of the distribution metrics described by each bit of a boxplot\n2. This 'tilts' the X axis tick labels so they don't overlap with one another\n\nThis graph clearly supports our intuition that among-plot variation is dramatic! We _could_ account for this by including plot as a fixed effect but we'd need to sacrifice a lot of degrees of freedom (which can be thought of as \"statistical power\") for a variable that we don't actually care about. 
Instead, we could include plot as another random effect.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit the new model\ntarantula_mem2 <- lmerTest::lmer(tarantula_count ~ road.dist_km + \n (1|year) + (1|site.plot), # <1>\n data = tarantula_df)\n\n# Extract summary\nsummary(tarantula_mem2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLinear mixed model fit by REML. t-tests use Satterthwaite's method [\nlmerModLmerTest]\nFormula: tarantula_count ~ road.dist_km + (1 | year) + (1 | site.plot)\n Data: tarantula_df\n\nREML criterion at convergence: 4471.1\n\nScaled residuals: \n Min 1Q Median 3Q Max \n-3.2684 -0.5698 -0.0689 0.5857 3.3902 \n\nRandom effects:\n Groups Name Variance Std.Dev.\n site.plot (Intercept) 6671.0 81.68 \n year (Intercept) 1589.6 39.87 \n Residual 881.2 29.69 \nNumber of obs: 450, groups: site.plot, 30; year, 3\n\nFixed effects:\n Estimate Std. Error df t value Pr(>|t|) \n(Intercept) 4.3262 51.6178 23.3405 0.084 0.9339 \nroad.dist_km 2.6306 0.9633 28.0001 2.731 0.0108 *\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCorrelation of Fixed Effects:\n (Intr)\nroad.dst_km -0.847\n```\n\n\n:::\n:::\n\n1. Note that we need to use this column as the random effect because plots are not uniquely named across sites (i.e., all sites have plots \"a\", \"b\", and \"c\"). Making the random effect just the 'plot' column would fail to reflect how plots are nested within each site\n\nThis test reveals that there is still a significant relationship between road distance and tarantula abundance, but the effect is not nearly as strong as it was when we ignored plot-level variation. This is likely due to high (or low) average populations in a single plot skewing the site-level average. 
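One way to see that plot-level skew directly is to pull the estimated random intercepts out of the fitted model. This is just a quick sketch and assumes the `tarantula_mem2` object fit above is still in your environment:

```r
# Estimated deviation of each site/plot combination from the overall intercept
# (large absolute values flag plots that pull their site's average up or down)
head(ranef(tarantula_mem2)$site.plot)

# Variance components on their own (same numbers as in the summary above)
VarCorr(tarantula_mem2)
```

`ranef()` and `VarCorr()` both come from `lme4`, which loads automatically with `lmerTest`.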
Still, this is a result we can be more confident in because we've now accounted for all known sources of variation in our data--either by including them as fixed effects or including them as random effects.\n\nWe can create one more graph of our tidy data and use some aesthetic settings to make sure the nested structure of the data is clear to those looking at our work. Note that you could also use predicted values from the model itself though that choice is--arguably--a matter of personal preference.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(tarantula_df, aes(y = tarantula_count, x = road.dist_km)) +\n geom_point(aes(color = plot, shape = as.factor(year)), size = 2, alpha = 0.5) +\n geom_smooth(method = \"lm\", formula = \"y ~ x\", se = T, color = \"black\") +\n labs(y = \"Tarantula Abundance\", x = \"Distance to Nearest Road (km)\") +\n theme_bw()\n```\n\n::: {.cell-output-display}\n![](mod_stats_files/figure-html/mem-final-graph-1.png){fig-align='center' width=672}\n:::\n:::\n\n\n## Multi-Model Inference\n\nRegardless of your choice of statistical test, multi-model inference may be an appropriate method to use to assess your hypothesis. As stated earlier, this frames your research question as a case of which variables _best_ explain the data rather than the likelihood of the observed effect relating to any variable in particular. \n\nTo begin, it can be helpful to write out all possible \"candidate models\". For instance, let's say that you measured some response variable (Y) and several potential explanatory variables (X, W, and Z).\n\n\n\n\nNote that for this method to be appropriate you need to fit the same type of model in all cases!\n\n## Meta-Analysis\n\n\n\n## Additional Resources\n\n### Papers & Documents\n\n- [Understanding ‘It Depends’ in Ecology: A Guide to Hypothesising, Visualising and Interpreting Statistical Interactions](https://onlinelibrary.wiley.com/doi/10.1111/brv.12939). Spake _et al._, 2023. 
**Biological Reviews** \n- [Improving Quantitative Synthesis to Achieve Generality in Ecology](https://www.nature.com/articles/s41559-022-01891-z). Spake _et al._, 2022. **Nature Ecology and Evolution**\n- [Doing Meta-Analysis with R: A Hands-On Guide](https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/). Harrer _et al._, 2023.\n- [Mixed Effects Models and Extensions in Ecology with R](https://link.springer.com/book/10.1007/978-0-387-87458-6). Zuur _et al._, 2009.\n\n### Workshops & Courses\n\n- Matt Vuorre's [Bayesian Meta-Analysis with R, Stan, and brms](https://mvuorre.github.io/posts/2016-09-29-bayesian-meta-analysis/)\n\n### Websites\n\n- \n", "supporting": [ "mod_stats_files" ], diff --git a/_freeze/mod_stats/figure-html/mem-explore-graph-1.png b/_freeze/mod_stats/figure-html/mem-explore-graph-1.png index 05bb0d2..84f012d 100644 Binary files a/_freeze/mod_stats/figure-html/mem-explore-graph-1.png and b/_freeze/mod_stats/figure-html/mem-explore-graph-1.png differ diff --git a/mod_stats.qmd b/mod_stats.qmd index 6de8cbf..43fe300 100644 --- a/mod_stats.qmd +++ b/mod_stats.qmd @@ -218,7 +218,14 @@ ggplot(tarantula_df, aes(y = tarantula_count, x = road.dist_km)) + ## Multi-Model Inference +Regardless of your choice of statistical test, multi-model inference may be an appropriate method to use to assess your hypothesis. As stated earlier, this frames your research question as a case of which variables _best_ explain the data rather than the likelihood of the observed effect relating to any variable in particular. +To begin, it can be helpful to write out all possible "candidate models". For instance, let's say that you measured some response variable (Y) and several potential explanatory variables (X, W, and Z). + + + + +Note that for this method to be appropriate you need to fit the same type of model in all cases! ## Meta-Analysis