diff --git a/_freeze/mod_data-viz/execute-results/html.json b/_freeze/mod_data-viz/execute-results/html.json index 9d333df..93bb454 100644 --- a/_freeze/mod_data-viz/execute-results/html.json +++ b/_freeze/mod_data-viz/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "66d8fe3dc8a092ba43c2ddbeb0cdd0af", + "hash": "ea0c3f5b162db7f0404647c2c2ab7d01", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Data Visualization & Exploration\"\ncode-annotations: hover\n---\n\n\n## Overview\n\nData visualization is a fundamental part of working with data. Visualization is not only useful in the final stages of a project for making publication figures; it can also be hugely valuable for quality control and hypothesis development. This module focuses on the fundamentals of graph creation in an effort to empower you to apply those methods in the various contexts where you might find visualization to be helpful.\n\n## Learning Objectives\n\nAfter completing this module you will be able to: \n\n- Define fundamental `ggplot2` vocabulary\n- Identify appropriate graph types for given data type/distribution\n- Discuss differences between presentation- and publication-quality graphs\n- Explain how your graphs can be made more accessible\n\n## Needed Packages\n\nIf you'd like to follow along with the code chunks included throughout this module, you'll need to install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Note that these lines only need to be run once per computer\n## So you can skip this step if you've installed these before\ninstall.packages(\"tidyverse\")\ninstall.packages(\"lterdatasampler\")\ninstall.packages(\"supportR\")\ninstall.packages(\"cowplot\")\n```\n:::\n\n\nWe'll go ahead and load some of these libraries as well to be able to better demonstrate these concepts.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load needed libraries\nlibrary(tidyverse)\n```\n:::\n\n\n## Graphing with `ggplot2`\n\n### `ggplot2` Fundamentals\n\nYou 
may already be familiar with the `ggplot2` package in R but if you are not, it is a popular graphing library based on [The Grammar of Graphics](https://bookshop.org/p/books/the-grammar-of-graphics-leland-wilkinson/1518348?ean=9780387245447). Every ggplot is composed of four elements:\n\n1. A 'core' `ggplot` function call\n2. Aesthetics\n3. Geometries\n4. Theme\n\nNote that the theme component may be implicit in some graphs because there is a suite of default theme elements that applies unless otherwise specified. \n\nThis module will use example data to demonstrate these tools but as we work through these topics you should feel free to substitute a dataset of your choosing! If you don't have one in mind, you can use the example dataset shown in the code chunks throughout this module. This dataset comes from the [`lterdatasampler` R package](https://lter.github.io/lterdatasampler/) and the data are about fiddler crabs (_Minuca pugnax_) at the [Plum Island Ecosystems (PIE) LTER](https://pie-lter.ecosystems.mbl.edu/welcome-plum-island-ecosystems-lter) site.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the lterdatasampler package\nlibrary(lterdatasampler)\n\n# Load the fiddler crab dataset\ndata(pie_crab)\n\n# Check its structure\nstr(pie_crab)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [392 × 9] (S3: tbl_df/tbl/data.frame)\n $ date : Date[1:392], format: \"2016-07-24\" \"2016-07-24\" ...\n $ latitude : num [1:392] 30 30 30 30 30 30 30 30 30 30 ...\n $ site : chr [1:392] \"GTM\" \"GTM\" \"GTM\" \"GTM\" ...\n $ size : num [1:392] 12.4 14.2 14.5 12.9 12.4 ...\n $ air_temp : num [1:392] 21.8 21.8 21.8 21.8 21.8 ...\n $ air_temp_sd : num [1:392] 6.39 6.39 6.39 6.39 6.39 ...\n $ water_temp : num [1:392] 24.5 24.5 24.5 24.5 24.5 ...\n $ water_temp_sd: num [1:392] 6.12 6.12 6.12 6.12 6.12 ...\n $ name : chr [1:392] \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" 
...\n```\n\n\n:::\n:::\n\n\nWith a dataset in hand, let's make a scatterplot of crab size on the Y-axis with latitude on the X. We'll forgo doing anything to the theme elements at this point to focus on the other three elements.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) + # <1>\n geom_point(pch = 21, size = 2, alpha = 0.5) # <2>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-1-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. We're defining both the data and the X/Y aesthetics in this top-level bit of the plot. Also, note that each line ends with a plus sign\n2. Because we defined the data and aesthetics in the `ggplot()` function call above, this geometry can assume those mappings without re-specifying\n\nWe can improve on this graph by tweaking theme elements to make it use fewer of the default settings.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) +\n geom_point(pch = 21, size = 2, alpha = 0.5) +\n theme(legend.title = element_blank(), # <1>\n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-2-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. All theme elements require these `element_...` helper functions. `element_blank` removes theme elements but otherwise you'll need to use the helper function that corresponds to the type of theme element (e.g., `element_text` for theme elements affecting graph text)\n\n### Multiple Geometries\n\nWe can further modify `ggplot2` graphs by adding _multiple_ geometries if you find it valuable to do so. Note however that geometry order matters! Geometries added later will be \"in front of\" those added earlier. 
Also, adding too much data to a plot will begin to make it difficult for others to understand the central take-away of the graph so you may want to be careful about the level of information density in each graph. Let's add boxplots behind the points to characterize the distribution of points more quantitatively.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) +\n geom_boxplot(pch = 21) + # <1>\n geom_point(pch = 21, size = 2, alpha = 0.5) +\n theme(legend.title = element_blank(), \n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-3-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. By putting the boxplot geometry first we ensure that it doesn't cover up the points that overlap with the 'box' part of each boxplot\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P1)\n\nIn a script, attempt the following with one of your own or your group's datasets:\n\n- Make a graph using `ggplot2`\n - Include at least one geometry\n - Include at least one aesthetic (beyond X/Y axes)\n - Modify at least one theme element from the default\n\n:::\n\n### Multiple Data Objects\n\n`ggplot2` also supports adding more than one data object to the same graph! While this module doesn't cover map creation, maps are a common example of a graph with more than one data object. 
Another common use would be to include both the full dataset and some summarized version of it in the same plot.\n\nLet's calculate some summary statistics of crab size to include that in our plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the supportR library\nlibrary(supportR)\n\n# Summarize crab size within latitude groups\ncrab_summary <- supportR::summary_table(data = pie_crab, groups = c(\"site\", \"latitude\"),\n response = \"size\", drop_na = TRUE)\n\n# Check the structure\nstr(crab_summary)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t13 obs. of 6 variables:\n $ site : chr \"BC\" \"CC\" \"CT\" \"DB\" ...\n $ latitude : num 42.2 41.9 41.3 39.1 30 39.6 41.6 33.3 42.7 34.7 ...\n $ mean : num 16.2 16.8 14.7 15.6 12.4 ...\n $ std_dev : num 4.81 2.05 2.36 2.12 1.8 2.72 2.29 2.42 2.3 2.34 ...\n $ sample_size: int 37 27 33 30 28 30 29 30 28 25 ...\n $ std_error : num 0.79 0.39 0.41 0.39 0.34 0.5 0.43 0.44 0.43 0.47 ...\n```\n\n\n:::\n:::\n\n\nWith this data object in hand, we can make a graph that includes both this and the original, unsummarized crab data. To better focus on the 'multiple data objects' bit of this example we'll pare down the actual graph code.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot() + # <1>\n geom_point(pie_crab, mapping = aes(x = latitude, y = size, fill = site),\n pch = 21, size = 2, alpha = 0.2) + \n geom_errorbar(crab_summary, mapping = aes(x = latitude, # <2>\n ymax = mean + std_error,\n ymin = mean - std_error),\n width = 0.2) +\n geom_point(crab_summary, mapping = aes(x = latitude, y = mean, fill = site),\n pch = 23, size = 3) + \n theme(legend.title = element_blank(),\n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-4-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. 
If you want multiple data objects in the same `ggplot2` graph you need to leave this top-level `ggplot()` call _empty!_ Otherwise you'll get weird errors with aesthetics later in the graph\n2. This geometry adds the error bars and it's important that we add it before the summarized data points themselves if we want the error bars to be 'behind' their respective points\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P2)\n\nIn a script, attempt the following:\n\n- Add a second data object to the graph you made in the preceding activity\n - _Hint:_ If your first graph is unsummarized, add a summarized version (or vice versa)\n\n:::\n\n## Streamlining Graph Aesthetics\n\nSynthesis projects often generate an entire network of inter-related papers. Ensuring that all graphs across papers from a given team have a similar \"feel\" is a nice way of implying a certain standard of robustness for all of your group's projects. However, copy/pasting the theme elements of your graphs can (A) be cumbersome to do even once and (B) need to be re-done every time you make a change anywhere. Fortunately, there is a better way!\n\n`ggplot2` supports adding theme elements to an object that can then be reused as needed elsewhere. This is the same theory behind wrapping repeated operations into custom functions.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Define core theme elements\ntheme_synthesis <- theme(legend.position = \"none\",\n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"),\n axis.text = element_text(size = 13)) # <1>\n\n# Create a graph\nggplot(pie_crab, aes(y = water_temp, x = air_temp, color = size, size = size)) +\n geom_point() +\n theme_synthesis +\n theme(legend.position = \"right\") # <2>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/std-theme-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. This theme element controls the text on the tick marks. 
`axis.title` controls the text in the _labels_ of the axes\n2. As a bonus, subsequent uses of `theme()` will replace defaults defined in your earlier theme object. So, you can design a set of theme elements that are _usually_ appropriate and then easily change just some of them as needed\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P3)\n\nIn a script, attempt the following:\n\n- Remove all theme edits from the graph you made in the preceding activity and assign them to a separate object\n - Then add that object to your graph\n- Make a second (different) graph and add your consolidated theme object to that graph as well\n\n:::\n\n## Multi-Panel Graphs\n\nIt is sometimes the case that you want to make a single graph file that has multiple panels. Many of us might default to creating the separate graphs that we want, exporting them, and then using software like Microsoft PowerPoint to stitch those panels into the single image we had in mind from the start. However, as all of us who have used this method know, this is hugely cumbersome when your advisor/committee/reviewers ask for edits and you now have to redo all of the manual work behind your multi-panel graph. \n\nFortunately, there are two nice, entirely scripted alternatives that you might consider: **Faceted graphs** and **Plot grids**. See below for more information on both.\n\n:::{.panel-tabset}\n### Facets\n\nIn a faceted graph, every panel of the graph has the same aesthetics. These are often used when you want to show the relationship between two (or more) variables but separated by some other variable. In synthesis work, you might show the relationship between your core response and explanatory variables but facet by the original study. 
This would leave you with one panel per study where each would show the relationship only for that particular study.\n\nLet's check out an example.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(pie_crab, aes(x = date, y = size, color = site)) +\n geom_point(size = 2) +\n facet_wrap(. ~ site) + # <1>\n theme_bw() +\n theme(legend.position = \"none\") # <2>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/facet-1-1.png){fig-align='center' width=576}\n:::\n:::\n\n1. This is a `ggplot2` function that assumes you want panels laid out in a regular grid. There are other `facet_...` alternatives that let you specify row versus column arrangement. You could also facet by multiple variables by putting something to the left of the tilde\n2. We can remove the legend because the site names are in the facet titles in the gray boxes\n\n### Plot Grids\n\nIn a plot grid, each panel is completely independent of all others. These are often used in publications where you want to highlight several _different_ relationships that have some thematic connection. In synthesis work, your hypotheses may be more complicated than in primary research and such a plot grid would then be necessary to put all visual evidence for a hypothesis in the same location. 
On a practical note, plot grids are also a common way of circumventing figure number limits enforced by journals.\n\nLet's check out an example that relies on the `cowplot` library.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Load a needed library\nlibrary(cowplot)\n\n# Create the first graph\ncrab_p1 <- ggplot(pie_crab, aes(x = site, y = size, fill = site)) + # <1>\n geom_violin() +\n coord_flip() + # <2>\n theme_bw() +\n theme(legend.position = \"none\")\n\n# Create the second\ncrab_p2 <- ggplot(pie_crab, aes(x = air_temp, y = water_temp)) +\n geom_errorbar(aes(ymax = water_temp + water_temp_sd, ymin = water_temp - water_temp_sd),\n width = 0.1) +\n geom_errorbarh(aes(xmax = air_temp + air_temp_sd, xmin = air_temp - air_temp_sd), # <3>\n height = 0.1) +\n geom_point(aes(fill = site), pch = 23, size = 3) +\n theme_bw()\n\n# Assemble into a plot grid\ncowplot::plot_grid(crab_p1, crab_p2, labels = \"AUTO\", nrow = 1) # <4>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/grid-1-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. Note that we're assigning these graphs to objects!\n2. This is a handy function for flipping X and Y axes without re-mapping the aesthetics\n3. This geometry is responsible for _horizontal_ error bars (note the \"h\" at the end of the function name)\n4. The `labels = \"AUTO\"` argument means that each panel of the plot grid gets the next sequential capital letter. You could also substitute that for a vector with labels of your choosing\n:::\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P4)\n\nIn a script, attempt the following:\n\n- Assemble the two graphs you made in the preceding two activities into the appropriate type of multi-panel graph\n\n:::\n\n## Accessibility Considerations\n\nAfter you've made the graphs you need, it is good practice to revisit them to ensure that they are as accessible as possible. 
You can of course also do this during the graph construction process but it is sometimes less onerous to tackle as a penultimate step in the figure creation process. There are many facets to accessibility and we've tried to cover just a few of them below.\n\n### Color Choice\n\nOne of the more well-known facets of accessibility in data visualization is choosing colors that are \"colorblind safe\". Such palettes still create distinctive colors for those with various forms of color blindness (e.g., deuteranomaly, protanomaly, etc.). The classic red-green heatmap for instance is very colorblind unsafe in that people with some forms of colorblindness cannot distinguish between those colors (hence the rise of the yellow-blue heatmap in recent years). Unfortunately, the `ggplot2` default rainbow palette--while nice for exploratory purposes--_is not_ colorblind safe.\n\nSome websites (such as [colorbrewer2.org](https://colorbrewer2.org/#type=sequential&scheme=YlGnBu&n=9)) include a simple checkbox for colorblindness safety, which automatically limits the listed options to those that are colorblind safe. Alternatively, you could use a browser plug-in (such as [Let's get color blind](https://chromewebstore.google.com/detail/lets-get-color-blind/bkdgdianpkfahpkmphgehigalpighjck) on Google Chrome) to simulate colorblindness on a particular page.\n\nOne extreme approach you could take is to dodge this issue entirely and format your graphs such that color either isn't used at all or only conveys information that is also conveyed in another graph aesthetic. We don't necessarily recommend this as color--when the palette is chosen correctly--can be a really nice way of making information-dense graphs more informative and easily navigable by viewers.\n\n### Multiple Modalities\n\nRelated to the color conversation is the value of mapping multiple aesthetics to the same variable. 
By presenting information in multiple ways--even if that seems redundant--you enable a wider audience to gain an intuitive sense of what you're trying to display.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, \n fill = site, shape = site)) + # <1>\n geom_jitter(size = 2, width = 0.1, alpha = 0.6) + \n scale_shape_manual(values = c(21:25, 21:25, 21:23)) + # <2>\n theme_bw() +\n theme(legend.title = element_blank())\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/multi-modal-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. In this graph we're mapping both the fill and shape aesthetics to site\n2. This is a little cumbersome but there are only five 'fill-able' shapes in R so we need to reuse some of them to have a unique one for each site. Using fill-able shapes is nice because you get a crisp black border around each point. See `?pch` for all available shapes\n\nIn the above graph, even though the rainbow palette is not ideal for reasons mentioned earlier, it is now much easier to tell the difference between sites with similar colors. For instance, \"NB\", \"NIB\", and \"PIE\" are all shades of light blue/teal. Now that they have unique shapes it is dramatically easier to look at the graph and identify which points correspond to which site.\n\n\n:::{.callout-warning icon=\"false\"}\n#### Discussion: Graph Accessibility\n\nWith a group discuss (some of) the following questions:\n\n- What are other facets of accessibility that you think are important to consider when making data visualizations?\n- What changes do you make to your graphs to increase accessibility?\n - What changes _could_ you make going forward?\n\n:::\n\n\n### Presentation vs. Publication\n\nOne final element of accessibility to consider is the difference between a '_presentation_-quality' graph and a '_publication_-quality' one. 
While it may be tempting to create a single version of a given graph and use it in both contexts, doing so is likely to be less effective in getting your point across than making small tweaks to two separate versions of what is otherwise the same graph.\n\n:::{.panel-tabset}\n### Presentation-Focused\n\n**Do:**\n\n- Increase size of text/points **greatly**\n - If possible, sit in the back row of the room where you'll present and look at your graphs from there\n- _Consider_ adding graph elements that highlight certain graph regions\n- Present summarized data (increases focus on big-picture trends and avoids discussion of minutiae)\n- Map multiple aesthetics to the same variables\n\n**Don't:**\n\n- Use technical language / jargon\n- Include _unnecessary_ background elements\n- Use multi-panel graphs (either faceted or plot grid)\n - If you have multiple graph panels, put each on its own slide!\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(crab_summary, aes(x = latitude, y = mean, \n shape = reorder(site, latitude), # <1>\n fill = reorder(site, latitude))) +\n geom_vline(xintercept = 36.5, color = \"black\", linetype = 1) +\n geom_vline(xintercept = 41.5, color = \"black\", linetype = 2) + # <2>\n geom_errorbar(mapping = aes(ymax = mean + std_error, ymin = mean - std_error),\n width = 0.2) +\n geom_point(size = 4) + \n scale_shape_manual(values = c(21:25, 21:25, 21:23)) +\n labs(x = \"Latitude\", y = \"Mean Crab Size (mm)\") + # <3>\n theme(legend.title = element_blank(),\n axis.line = element_line(color = \"black\"),\n panel.background = element_blank(),\n axis.title = element_text(size = 17),\n axis.text = element_text(size = 15))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/talk-graph-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. We can use the `reorder` function to make the order of sites in the legend (from top to bottom) match the order of sites in the graph (from left to right)\n2. 
Adding vertical lines at particular parts in the graph can make comparisons within the same graph easier\n3. `labs` lets us customize the title and label text of a graph\n\n### Publication-Focused\n\n**Do:**\n\n- Increase size of text/points **slightly**\n - You want to be legible but you can more safely assume that many readers will be able to increase the zoom of their browser window if needed\n- Present un-summarized data (with or without summarized points included)\n - Many reviewers will want to get a sense for the \"real\" data so you should include unsummarized values wherever possible\n- Use multi-panel graphs\n - If multiple graphs \"tell a story\" together, then they should be included in the same file!\n- Map multiple aesthetics to the same variables\n- If publishing in a journal available in print, check to make sure your graph still makes sense in grayscale\n - There are nice browser plug-ins (like [Grayscale the Web](https://chromewebstore.google.com/detail/grayscale-the-web-save-si/mblmpdpfppogibmoobibfannckeeleag) for Google Chrome) for this too\n\n**Don't:**\n\n- Include _unnecessary_ background elements\n- Add graph elements that highlight certain graph regions\n - You can--and should--lean more heavily on the text of your publication to discuss particular areas of a graph\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot() +\n geom_point(pie_crab, mapping = aes(x = latitude, y = size,\n color = reorder(site, latitude)),\n pch = 19, size = 1, alpha = 0.3) +\n geom_errorbar(crab_summary, mapping = aes(x = latitude, y = mean, \n ymax = mean + std_error, \n ymin = mean - std_error),\n width = 0.2) +\n geom_point(crab_summary, mapping = aes(x = latitude, y = mean, \n shape = reorder(site, latitude),\n fill = reorder(site, latitude)),\n size = 4) +\n scale_shape_manual(values = c(21:25, 21:25, 21:23)) +\n labs(x = \"Latitude\", y = \"Mean Crab Carapace Width (mm)\") + # <1>\n theme(legend.title = element_blank(),\n axis.line = 
element_line(color = \"black\"),\n panel.background = element_blank(),\n axis.title = element_text(size = 15),\n axis.text = element_text(size = 13))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/pub-graph-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. Here we are using a reasonable amount of technical language\n\n:::\n\n## Additional Resources\n\n### Papers & Documents\n\n- NCEAS [Colorblind Safe Color Schemes](https://www.nceas.ucsb.edu/sites/default/files/2022-06/Colorblind%20Safe%20Color%20Schemes.pdf) reference document\n\n### Workshops & Courses\n\n- NCEAS Scientific Computing team's Coding in the Tidyverse workshop [`ggplot2` module](https://nceas.github.io/scicomp-workshop-tidyverse/visualize.html)\n- The Carpentries' Data Analysis and Visualization in R for Ecologists [`ggplot2` episode](https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html)\n\n\n### Websites\n\n- \n", + "markdown": "---\ntitle: \"Data Visualization & Exploration\"\ncode-annotations: hover\n---\n\n\n## Overview\n\nData visualization is a fundamental part of working with data. Visualization is not only useful in the final stages of a project for making publication figures; it can also be hugely valuable for quality control and hypothesis development. 
This module focuses on the fundamentals of graph creation in an effort to empower you to apply those methods in the various contexts where you might find visualization to be helpful.\n\n## Learning Objectives\n\nAfter completing this module you will be able to: \n\n- Define fundamental `ggplot2` vocabulary\n- Identify appropriate graph types for given data type/distribution\n- Discuss differences between presentation- and publication-quality graphs\n- Explain how your graphs can be made more accessible\n\n## Needed Packages\n\nIf you'd like to follow along with the code chunks included throughout this module, you'll need to install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Note that these lines only need to be run once per computer\n## So you can skip this step if you've installed these before\ninstall.packages(\"tidyverse\")\ninstall.packages(\"lterdatasampler\")\ninstall.packages(\"supportR\")\ninstall.packages(\"cowplot\")\n```\n:::\n\n\nWe'll go ahead and load some of these libraries as well to be able to better demonstrate these concepts.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load needed libraries\nlibrary(tidyverse)\n```\n:::\n\n\n## Graphing with `ggplot2`\n\n### `ggplot2` Fundamentals\n\nYou may already be familiar with the `ggplot2` package in R but if you are not, it is a popular graphing library based on [The Grammar of Graphics](https://bookshop.org/p/books/the-grammar-of-graphics-leland-wilkinson/1518348?ean=9780387245447). Every ggplot is composed of four elements:\n\n1. A 'core' `ggplot` function call\n2. Aesthetics\n3. Geometries\n4. Theme\n\nNote that the theme component may be implicit in some graphs because there is a suite of default theme elements that applies unless otherwise specified. \n\nThis module will use example data to demonstrate these tools but as we work through these topics you should feel free to substitute a dataset of your choosing! 
If you don't have one in mind, you can use the example dataset shown in the code chunks throughout this module. This dataset comes from the [`lterdatasampler` R package](https://lter.github.io/lterdatasampler/) and the data are about fiddler crabs (_Minuca pugnax_) at the [Plum Island Ecosystems (PIE) LTER](https://pie-lter.ecosystems.mbl.edu/welcome-plum-island-ecosystems-lter) site.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the lterdatasampler package\nlibrary(lterdatasampler)\n\n# Load the fiddler crab dataset\ndata(pie_crab)\n\n# Check its structure\nstr(pie_crab)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [392 × 9] (S3: tbl_df/tbl/data.frame)\n $ date : Date[1:392], format: \"2016-07-24\" \"2016-07-24\" ...\n $ latitude : num [1:392] 30 30 30 30 30 30 30 30 30 30 ...\n $ site : chr [1:392] \"GTM\" \"GTM\" \"GTM\" \"GTM\" ...\n $ size : num [1:392] 12.4 14.2 14.5 12.9 12.4 ...\n $ air_temp : num [1:392] 21.8 21.8 21.8 21.8 21.8 ...\n $ air_temp_sd : num [1:392] 6.39 6.39 6.39 6.39 6.39 ...\n $ water_temp : num [1:392] 24.5 24.5 24.5 24.5 24.5 ...\n $ water_temp_sd: num [1:392] 6.12 6.12 6.12 6.12 6.12 ...\n $ name : chr [1:392] \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" ...\n```\n\n\n:::\n:::\n\n\nWith a dataset in hand, let's make a scatterplot of crab size on the Y-axis with latitude on the X. We'll forgo doing anything to the theme elements at this point to focus on the other three elements.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) + # <1>\n geom_point(pch = 21, size = 2, alpha = 0.5) # <2>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-1-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. We're defining both the data and the X/Y aesthetics in this top-level bit of the plot. 
Also, note that each line ends with a plus sign\n2. Because we defined the data and aesthetics in the `ggplot()` function call above, this geometry can assume those mappings without re-specifying\n\nWe can improve on this graph by tweaking theme elements to make it use fewer of the default settings.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) +\n geom_point(pch = 21, size = 2, alpha = 0.5) +\n theme(legend.title = element_blank(), # <1>\n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-2-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. All theme elements require these `element_...` helper functions. `element_blank` removes theme elements but otherwise you'll need to use the helper function that corresponds to the type of theme element (e.g., `element_text` for theme elements affecting graph text)\n\n### Multiple Geometries\n\nWe can further modify `ggplot2` graphs by adding _multiple_ geometries if you find it valuable to do so. Note however that geometry order matters! Geometries added later will be \"in front of\" those added earlier. Also, adding too much data to a plot will begin to make it difficult for others to understand the central take-away of the graph so you may want to be careful about the level of information density in each graph. 
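\n\nThe ordering rule is easiest to see by drawing the same layers both ways. Here is a brief illustrative sketch (not part of the module's rendered output, and assuming `pie_crab` is loaded as above) that draws the points first and the boxplots second, so the boxes hide many of the points:\n\n```{.r .cell-code}\n# Illustrative only: points first, then boxplots -- the boxes are drawn 'in front'\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) +\n geom_point(pch = 21, size = 2, alpha = 0.5) +\n geom_boxplot(pch = 21)\n```\n\nSwapping the order of the two geometry lines restores the points-on-top arrangement built in the next chunk.\n\n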
Let's add boxplots behind the points to characterize the distribution of points more quantitatively.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) +\n geom_boxplot(pch = 21) + # <1>\n geom_point(pch = 21, size = 2, alpha = 0.5) +\n theme(legend.title = element_blank(), \n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-3-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. By putting the boxplot geometry first we ensure that it doesn't cover up the points that overlap with the 'box' part of each boxplot\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P1)\n\nIn a script, attempt the following with one of your own or your group's datasets:\n\n- Make a graph using `ggplot2`\n - Include at least one geometry\n - Include at least one aesthetic (beyond X/Y axes)\n - Modify at least one theme element from the default\n\n:::\n\n### Multiple Data Objects\n\n`ggplot2` also supports adding more than one data object to the same graph! While this module doesn't cover map creation, maps are a common example of a graph with more than one data object. Another common use would be to include both the full dataset and some summarized version of it in the same plot.\n\nLet's calculate some summary statistics of crab size to include that in our plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the supportR library\nlibrary(supportR)\n\n# Summarize crab size within latitude groups\ncrab_summary <- supportR::summary_table(data = pie_crab, groups = c(\"site\", \"latitude\"),\n response = \"size\", drop_na = TRUE)\n\n# Check the structure\nstr(crab_summary)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t13 obs. 
of 6 variables:\n $ site : chr \"BC\" \"CC\" \"CT\" \"DB\" ...\n $ latitude : num 42.2 41.9 41.3 39.1 30 39.6 41.6 33.3 42.7 34.7 ...\n $ mean : num 16.2 16.8 14.7 15.6 12.4 ...\n $ std_dev : num 4.81 2.05 2.36 2.12 1.8 2.72 2.29 2.42 2.3 2.34 ...\n $ sample_size: int 37 27 33 30 28 30 29 30 28 25 ...\n $ std_error : num 0.79 0.39 0.41 0.39 0.34 0.5 0.43 0.44 0.43 0.47 ...\n```\n\n\n:::\n:::\n\n\nWith this data object in-hand, we can make a graph that includes both this and the original, unsummarized crab data. To better focus on the 'multiple data objects' bit of this example we'll pare down the actual graph code.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot() + # <1>\n geom_point(pie_crab, mapping = aes(x = latitude, y = size, fill = site),\n pch = 21, size = 2, alpha = 0.2) + \n geom_errorbar(crab_summary, mapping = aes(x = latitude, # <2>\n ymax = mean + std_error,\n ymin = mean - std_error),\n width = 0.2) +\n geom_point(crab_summary, mapping = aes(x = latitude, y = mean, fill = site),\n pch = 23, size = 3) + \n theme(legend.title = element_blank(),\n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/gg-4-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. If you want multiple data objects in the same `ggplot2` graph you need to leave this top level `ggplot()` call _empty!_ Otherwise you'll get weird errors with aesthetics later in the graph\n2. 
This geometry adds the error bars and it's important that we add it before the summarized data points themselves if we want the error bars to be 'behind' their respective points\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P2)\n\nIn a script, attempt the following:\n\n- Add a second data object to the graph you made in the preceding activity\n - _Hint:_ If your first graph is unsummarized, add a summarized version (or vice versa)\n\n:::\n\n## Streamlining Graph Aesthetics\n\nSynthesis projects often generate an entire network of inter-related papers. Ensuring that all graphs across papers from a given team have a similar \"feel\" is a nice way of implying a certain standard of robustness for all of your group's projects. However, copy/pasting the theme elements of your graphs (A) can be cumbersome to do even once and (B) must be re-done every time you make a change anywhere. Fortunately, there is a better way!\n\n`ggplot2` supports adding theme elements to an object that can then be reused as needed elsewhere. This is the same theory behind wrapping repeated operations into custom functions.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Define core theme elements\ntheme_synthesis <- theme(legend.position = \"none\",\n panel.background = element_blank(),\n axis.line = element_line(color = \"black\"),\n axis.text = element_text(size = 13)) # <1>\n\n# Create a graph\nggplot(pie_crab, aes(y = water_temp, x = air_temp, color = size, size = size)) +\n geom_point() +\n theme_synthesis +\n theme(legend.position = \"right\") # <2>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/std-theme-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. This theme element controls the text on the tick marks. `axis.title` controls the text in the _labels_ of the axes\n2. As a bonus, subsequent uses of `theme()` will replace defaults defined in your earlier theme object. 
So, you can design a set of theme elements that are _usually_ appropriate and then easily change just some of them as needed\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P3)\n\nIn a script, attempt the following:\n\n- Remove all theme edits from the graph you made in the preceding activity and assign them to a separate object\n - Then add that object to your graph\n- Make a second (different) graph and add your consolidated theme object to that graph as well\n\n:::\n\n## Multi-Panel Graphs\n\nIt is sometimes the case that you want to make a single graph file that has multiple panels. For many of us, we might default to creating the separate graphs that we want, exporting them, and then using software like Microsoft PowerPoint to stitch those panels into the single image we had in mind from the start. However, as all of us who have used this method know, this is hugely cumbersome when your advisor/committee/reviewers ask for edits and you now have to redo all of the manual work behind your multi-panel graph. \n\nFortunately, there are two nice entirely scripted alternatives that you might consider: **Faceted graphs** and **Plot grids**. See below for more information on both.\n\n:::{.panel-tabset}\n### Facets\n\nIn a faceted graph, every panel of the graph has the same aesthetics. These are often used when you want to show the relationship between two (or more) variables but separated by some other variable. In synthesis work, you might show the relationship between your core response and explanatory variables but facet by the original study. This would leave you with one panel per study where each would show the relationship only at that particular study.\n\nLet's check out an example.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(pie_crab, aes(x = date, y = size, color = site))+\n geom_point(size = 2) +\n facet_wrap(. 
~ site) + # <1>\n theme_bw() +\n theme(legend.position = \"none\") # <2>\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/facet-1-1.png){fig-align='center' width=576}\n:::\n:::\n\n1. This is a `ggplot2` function that assumes you want panels laid out in a regular grid. There are other `facet_...` alternatives that let you specify row versus column arrangement. You could also facet by multiple variables by putting something to the left of the tilde\n2. We can remove the legend because the site names are in the facet titles in the gray boxes\n\n### Plot Grids\n\nIn a plot grid, each panel is completely independent of all others. These are often used in publications where you want to highlight several _different_ relationships that have some thematic connection. In synthesis work, your hypotheses may be more complicated than in primary research and such a plot grid would then be necessary to put all visual evidence for a hypothesis in the same location. On a practical note, plot grids are also a common way of circumventing figure number limits enforced by journals.\n\nLet's check out an example that relies on the `cowplot` library.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Load a needed library\nlibrary(cowplot)\n\n# Create the first graph\ncrab_p1 <- ggplot(pie_crab, aes(x = site, y = size, fill = site)) + # <1>\n geom_violin() +\n coord_flip() + # <2>\n theme_bw() +\n theme(legend.position = \"none\")\n\n# Create the second\ncrab_p2 <- ggplot(pie_crab, aes(x = air_temp, y = water_temp)) +\n geom_errorbar(aes(ymax = water_temp + water_temp_sd, ymin = water_temp - water_temp_sd),\n width = 0.1) +\n geom_errorbarh(aes(xmax = air_temp + air_temp_sd, xmin = air_temp - air_temp_sd), # <3>\n width = 0.1) +\n geom_point(aes(fill = site), pch = 23, size = 3) +\n theme_bw()\n\n# Assemble into a plot grid\ncowplot::plot_grid(crab_p1, crab_p2, labels = \"AUTO\", nrow = 1) # <4>\n```\n\n::: 
{.cell-output-display}\n![](mod_data-viz_files/figure-html/grid-1-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. Note that we're assigning these graphs to objects!\n2. This is a handy function for flipping X and Y axes without re-mapping the aesthetics\n3. This geometry is responsible for _horizontal_ error bars (note the \"h\" at the end of the function name)\n4. The `labels = \"AUTO\"` argument means that each panel of the plot grid gets the next sequential capital letter. You could also substitute that for a vector with labels of your choosing\n:::\n\n:::{.callout-note icon=\"false\"}\n#### Activity: Graph Creation (P4)\n\nIn a script, attempt the following:\n\n- Assemble the two graphs you made in the preceding two activities into the appropriate type of multi-panel graph\n\n:::\n\n## Accessibility Considerations\n\nAfter you've made the graphs you need, it is good practice to revisit them to ensure that they are as accessible as possible. You can of course also do this during the graph construction process but it is sometimes less onerous to tackle as a penultimate step in the figure creation process. There are many facets to accessibility and we've tried to cover just a few of them below.\n\n### Color Choice\n\nOne of the more well-known facets of accessibility in data visualization is choosing colors that are \"colorblind safe\". Such palettes still create distinctive colors for those with various forms of color blindness (e.g., deuteranomaly, protanomaly, etc.). The classic red-green heatmap for instance is very colorblind unsafe in that people with some forms of colorblindness cannot distinguish between those colors (hence the rise of the yellow-blue heatmap in recent years). 
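One ready-made fix worth knowing about: `ggplot2` itself ships with the viridis scales (e.g., `scale_fill_viridis_d()` for discrete fill variables), which are designed to remain distinguishable under common forms of colorblindness. A quick sketch reusing the crab graph from earlier might look like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Same scatterplot as before but with a colorblind safe fill palette\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, fill = site)) +\n geom_point(pch = 21, size = 2, alpha = 0.5) +\n scale_fill_viridis_d() +\n theme_bw()\n```\n:::\n\n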
Unfortunately, the `ggplot2` default rainbow palette--while nice for exploratory purposes--_is not_ colorblind safe.\n\nSome websites (such as [colorbrewer2.org](https://colorbrewer2.org/#type=sequential&scheme=YlGnBu&n=9)) include a simple checkbox for colorblindness safety which automatically limits the listed options to those that are colorblind safe. Alternatively, you could use a browser plug-in (such as [Let's get color blind](https://chromewebstore.google.com/detail/lets-get-color-blind/bkdgdianpkfahpkmphgehigalpighjck) on Google Chrome) to simulate colorblindness on a particular page.\n\nOne extreme approach you could take is to dodge this issue entirely and format your graphs such that color either isn't used at all or only conveys information that is also conveyed in another graph aesthetic. We don't necessarily recommend this as color--when the palette is chosen correctly--can be a really nice way of making information-dense graphs more informative and easily-navigable by viewers.\n\n### Multiple Modalities\n\nRelated to the color conversation is the value of mapping multiple aesthetics to the same variable. By presenting information in multiple ways--even if that seems redundant--you enable a wider audience to gain an intuitive sense of what you're trying to display.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(data = pie_crab, mapping = aes(x = latitude, y = size, \n fill = site, shape = site)) + # <1>\n geom_jitter(size = 2, width = 0.1, alpha = 0.6) + \n scale_shape_manual(values = c(21:25, 21:25, 21:23)) + # <2>\n theme_bw() +\n theme(legend.title = element_blank())\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/multi-modal-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. In this graph we're mapping both the fill and shape aesthetics to site\n2. This is a little cumbersome but there are only five 'fill-able' shapes in R so we need to reuse some of them to have a unique one for each site. 
Using fill-able shapes is nice because you get a crisp black border around each point. See `?pch` for all available shapes\n\nIn the above graph, even though the rainbow palette is not ideal for reasons mentioned earlier, it is now much easier to tell the difference between sites with similar colors. For instance, \"NB\", \"NIB\", and \"PIE\" are all shades of light blue/teal. Now that they have unique shapes it is dramatically easier to look at the graph and identify which points correspond to which site.\n\n\n:::{.callout-warning icon=\"false\"}\n#### Discussion: Graph Accessibility\n\nWith a group discuss (some of) the following questions:\n\n- What are other facets of accessibility that you think are important to consider when making data visualizations?\n- What changes do you make to your graphs to increase accessibility?\n - What changes _could_ you make going forward?\n\n:::\n\n\n### Presentation vs. Publication\n\nOne final element of accessibility to consider is the difference between a '_presentation_-quality' graph and a '_publication_-quality' one. 
While it may be tempting to create a single version of a given graph and use it in both contexts that is likely to be less effective in helping you to get your point across than making small tweaks to two separate versions of what is otherwise the same graph.\n\n:::{.panel-tabset}\n### Presentation-Focused\n\n**Do:**\n\n- Increase size of text/points **greatly**\n - If possible, sit in the back row of the room where you'll present and look at your graphs from there\n- _Consider_ adding graph elements that highlight certain graph regions\n- Present summarized data (increases focus on big-picture trends and avoids discussion of minutiae)\n- Map multiple aesthetics to the same variables\n\n**Don't:**\n\n- Use technical language / jargon\n- Include _unnecessary_ background elements\n- Use multi-panel graphs (either faceted or plot grid)\n - If you have multiple graph panels, put each on its own slide!\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(crab_summary, aes(x = latitude, y = mean, \n shape = reorder(site, latitude), # <1>\n fill = reorder(site, latitude))) +\n geom_vline(xintercept = 36.5, color = \"black\", linetype = 1) +\n geom_vline(xintercept = 41.5, color = \"black\", linetype = 2) + # <2>\n geom_errorbar(mapping = aes(ymax = mean + std_error, ymin = mean - std_error),\n width = 0.2) +\n geom_point(size = 4) + \n scale_shape_manual(values = c(21:25, 21:25, 21:23)) +\n labs(x = \"Latitude\", y = \"Mean Crab Size (mm)\") + # <3>\n theme(legend.title = element_blank(),\n axis.line = element_line(color = \"black\"),\n panel.background = element_blank(),\n axis.title = element_text(size = 17),\n axis.text = element_text(size = 15))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/talk-graph-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. We can use the `reorder` function to make the order of sites in the legend (from top to bottom) match the order of sites in the graph (from left to right)\n2. 
Adding vertical lines at particular parts in the graph can make comparisons within the same graph easier\n3. `labs` lets us customize the title and label text of a graph\n\n### Publication-Focused\n\n**Do:**\n\n- Increase size of text/points **slightly**\n - You want to be legible but you can more safely assume that many readers will be able to increase the zoom of their browser window if needed\n- Present un-summarized data (with or without summarized points included)\n - Many reviewers will want to get a sense for the \"real\" data so you should include unsummarized values wherever possible\n- Use multi-panel graphs\n - If multiple graphs \"tell a story\" together, then they should be included in the same file!\n- Map multiple aesthetics to the same variables\n- If publishing in a journal available in print, check to make sure your graph still makes sense in grayscale\n - There are nice browser plug-ins (like [Grayscale the Web](https://chromewebstore.google.com/detail/grayscale-the-web-save-si/mblmpdpfppogibmoobibfannckeeleag) for Google Chrome) for this too\n\n**Don't:**\n\n- Include _unnecessary_ background elements\n- Add graph elements that highlight certain graph regions\n - You can--and should--lean more heavily on the text of your publication to discuss particular areas of a graph\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot() +\n geom_point(pie_crab, mapping = aes(x = latitude, y = size,\n color = reorder(site, latitude)),\n pch = 19, size = 1, alpha = 0.3) +\n geom_errorbar(crab_summary, mapping = aes(x = latitude, y = mean, \n ymax = mean + std_error, \n ymin = mean - std_error),\n width = 0.2) +\n geom_point(crab_summary, mapping = aes(x = latitude, y = mean, \n shape = reorder(site, latitude),\n fill = reorder(site, latitude)),\n size = 4) +\n scale_shape_manual(values = c(21:25, 21:25, 21:23)) +\n labs(x = \"Latitude\", y = \"Mean Crab Carapace Width (mm)\") + # <1>\n theme(legend.title = element_blank(),\n axis.line = 
element_line(color = \"black\"),\n panel.background = element_blank(),\n axis.title = element_text(size = 15),\n axis.text = element_text(size = 13))\n```\n\n::: {.cell-output-display}\n![](mod_data-viz_files/figure-html/pub-graph-1.png){fig-align='center' width=864}\n:::\n:::\n\n1. Here we are using a reasonable amount of technical language\n\n:::\n\n## Ordination\n\nIf you are working with multivariate data (i.e., data where multiple columns are all response variables collectively) you may find ordination helpful. Ordination is the general term for many types of multivariate visualization but is typically used to refer to visualizing a distance or dissimilarity measure of the data. Such measures collapse all of those columns of response variables into fewer (typically two) index values that are easier to visualize. Common examples of this include Principal Components Analysis (PCA), Non-Metric Multidimensional Scaling (NMS / NMDS), or Principal Coordinates Analysis (PCoA / \"metric multidimensional scaling\").\n\n## Maps\n\nYou may find it valuable to create a map as an additional way of visualizing data. 
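As a minimal sketch of that idea--assuming the `maps` package is installed, since `ggplot2::map_data()` relies on it--a base map onto which site coordinates could later be layered might look like:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Hypothetical sketch: grab US state outlines (requires the 'maps' package)\nstate_df <- ggplot2::map_data(\"state\")\n\n# Draw the base map; site points could be added with another geometry\nggplot(state_df, aes(x = long, y = lat, group = group)) +\n geom_polygon(fill = \"gray90\", color = \"white\") +\n coord_quickmap() +\n theme_void()\n```\n:::\n\n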
Many synthesis groups do this--particularly when there is a strong spatial component to the research questions and/or hypotheses.\n\nCheck out the [bonus spatial data module](https://lter.github.io/ssecr/mod_spatial.html) for more information on map-making if this is of interest!\n\n## Additional Resources\n\n### Papers & Documents\n\n- NCEAS [Colorblind Safe Color Schemes](https://www.nceas.ucsb.edu/sites/default/files/2022-06/Colorblind%20Safe%20Color%20Schemes.pdf) reference document\n\n### Workshops & Courses\n\n- NCEAS Scientific Computing team's Coding in the Tidyverse workshop [`ggplot2` module](https://nceas.github.io/scicomp-workshop-tidyverse/visualize.html)\n- The Carpentries' Data Analysis and Visualization in R for Ecologists [`ggplot2` episode](https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html)\n\n\n### Websites\n\n- \n", "supporting": [ "mod_data-viz_files" ], diff --git a/_freeze/mod_data-viz/figure-html/multi-modal-1.png b/_freeze/mod_data-viz/figure-html/multi-modal-1.png index eea7e00..d6b4403 100644 Binary files a/_freeze/mod_data-viz/figure-html/multi-modal-1.png and b/_freeze/mod_data-viz/figure-html/multi-modal-1.png differ diff --git a/_freeze/mod_spatial/execute-results/html.json b/_freeze/mod_spatial/execute-results/html.json index 6062541..4be56d6 100644 --- a/_freeze/mod_spatial/execute-results/html.json +++ b/_freeze/mod_spatial/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "bf53f688e30e538d8f9679505425afd4", + "hash": "a4eb01ff58493741083fae5f9ee93b5f", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Working with Spatial Data\"\ncode-annotations: hover\n---\n\n\n## Overview\n\nUnder Construction\n\n## Learning Objectives\n\nAfter completing this topic you will be able to: \n\n- Define characteristics of common types of spatial data\n- Manipulate spatial data with R\n- Integrate spatial data with tabular data\n\n## Needed Packages\n\nIf you'd like to follow along with the code chunks included 
throughout this module, you'll need to install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Note that these lines only need to be run once per computer\n## So you can skip this step if you've installed these before\ninstall.packages(\"tidyverse\")\n```\n:::\n\n\n## Module Content\n\n\n\n## Additional Spatial Resources\n\n### Papers & Documents\n\n- \n\n### Workshops & Courses\n\n- The Carpentries' [Introduction to Geospatial Raster & Vector Data with R](https://datacarpentry.org/r-raster-vector-geospatial/)\n- The Carpentries' [Introduction to R for Geospatial Data](https://datacarpentry.org/r-intro-geospatial/index.html)\n- Arctic Data Center's [Spatial and Image Data Using GeoPandas](https://learning.nceas.ucsb.edu/2023-03-arctic/sections/geopandas.html) chapter of their Scalable Computing course\n- Jason Flower's (UC Santa Barbara) [Introduction to rasters with `terra`](https://jflowernet.github.io/intro-terra-ecodatascience/)\n\n### Websites\n\n- NASA's Application for Extracting and Exploring Analysis Ready Samples [(AppEEARS) Portal](https://appeears.earthdatacloud.nasa.gov/)\n", + "markdown": "---\ntitle: \"Working with Spatial Data\"\ncode-annotations: hover\n---\n\n\n## Overview\n\nUnder Construction\n\n## Learning Objectives\n\nAfter completing this topic you will be able to: \n\n- Define characteristics of common types of spatial data\n- Manipulate spatial data with R\n- Integrate spatial data with tabular data\n\n## Needed Packages\n\nIf you'd like to follow along with the code chunks included throughout this module, you'll need to install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Note that these lines only need to be run once per computer\n## So you can skip this step if you've installed these before\ninstall.packages(\"tidyverse\")\n```\n:::\n\n\n## Raster versus Vector Data\n\n\n\n## Coordinate Reference Systems\n\n\n\n## Making Maps\n\n\n\n## Extracting Spatial Data\n\n\n\n## Additional Spatial 
Resources\n\n### Papers & Documents\n\n- \n\n### Workshops & Courses\n\n- The Carpentries' [Introduction to Geospatial Raster & Vector Data with R](https://datacarpentry.org/r-raster-vector-geospatial/)\n- The Carpentries' [Introduction to R for Geospatial Data](https://datacarpentry.org/r-intro-geospatial/index.html)\n- Arctic Data Center's [Spatial and Image Data Using GeoPandas](https://learning.nceas.ucsb.edu/2023-03-arctic/sections/geopandas.html) chapter of their Scalable Computing course\n- Jason Flower's (UC Santa Barbara) [Introduction to rasters with `terra`](https://jflowernet.github.io/intro-terra-ecodatascience/)\n\n### Websites\n\n- NASA's Application for Extracting and Exploring Analysis Ready Samples [(AppEEARS) Portal](https://appeears.earthdatacloud.nasa.gov/)\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/mod_stats/execute-results/html.json b/_freeze/mod_stats/execute-results/html.json index a69c8e2..55bad3f 100644 --- a/_freeze/mod_stats/execute-results/html.json +++ b/_freeze/mod_stats/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "a224c2f8c96837d8c5c2a355ba34f696", + "hash": "547cf725e79f46ddf23c525fed7ee287", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Analysis & Modeling\"\ncode-annotations: hover\n---\n\n\n## Overview\n\nGiven the wide range in statistical training in graduate curricula (and corresponding breadth of experience among early career researchers), we'll be approaching this module in a different way than the others. One half of the module will use a \"flipped approach\" where project teams will share their proposed analyses with one another. The other half of the module will be dedicated to analyses that are more common in--or exclusive to--synthesis research. 
Content produced by project teams during the flipped half may be linked in the '[Additional Resources](https://lter.github.io/ssecr/mod_stats.html#additional-resources)' section at the bottom of this module at the discretion of each team. Otherwise the content of this module will focus only on the non-flipped content.\n\n## Learning Objectives\n\nAfter completing this module you will be able to: \n\n- Describe proposed analytical methods to an interested audience of mixed prior experience\n- Explain nuance in interpretation of results of proposed analyses\n- Identify some statistical tests common in synthesis research\n- Perform some synthesis-specific analyses\n\n## Needed Packages\n\nIf you'd like to follow along with the code chunks included throughout this module, you'll need to install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Note that these lines only need to be run once per computer\n## So you can skip this step if you've installed these before\ninstall.packages(\"tidyverse\")\ninstall.packages(\"lmerTest\")\ninstall.packages(\"palmerpenguins\")\n```\n:::\n\n\nWe'll go ahead and load some of these libraries as well to be able to better demonstrate these concepts.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load needed libraries\nlibrary(tidyverse)\nlibrary(lmerTest)\nlibrary(palmerpenguins)\n```\n:::\n\n\n## Hypothesis Framing Aside\n\nBefore we dive in, we should discuss two of the ways in which you can frame your hypothesis and the differences in interpretation and appropriate statistical tool(s) that follow from that choice. We'll restrict our conversation here to **two alternate modes of thinking about your hypothesis: frequentist statistics versus multi-model inference.**\n\nNote that this is something of a false dichotomy as tools from both worlds can be/are frequently used to complement one another. 
However, many graduate students are trained by instructors with strong feelings about one method _in opposition to_ the other so it is worthwhile to consider these two paths separately even if you wind up using components of both in your own work.\n\n::::{.panel-tabset}\n### Frequentist Inference\n\nHypotheses here are a question of whether a variable has a \"significant\" effect on another. \"Significant\" has a very precise meaning in this context that has to do with _p_-values. Fundamentally, these methods focus on whether the observed relationship in the data is likely to be observed by chance alone or not. Strong effects are less likely--though not impossible--to be observed due to random chance.\n\nIf your hypothesis can be summarized as something along the lines of 'we hypothesize that X affects Y' then frequentist inference may be a more appropriate methodology.\n\nFor the purposes of SSECR, **our discussion of frequentist inference will focus on mixed-effect models**.\n\n### Multi-Model Inference\n\nHypotheses here are a question of which variables explain the _most_ variation in the data. Methods in this framing are unconcerned--or at least less concerned than in frequentist inference--with the probability associated with a particular variable. Instead, these methods focus on which of a set of user-defined candidate models explains most of the noise in the data _even when that best model does not necessarily explain much of that variation in absolute terms_.\n\nIf your hypothesis can be summarized as something along the lines of 'we hypothesize that models including X explain more of the variation in Y than those that do not' then multi-model inference may be a more appropriate methodology.\n\nFor the purposes of SSECR, **our discussion of multi-model inference will focus on comparing model strengths with AIC**.\n\n::::\n\n## Mixed-Effects Models\n\nIn any statistical test there is at least one response variable (a.k.a. 
\"dependent\" variable) and some number of explanatory variables (a.k.a. \"independent\" variables). However, in biology our experiments often involve repeated sampling over time or at the same locations. These variables (time or site) are neither response nor explanatory variables but we might reasonably conclude that they affect our response and/or explanatory variables.\n\nIn essence we want to use a statistical tool that asks 'what is the effect of the explanatory variable(s) on the response _when the variation due to these non-variable considerations is accounted for_?' Such tests are called **mixed-effects models**. This name derives from considering explanatory variables \"fixed effects\" and non-explanatory/response variables as \"random effects\". Including both fixed and random effects thus creates a model with \"mixed effects.\"\n\n### Types of Random Effect\n\nThere are a few types of random effects but we can limit our conversation here to just two: random intercepts and random slopes.\n\n:::{.panel-tabset}\n#### Random Intercepts\n\nRandom intercepts should be used when you expect that the average response differs among levels of that variable but not in a way that changes the relationship between each level of this variable and the other variables (either fixed or random). In statistical terms you want to allow the intercept to change with levels of this variable.\n\nFor example, let's imagine that we are studying the effect of different organic farming practices on beneficial insect populations. We build relationships with several organic farmers willing to let us conduct this research on their properties and sample the insect communities at each farm over the course of a summer. However, we know that each farm is surrounded by a different habitat type that affects the composition of the local insect community. 
It is reasonable to expect that even farms where 'the same' management method is used are likely to differ because of this difference in landscape context.\n\nIn cases like this, we don't want to include a term for 'site' as a fixed effect but we do want to account for those differences so that our assessment of the significance of our explanatory variables isn't limited by the variation due to site.\n\n#### Random Slopes\n\nRandom slopes should be used when you expect that the average response differs among levels of that variable in a way that does change with other variables.\n\nFor example, let's imagine that we are studying the effect of temperature on avian malaria rates in songbirds. We identify several sites--along a gradient of daytime temperature ranges--where our species of interest can be found, capture them, and measure malaria infection rates. However, unbeknownst to us at the start of our study, our study sites have varying populations of dragonflies which affects local mosquito populations and malaria transmission/infection rates. If we revisit our sites repeatedly for several years it is reasonable to expect that this difference among sites likely affects the _relationship between_ daytime temperatures and songbird malaria rates.\n\nBy including site as a random slope in this context, we can account for this effect and still analyze our explanatory variables of interest. Note that random slopes are _very_ \"data hungry\" so you may not be able to use them without very high replication in your study design.\n:::\n\n### Nested Random Effects\n\nTo further complicate matters, we can use nested random effects as well. These can be either random intercepts or random slopes though they are more commonly seen with random intercepts. 
A nested random effect accounts for the effect of one random variable that _is itself affected by another variable!_ A classic example of this is when a study design uses two (or more) levels of spatial nestedness in their experimental design.\n\nFor instance, let's imagine we were conducting a global study of marine plankton biodiversity. To gather these data we took several cruises (scientific not--exclusively--pleasure) at different places around the world and during each cruise we followed a set of transects. In each transect we did several plankton tows and quantified the diversity of each tow. We can reasonably assume the following:\n\n1. Each cruise differs from each other cruise (due to any number of climatic/ecological factors)\n - But cruises within the same part of the world are still likely to have similar planktonic communities\n2. Within each cruise, each transect differs from the others (again, due to unpreventable factors)\n - But transects within the same cruise are still likely to be more similar to one another than to transects in different cruises (even other ones in the same region!)\n3. Within each transect, each plankton tow differs from one another!\n - But again, more similar to other tows in the same transect than other tows in different transects/cruises\n\nIf we put these assumptions together we realize we want to account for the variation of cruise, transect, and tow while still retaining the nestedness of the similarity among samples. A nested random effect where transect is nested inside of cruise and tow is nested inside of transect would capture this effectively!\n\n### Philosophical Note: Random vs. Fixed\n\nDeciding whether a given variable should be a fixed or random effect can be tough. You'll likely need to rely on your scientific intuition about which feels more appropriate and then be prepared to defend that decision to your committee and/or \"reviewer \\#2\". 
It may prove helpful though to consider whether you 'care' about the effect of that variable. \n\nIf your hypothesis includes that variable then it should likely be a fixed effect. If the variable is just a facet of your experimental design but isn't something you're necessarily interested in testing, then it should likely be a random effect. And, once you've made your decision, it is totally okay to change your mind and tweak the structure of your model!\n\n:::{.callout-warning icon=\"false\"}\n#### Discussion: Random versus Fixed Effects\n\nWith a small group, decide whether you think the terms in the examples below should be fixed effects or random effects:\n\n- You study small mammal populations in urban settings\n - Should 'proportion green space' be a fixed effect or a random effect?\n- You are part of a team studying leopard seal feeding behavior\n - What type of effect should 'observer' be?\n- You study the gut microbiota of a particular beetle species\n - Should 'beetle sex' be a fixed or a random effect?\n - What about beetle life stage (e.g., larva versus adult)?\n - What about the region of the gut from which the samples were taken?\n- You study vascular plant chemical defenses against herbivory\n - Should phylogeny (i.e., evolutionary relatedness) be a fixed or random effect?\n - What about feeding guild of the herbivore?\n:::\n\n### Mixed-Effects Case Study\n\nLet's imagine we are researching tarantula populations for several years in the Chihuahuan Desert. Our hypothesis is that the number of tarantulas will be greater in sites further from the nearest road. We select ten study sites of varying distances from the nearest road and intensively count our furry friends at three plots within each site for several months. We return to our sites--and their associated plots--and repeat this process each year for three years. 
In the second year we have help from a new member of our lab but in the third year we're back to working alone (they had their own project to handle by then). We enter our data and perform careful quality control to get it into a tidy format ready for analysis.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read in data\ntarantula_df <- read.csv(file = file.path(\"data\", \"tarantulas.csv\"))\n\n# Check structure\nstr(tarantula_df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t450 obs. of 6 variables:\n $ year : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...\n $ road.dist_km : num 26.9 26.9 26.9 26.9 26.9 ...\n $ site : chr \"site_A\" \"site_A\" \"site_A\" \"site_A\" ...\n $ plot : chr \"plot_a\" \"plot_a\" \"plot_a\" \"plot_a\" ...\n $ site.plot : chr \"A_a\" \"A_a\" \"A_a\" \"A_a\" ...\n $ tarantula_count: int 12 18 18 9 18 51 45 60 45 63 ...\n```\n\n\n:::\n:::\n\n\nWith our data in hand, we now want to run some statistical tests and--hopefully--get some endorphin-inducingly small _p_-values. If we choose to simply ignore our possible random effects, we could fit a linear regression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit model\ntarantula_lm <- lm(tarantula_count ~ road.dist_km, data = tarantula_df) # <1>\n\n# Extract summary\nsummary(tarantula_lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = tarantula_count ~ road.dist_km, data = tarantula_df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-180.360 -67.508 3.415 60.415 273.640 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 4.3262 13.1676 0.329 0.743 \nroad.dist_km 2.6306 0.2745 9.582 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 90.55 on 448 degrees of freedom\nMultiple R-squared: 0.1701,\tAdjusted R-squared: 0.1682 \nF-statistic: 91.82 on 1 and 448 DF, p-value: < 2.2e-16\n```\n\n\n:::\n:::\n\n1. R syntax for statistical tests is `response ~ explanatory` a.k.a. 
`Y ~ X`\n\nThis naive test seems to support our hypothesis. However, sampling effort differed between the three study years. Not only was there a second person in the second year but we can also reasonably expect that by the third year in this system we had greatly improved our tarantula-finding skills. So, a random effect of year is definitely justified. We are not concerned that the different study years will affect the relationship between tarantula populations and road distance though so a random intercept is fine.\n\nThere could be an argument for including year as a fixed effect in its own right but some preliminary investigations reveal no significant climatic differences across the region we worked in those three years. So, while we think that years may differ from one another, that difference is not something we care to analyze.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit the new model\ntarantula_mem1 <- lmerTest::lmer(tarantula_count ~ road.dist_km + \n (1|year), # <1>\n data = tarantula_df)\n\n# Extract summary\nsummary(tarantula_mem1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLinear mixed model fit by REML. t-tests use Satterthwaite's method [\nlmerModLmerTest]\nFormula: tarantula_count ~ road.dist_km + (1 | year)\n Data: tarantula_df\n\nREML criterion at convergence: 5272.9\n\nScaled residuals: \n Min 1Q Median 3Q Max \n-2.5660 -0.8039 0.1522 0.6220 2.7981 \n\nRandom effects:\n Groups Name Variance Std.Dev.\n year (Intercept) 1548 39.34 \n Residual 7163 84.64 \nNumber of obs: 450, groups: year, 3\n\nFixed effects:\n Estimate Std. Error df t value Pr(>|t|) \n(Intercept) 4.3262 25.8335 3.1485 0.167 0.877 \nroad.dist_km 2.6306 0.2566 446.0000 10.252 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCorrelation of Fixed Effects:\n (Intr)\nroad.dst_km -0.451\n```\n\n\n:::\n:::\n\n1. 
This is the syntax for specifying a random intercept (random slope variables should be before the `|` where `1` goes for a random intercept)\n\nBy including that random effect we actually get a slightly stronger effect of road distance (t-value of ~9.6 without versus ~10.3 with). This is because our new random effect accounts for some of the 'noise' between study years. That actually gives us a better picture of the relationship between our response and explanatory variables.\n\nNow that we're already using a mixed-effects model, we have little excuse not to account for the other potential random effect: plot! Remember that there were three plots within each site and from our extensive time in the field we have developed a strong intuition that there might be substantial among-plot variation at each site. We can make a quick exploratory graph to facilitate an 'eyeball test' of whether the data show what our intuition suggests.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(tarantula_df, aes(y = tarantula_count, x = plot, fill = plot)) +\n geom_violin(alpha = 0.5) + # <1>\n geom_jitter(width = 0.25, size = 0.5) +\n facet_wrap(site ~ .) +\n theme_bw() +\n theme(axis.text.x = element_text(angle = 35, hjust = 1)) # <2>\n```\n\n::: {.cell-output-display}\n![](mod_stats_files/figure-html/mem-explore-graph-1.png){fig-align='center' width=672}\n:::\n:::\n\n1. Violin plots are a nice alternative to boxplots because they allow visualizing data distributions directly rather than requiring an intuitive grasp of the distribution metrics described by each bit of a boxplot\n2. This allows us to 'tilt' the X axis tick labels so they don't overlap with one another\n\nThis graph clearly supports our intuition that among-plot variation is dramatic! We _could_ account for this by including plot as a fixed effect but we'll need to sacrifice a lot of degrees of freedom (can be thought of as \"statistical power\") for a variable that we don't actually care about. 
Instead, we could include plot as another random effect.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit the new model\ntarantula_mem2 <- lmerTest::lmer(tarantula_count ~ road.dist_km + \n (1|year) + (1|site.plot), # <1>\n data = tarantula_df)\n\n# Extract summary\nsummary(tarantula_mem2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLinear mixed model fit by REML. t-tests use Satterthwaite's method [\nlmerModLmerTest]\nFormula: tarantula_count ~ road.dist_km + (1 | year) + (1 | site.plot)\n Data: tarantula_df\n\nREML criterion at convergence: 4471.1\n\nScaled residuals: \n Min 1Q Median 3Q Max \n-3.2684 -0.5698 -0.0689 0.5857 3.3902 \n\nRandom effects:\n Groups Name Variance Std.Dev.\n site.plot (Intercept) 6671.0 81.68 \n year (Intercept) 1589.6 39.87 \n Residual 881.2 29.69 \nNumber of obs: 450, groups: site.plot, 30; year, 3\n\nFixed effects:\n Estimate Std. Error df t value Pr(>|t|) \n(Intercept) 4.3262 51.6178 23.3405 0.084 0.9339 \nroad.dist_km 2.6306 0.9633 28.0001 2.731 0.0108 *\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCorrelation of Fixed Effects:\n (Intr)\nroad.dst_km -0.847\n```\n\n\n:::\n:::\n\n1. Note that we need to use this column as the random effect because plots are not uniquely named across sites (i.e., all sites have plots \"a\", \"b\", and \"c\"). Making the random effect just the 'plot' column would fail to reflect how plots are nested within each site\n\nThis test reveals that there is still a significant relationship between road distance and tarantula populations, but the effect is not nearly as strong as it was when plot-level variation was ignored. This is likely due to high (or low) average populations in a single plot skewing the site-level average. 
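\n\nBefore moving on, it can be instructive to see how much residual noise the plot-level term absorbed. A quick sketch (assuming the `tarantula_mem1` and `tarantula_mem2` objects fit above are still in your environment):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Compare random effect standard deviations before/after adding the plot term\nlme4::VarCorr(tarantula_mem1) # Residual SD is ~85 without the plot term\nlme4::VarCorr(tarantula_mem2) # Residual SD drops to ~30 once 'site.plot' is included\n```\n:::\n\n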
Still, this is a result we can be more confident in because we've now accounted for all known sources of variation in our data--either by including them as fixed effects or including them as random effects.\n\nWe can create one more graph of our tidy data and use some aesthetic settings to make sure the nested structure of the data is clear to those looking at our work. Note that you could also use predicted values from the model itself though that choice is--arguably--a matter of personal preference.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(tarantula_df, aes(y = tarantula_count, x = road.dist_km)) +\n geom_point(aes(color = plot, shape = as.factor(year)), size = 2, alpha = 0.5) +\n geom_smooth(method = \"lm\", formula = \"y ~ x\", se = T, color = \"black\") +\n labs(y = \"Tarantula Abundance\", x = \"Distance to Nearest Road (km)\") +\n theme_bw()\n```\n\n::: {.cell-output-display}\n![](mod_stats_files/figure-html/mem-final-graph-1.png){fig-align='center' width=672}\n:::\n:::\n\n\n## Multi-Model Inference\n\nRegardless of your choice of statistical test, multi-model inference may be an appropriate method to use to assess your hypothesis. As stated earlier, this frames your research question as a case of which variables _best_ explain the data rather than the likelihood of the observed effect relating to any variable in particular. \n\nTo begin, it can be helpful to write out all possible \"candidate models\". For instance, let's say that you measured some response variable (Y) and several potential explanatory variables (X, W, and Z). We would then fit the following candidate models:\n\n1. X alone explains the most variation in Y\n2. W alone explains the most variation in Y\n3. Z alone explains the most variation in Y\n4. X, W, and Z together explain the most variation in Y\n\nWe might also fit other candidate models for pairs of X, W, and Z but for the sake of simplicity in this hypothetical we'll skip those. 
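\n\nAs a sketch of what fitting that candidate set might look like in code (using hypothetical objects `y`, `x`, `w`, `z` in a data frame `df`--not data from this module):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Each candidate model is fit separately and assigned to its own object\nmod_x <- lm(y ~ x, data = df)\nmod_w <- lm(y ~ w, data = df)\nmod_z <- lm(y ~ z, data = df)\nmod_xwz <- lm(y ~ x + w + z, data = df)\n```\n:::\n\n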
Note that for this method to be appropriate you need to fit the same type of model in all cases!\n\nOnce we've fit all of our models and assigned them to objects, we can use the `AIC` function included in base R to compare the AIC score of each model. \"AIC\" stands for Akaike (_AH-kuh-ee-kay_) Information Criterion and is one of several related information criteria for summarizing a model's explanatory power. Models with more parameters are penalized to make it mathematically possible for a model with fewer explanatory variables to still do a better job capturing the variation in the data.\n\nThe model with the _lowest_ AIC best explains the data. Technically any difference in AIC indicates model improvement but many scientists use a rule of thumb of a difference of 2. So, if two models have AIC scores that differ by less than 2, you can safely say that they have comparable explanatory power. That is definitely a semi-arbitrary threshold but so is the 0.05 threshold for _p_-value \"significance\".\n\n### AIC Case Study\n\nLet's check out an example using AIC to compare the strengths of several models. Rather than using simulated data--as we did earlier in the mixed-effect model section--we'll use some real penguin data included in the `palmerpenguins` package.\n\nThis dataset includes annual data on three penguin species spread across several islands. 
The sex of the penguins was also recorded in addition to the length of their flippers, body mass, and bill length and depth.\n\nFor the purposes of this example, our research question is as follows: **what factors best explain penguin body mass?**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the penguins data from the `palmerpenguins` package\ndata(penguins)\n\n# Make a version where no NAs are allowed\npeng_complete <- penguins[complete.cases(penguins), ] # <1>\n\n# Check the structure of it\ndplyr::glimpse(peng_complete)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nRows: 333\nColumns: 8\n$ species Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…\n$ island Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…\n$ bill_length_mm 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…\n$ bill_depth_mm 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…\n$ flipper_length_mm 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…\n$ body_mass_g 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…\n$ sex male, female, female, female, male, female, male, fe…\n$ year 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…\n```\n\n\n:::\n:::\n\n1. This is a base R way of keeping only rows that have _no_ `NA` values in any column. 
It is better to identify and handle `NA`s more carefully but for this context we just want to have the same number of observations in each model\n\nWith our data in hand and research question in mind, we can fit several candidate models that our scientific intuition and the published literature support as probable, then compare them with AIC.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Species and sex\nmod_spp <- lm(body_mass_g ~ species + sex, data = peng_complete)\n\n# Island alone\nmod_isl <- lm(body_mass_g ~ island, data = peng_complete)\n\n# Combination of species and island\nmod_eco <- lm(body_mass_g ~ island + species + sex, data = peng_complete)\n\n# Body characteristics alone\nmod_phys <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm,\n data = peng_complete)\n\n# Global model\nmod_sink <- lm(body_mass_g ~ island + species + sex + # <1>\n flipper_length_mm + bill_length_mm + bill_depth_mm,\n data = peng_complete)\n```\n:::\n\n1. We've named the global model \"sink\" because of the American idiom \"everything but the kitchen sink.\" It is used in cases where everything that can be included has been\n\nOnce we've fit all of these models, we can use the `AIC` function from base R (technically from the `stats` package included in base R).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Compare models\nAIC(mod_spp, mod_isl, mod_eco, mod_phys, mod_sink) %>% \n dplyr::arrange(AIC) # <1>\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n df AIC\nmod_sink 10 4727.242\nmod_spp 5 4785.594\nmod_eco 7 4789.480\nmod_phys 5 4929.554\nmod_isl 4 5244.224\n```\n\n\n:::\n:::\n\n1. Unfortunately, the `AIC` function doesn't sort by AIC score automatically so we're using the `arrange` function to make it easier for us to rank models by their AIC scores\n\nInterestingly, it looks like the best model (i.e., the one that best explains the variation in the data) is the global model that included most of the available variables. 
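\n\nIf you'd like the AIC differences explicitly, you can append a 'delta AIC' column yourself (a sketch; assumes the model objects fit above are still in your environment):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Rank models by AIC, then compute each model's difference from the best (lowest) score\naic_tab <- AIC(mod_spp, mod_isl, mod_eco, mod_phys, mod_sink)\naic_tab <- aic_tab[order(aic_tab$AIC), ]\naic_tab$delta_AIC <- aic_tab$AIC - min(aic_tab$AIC)\n```\n:::\n\n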
As stated earlier, it is not always the case that the model with the most parameters has the lowest AIC, so we can be confident this is a \"real\" result. The difference between that one and the next (incidentally the model where only species and sex are included as explanatory variables) is much larger than 2 so we can be confident that the global model is much better than the next best.\n\nWith this result your interpretation would be that penguin body mass is better explained by a combination of species, sex, physical characteristics of the individual penguin, and the penguin's home island than it is by any of the other candidate models. In a publication you'd likely want to report this entire AIC table (either parenthetically or in a table) so that reviewers could evaluate your logic.\n\n## Meta-Analysis\n\n\n\n## Additional Resources\n\n### Papers & Documents\n\n- [Understanding ‘It Depends’ in Ecology: A Guide to Hypothesising, Visualising and Interpreting Statistical Interactions](https://onlinelibrary.wiley.com/doi/10.1111/brv.12939). Spake _et al._, 2023. **Biological Reviews** \n- [Improving Quantitative Synthesis to Achieve Generality in Ecology](https://www.nature.com/articles/s41559-022-01891-z). Spake _et al._, 2022. **Nature Ecology and Evolution**\n- [Doing Meta-Analysis with R: A Hands-On Guide](https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/). Harrer _et al._, 2023.\n- [Mixed Effects Models and Extensions in Ecology with R](https://link.springer.com/book/10.1007/978-0-387-87458-6). 
Zuur _et al._, 2009.\n\n### Workshops & Courses\n\n- Matt Vuorre's [Bayesian Meta-Analysis with R, Stan, and brms](https://mvuorre.github.io/posts/2016-09-29-bayesian-meta-analysis/)\n\n### Websites\n\n- \n", + "markdown": "---\ntitle: \"Analysis & Modeling\"\ncode-annotations: hover\n---\n\n\n## Overview\n\nGiven the wide range in statistical training in graduate curricula (and corresponding breadth of experience among early career researchers), we'll be approaching this module by splitting it into two halves.\n\n1. First half: a \"flipped approach\" where project teams will share their proposed analyses with one another\n2. Second half: typical instructional module dedicated to **analyses that are more common in--or exclusive to--synthesis research**. \n\nContent produced by project teams during the flipped half may be linked in the '[Additional Resources](https://lter.github.io/ssecr/mod_stats.html#additional-resources)' section at the bottom of this module at the discretion of each team. 
Otherwise the content of this module will focus only on the non-flipped content.\n\n## Learning Objectives\n\nAfter completing this module you will be able to: \n\n- Describe proposed analytical methods to an interested audience of mixed prior experience\n- Explain nuance in interpretation of results of proposed analyses\n- Identify some statistical tests common in synthesis research\n- Perform some synthesis-specific analyses\n\n## Needed Packages\n\nIf you'd like to follow along with the code chunks included throughout this module, you'll need to install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Note that these lines only need to be run once per computer\n## So you can skip this step if you've installed these before\ninstall.packages(\"tidyverse\")\ninstall.packages(\"lmerTest\")\ninstall.packages(\"palmerpenguins\")\n```\n:::\n\n\nWe'll go ahead and load some of these libraries as well to be able to better demonstrate these concepts.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load needed libraries\nlibrary(tidyverse)\nlibrary(lmerTest)\nlibrary(palmerpenguins)\n```\n:::\n\n\n## Hypothesis Framing Aside\n\nBefore we dive in, we should discuss two of the ways in which you can frame your hypothesis and the differences in interpretation and appropriate statistical tool(s) that follow from that choice. We'll restrict our conversation here to **two alternate modes of thinking about your hypothesis: frequentist statistics versus multi-model inference.**\n\nNote that this is something of a false dichotomy as tools from both worlds are frequently used to complement one another. 
However, many graduate students are trained by instructors with strong feelings about one method _in opposition to_ the other so it is worthwhile to consider these two paths separately even if you wind up using components of both in your own work.\n\n::::{.panel-tabset}\n### Frequentist Inference\n\nHypotheses here are a question of whether a variable has a \"significant\" effect on another. \"Significant\" has a very precise meaning in this context that has to do with _p_-values. Fundamentally, these methods focus on whether the observed relationship in the data is likely to be observed by chance alone or not. Strong effects are less likely--though not impossible--to be observed due to random chance.\n\nIf your hypothesis can be summarized as something along the lines of 'we hypothesize that X affects Y' then frequentist inference may be a more appropriate methodology.\n\nFor the purposes of SSECR, **our discussion of frequentist inference will focus on mixed-effect models**.\n\n### Multi-Model Inference\n\nHypotheses here are a question of which variables explain the _most_ variation in the data. Methods in this framing are unconcerned--or at least less concerned than in frequentist inference--with the probability associated with a particular variable. Instead, these methods focus on which of a set of user-defined candidate models explains most of the noise in the data _even when that best model does not necessarily explain much of that variation in absolute terms_.\n\nIf your hypothesis can be summarized as something along the lines of 'we hypothesize that models including X explain more of the variation in Y than those that do not' then multi-model inference may be a more appropriate methodology.\n\nFor the purposes of SSECR, **our discussion of multi-model inference will focus on comparing model strengths with AIC**.\n\n::::\n\n## Mixed-Effects Models\n\nIn any statistical test there is at least one response variable (a.k.a. 
\"dependent\" variable) and some number of explanatory variables (a.k.a. \"independent\" variables). However, in biology our experiments often involve repeated sampling over time or at the same locations. These variables (time or site) are neither response nor explanatory variables but we might reasonably conclude that they affect our response and/or explanatory variables.\n\nIn essence we want to use a statistical tool that asks 'what is the effect of the explanatory variable(s) on the response _when the variation due to these non-variable considerations is accounted for_?' Such tests are called **mixed-effects models**. This name derives from considering explanatory variables \"fixed effects\" and non-explanatory/response variables as \"random effects\". Including both fixed and random effects thus creates a model with \"mixed effects.\"\n\n### Types of Random Effect\n\nThere are a few types of random effects but we can limit our conversation here to just two: random intercepts and random slopes.\n\n:::{.panel-tabset}\n#### Random Intercepts\n\nRandom intercepts should be used when you expect that the average response differs among levels of that variable but not in a way that changes the relationship between each level of this variable and the other variables (either fixed or random). In statistical terms you want to allow the intercept to change with levels of this variable.\n\nFor example, let's imagine that we are studying the effect of different organic farming practices on beneficial insect populations. We build relationships with several organic farmers willing to let us conduct this research on their properties and sample the insect communities at each farm over the course of a summer. However, we know that each farm is surrounded by a different habitat type that affects the composition of the local insect community. 
It is reasonable to expect that even farms where 'the same' management method is used are likely to differ because of this difference in landscape context.\n\nIn cases like this, we don't want to include a term for 'site' as a fixed effect but we do want to account for those differences so that our assessment of the significance of our explanatory variables isn't limited by the variation due to site.\n\n#### Random Slopes\n\nRandom slopes should be used when you expect that the average response differs among levels of that variable in a way that does change with other variables.\n\nFor example, let's imagine that we are studying the effect of temperature on avian malaria rates in songbirds. We identify several sites--along a gradient of daytime temperature ranges--where our species of interest can be found, capture them, and measure malaria infection rates. However, unbeknownst to us at the start of our study, our study sites have varying populations of dragonflies which affect local mosquito populations and malaria transmission/infection rates. If we revisit our sites repeatedly for several years, it is reasonable to expect that this difference among sites likely affects the _relationship between_ daytime temperatures and songbird malaria rates.\n\nBy including site as a random slope in this context, we can account for this effect and still analyze our explanatory variables of interest. Note that random slopes are _very_ \"data hungry\" so you may not be able to use them without very high replication in your study design.\n:::\n\n### Nested Random Effects\n\nTo further complicate matters, we can use nested random effects as well. These can be either random intercepts or random slopes though they are more commonly seen with random intercepts. 
A nested random effect accounts for the effect of one random variable that _is itself affected by another variable!_ A classic example of this is when a study design uses two (or more) levels of spatial nestedness in their experimental design.\n\nFor instance, let's imagine we were conducting a global study of marine plankton biodiversity. To gather these data we took several cruises (scientific not--exclusively--pleasure) at different places around the world and during each cruise we followed a set of transects. In each transect we did several plankton tows and quantified the diversity of each tow. We can reasonably assume the following:\n\n1. Each cruise differs from each other cruise (due to any number of climatic/ecological factors)\n - But cruises within the same part of the world are still likely to have similar planktonic communities\n2. Within each cruise, each transect differs from the others (again, due to unpreventable factors)\n - But transects within the same cruise are still likely to be more similar to one another than to transects in different cruises (even other ones in the same region!)\n3. Within each transect, each plankton tow differs from the others!\n - But again, more similar to other tows in the same transect than other tows in different transects/cruises\n\nIf we put these assumptions together we realize we want to account for the variation of cruise, transect, and tow while still retaining the nestedness of the similarity among samples. A nested random effect where transect is nested inside of cruise and tow is nested inside of transect would capture this effectively!\n\n### Philosophical Note: Random vs. Fixed\n\nDeciding whether a given variable should be a fixed or random effect can be tough. You'll likely need to rely on your scientific intuition about which feels more appropriate and then be prepared to defend that decision to your committee and/or \"reviewer \\#2\". 
It may prove helpful though to consider whether you 'care' about the effect of that variable. \n\nIf your hypothesis includes that variable then it should likely be a fixed effect. If the variable is just a facet of your experimental design but isn't something you're necessarily interested in testing, then it should likely be a random effect. And, once you've made your decision, it is totally okay to change your mind and tweak the structure of your model!\n\n:::{.callout-warning icon=\"false\"}\n#### Discussion: Random versus Fixed Effects\n\nWith a small group, decide whether you think the terms in the examples below should be fixed effects or random effects:\n\n- You study small mammal populations in urban settings\n - Should 'proportion green space' be a fixed effect or a random effect?\n- You are part of a team studying leopard seal feeding behavior\n - What type of effect should 'observer' be?\n- You study the gut microbiota of a particular beetle species\n - Should 'beetle sex' be a fixed or a random effect?\n - What about beetle life stage (e.g., larva versus adult)?\n - What about the region of the gut from which the samples were taken?\n- You study vascular plant chemical defenses against herbivory\n - Should phylogeny (i.e., evolutionary relatedness) be a fixed or random effect?\n - What about feeding guild of the herbivore?\n:::\n\n### Mixed-Effects Case Study\n\nLet's imagine we are researching tarantula populations for several years in the Chihuahuan Desert. Our hypothesis is that the number of tarantulas will be greater in sites further from the nearest road. We select ten study sites of varying distances from the nearest road and intensively count our furry friends at three plots within each site for several months. We return to our sites--and their associated plots--and repeat this process each year for three years. 
In the second year we have help from a new member of our lab but in the third year we're back to working alone (they had their own project to handle by then). We enter our data and perform careful quality control to get it into a tidy format ready for analysis.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read in data\ntarantula_df <- read.csv(file = file.path(\"data\", \"tarantulas.csv\"))\n\n# Check structure\nstr(tarantula_df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t450 obs. of 6 variables:\n $ year : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...\n $ road.dist_km : num 26.9 26.9 26.9 26.9 26.9 ...\n $ site : chr \"site_A\" \"site_A\" \"site_A\" \"site_A\" ...\n $ plot : chr \"plot_a\" \"plot_a\" \"plot_a\" \"plot_a\" ...\n $ site.plot : chr \"A_a\" \"A_a\" \"A_a\" \"A_a\" ...\n $ tarantula_count: int 12 18 18 9 18 51 45 60 45 63 ...\n```\n\n\n:::\n:::\n\n\nWith our data in hand, we now want to run some statistical tests and--hopefully--get some endorphin-inducingly small _p_-values. If we choose to simply ignore our possible random effects, we could fit a linear regression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit model\ntarantula_lm <- lm(tarantula_count ~ road.dist_km, data = tarantula_df) # <1>\n\n# Extract summary\nsummary(tarantula_lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = tarantula_count ~ road.dist_km, data = tarantula_df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-180.360 -67.508 3.415 60.415 273.640 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 4.3262 13.1676 0.329 0.743 \nroad.dist_km 2.6306 0.2745 9.582 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 90.55 on 448 degrees of freedom\nMultiple R-squared: 0.1701,\tAdjusted R-squared: 0.1682 \nF-statistic: 91.82 on 1 and 448 DF, p-value: < 2.2e-16\n```\n\n\n:::\n:::\n\n1. R syntax for statistical tests is `response ~ explanatory` a.k.a. 
`Y ~ X`\n\nThis naive test seems to support our hypothesis. However, sampling effort differed between the three study years. Not only was there a second person in the second year but we can also reasonably expect that by the third year in this system we had greatly improved our tarantula-finding skills. So, a random effect of year is definitely justified. We are not concerned that the different study years will affect the relationship between tarantula populations and road distance though so a random intercept is fine.\n\nThere could be an argument for including year as a fixed effect in its own right but some preliminary investigations reveal no significant climatic differences across the region we worked in those three years. So, while we think that years may differ from one another, that difference is not something we care to analyze.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit the new model\ntarantula_mem1 <- lmerTest::lmer(tarantula_count ~ road.dist_km + \n (1|year), # <1>\n data = tarantula_df)\n\n# Extract summary\nsummary(tarantula_mem1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLinear mixed model fit by REML. t-tests use Satterthwaite's method [\nlmerModLmerTest]\nFormula: tarantula_count ~ road.dist_km + (1 | year)\n Data: tarantula_df\n\nREML criterion at convergence: 5272.9\n\nScaled residuals: \n Min 1Q Median 3Q Max \n-2.5660 -0.8039 0.1522 0.6220 2.7981 \n\nRandom effects:\n Groups Name Variance Std.Dev.\n year (Intercept) 1548 39.34 \n Residual 7163 84.64 \nNumber of obs: 450, groups: year, 3\n\nFixed effects:\n Estimate Std. Error df t value Pr(>|t|) \n(Intercept) 4.3262 25.8335 3.1485 0.167 0.877 \nroad.dist_km 2.6306 0.2566 446.0000 10.252 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCorrelation of Fixed Effects:\n (Intr)\nroad.dst_km -0.451\n```\n\n\n:::\n:::\n\n1. 
This is the syntax for specifying a random intercept; to fit a random slope instead, you would put the slope variable before the `|` in place of the `1`\n\nBy including that random effect we actually get a slightly stronger effect of road distance (_t_ value of ~9.6 without versus ~10.3 with). This is because our new random effect accounts for some of the 'noise' among study years. That actually gives us a better picture of the relationship between our response and explanatory variables.\n\nNow that we're already using a mixed-effects model, we have little excuse not to account for the other potential random effect: plot! Remember that there were three plots within each site and from our extensive time in the field we have developed a strong intuition that there might be substantial among-plot variation at each site. We can make a quick exploratory graph to facilitate an 'eyeball test' of whether the data show what our intuition suggests.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(tarantula_df, aes(y = tarantula_count, x = plot, fill = plot)) +\n geom_violin(alpha = 0.5) + # <1>\n geom_jitter(width = 0.25, size = 0.5) +\n facet_wrap(site ~ .) +\n theme_bw() +\n theme(axis.text.x = element_text(angle = 35, hjust = 1)) # <2>\n```\n\n::: {.cell-output-display}\n![](mod_stats_files/figure-html/mem-explore-graph-1.png){fig-align='center' width=672}\n:::\n:::\n\n1. Violin plots are a nice alternative to boxplots because they allow visualizing data distributions directly rather than requiring an intuitive grasp of the distribution metrics described by each bit of a boxplot\n2. This 'tilts' the X axis tick labels so they don't overlap with one another\n\nThis graph clearly supports our intuition that among-plot variation is dramatic! We _could_ account for this by including plot as a fixed effect but we'd need to sacrifice a lot of degrees of freedom (which can be thought of as \"statistical power\") for a variable that we don't actually care about. 
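To put a number on that sacrifice--a quick sketch assuming the `tarantula_df` object built above--treating plot as a fixed effect would cost one estimated parameter for every unique site-plot combination beyond the first:

```r
# Each unique site-plot combination beyond the first would consume
# one additional degree of freedom if plot were treated as a fixed effect
length(unique(tarantula_df$site.plot)) - 1
```

With the 30 site-plot combinations in these data, that is 29 extra parameters spent on a variable we do not actually want to interpret.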
Instead, we could include plot as another random effect.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit the new model\ntarantula_mem2 <- lmerTest::lmer(tarantula_count ~ road.dist_km + \n (1|year) + (1|site.plot), # <1>\n data = tarantula_df)\n\n# Extract summary\nsummary(tarantula_mem2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLinear mixed model fit by REML. t-tests use Satterthwaite's method [\nlmerModLmerTest]\nFormula: tarantula_count ~ road.dist_km + (1 | year) + (1 | site.plot)\n Data: tarantula_df\n\nREML criterion at convergence: 4471.1\n\nScaled residuals: \n Min 1Q Median 3Q Max \n-3.2684 -0.5698 -0.0689 0.5857 3.3902 \n\nRandom effects:\n Groups Name Variance Std.Dev.\n site.plot (Intercept) 6671.0 81.68 \n year (Intercept) 1589.6 39.87 \n Residual 881.2 29.69 \nNumber of obs: 450, groups: site.plot, 30; year, 3\n\nFixed effects:\n Estimate Std. Error df t value Pr(>|t|) \n(Intercept) 4.3262 51.6178 23.3405 0.084 0.9339 \nroad.dist_km 2.6306 0.9633 28.0001 2.731 0.0108 *\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCorrelation of Fixed Effects:\n (Intr)\nroad.dst_km -0.847\n```\n\n\n:::\n:::\n\n1. Note that we need to use this column as the random effect because plots are not uniquely named across sites (i.e., all sites have plots \"a\", \"b\", and \"c\"). Making the random effect just the 'plot' column would fail to reflect how plots are nested within each site\n\nThis test reveals that while there is still a significant relationship between road distance and tarantula population, the effect is not nearly as strong as it was when we ignored plot-level variation. This is likely due to high (or low) average populations in a single plot skewing the site-level average. 
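One way to see just how much variation the plot-level effect absorbs--a quick sketch using the `tarantula_mem2` model fit above--is to pull out the variance components directly:

```r
# Extract the variance attributed to each random effect
# (site.plot absorbs far more variation than year in these data)
lme4::VarCorr(tarantula_mem2)
```

These are the same estimates reported in the 'Random effects' portion of the model summary, just without the rest of the output.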
Still, this is a result we can be more confident in because we've now accounted for all known sources of variation in our data--either by including them as fixed effects or as random effects.\n\nWe can create one more graph of our tidy data and use some aesthetic settings to make sure the nested structure of the data is clear to those looking at our work. Note that you could also use predicted values from the model itself though that choice is--arguably--a matter of personal preference.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(tarantula_df, aes(y = tarantula_count, x = road.dist_km)) +\n geom_point(aes(color = plot, shape = as.factor(year)), size = 2, alpha = 0.5) +\n geom_smooth(method = \"lm\", formula = \"y ~ x\", se = T, color = \"black\") +\n labs(y = \"Tarantula Abundance\", x = \"Distance to Nearest Road (km)\") +\n theme_bw()\n```\n\n::: {.cell-output-display}\n![](mod_stats_files/figure-html/mem-final-graph-1.png){fig-align='center' width=672}\n:::\n:::\n\n\n## Multi-Model Inference\n\nRegardless of your choice of statistical test, multi-model inference may be an appropriate way to assess your hypothesis. As stated earlier, this frames your research question as a case of which variables _best_ explain the data rather than the likelihood of the observed effect relating to any variable in particular. \n\nTo begin, it can be helpful to write out all possible \"candidate models\". For instance, let's say that you measured some response variable (Y) and several potential explanatory variables (X, W, and Z). We would then fit the following candidate models:\n\n1. X alone explains the most variation in Y\n2. W alone explains the most variation in Y\n3. Z alone explains the most variation in Y\n4. X, W, and Z together explain the most variation in Y\n\nWe might also fit other candidate models for pairs of X, W, and Z but for the sake of simplicity in this hypothetical we'll skip those. 
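To make that list concrete, here is a sketch of how those four candidate models might be fit. Note that `df`, `y`, `x`, `w`, and `z` are hypothetical names standing in for your own data frame and columns:

```r
# Fit one model per candidate hypothesis (hypothetical names throughout)
# All four use the same model type so they can be compared fairly
mod_x <- lm(y ~ x, data = df)
mod_w <- lm(y ~ w, data = df)
mod_z <- lm(y ~ z, data = df)
mod_xwz <- lm(y ~ x + w + z, data = df)
```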
Note that for this method to be appropriate you need to fit the same type of model in all cases!\n\nOnce we've fit all of our models and assigned them to objects, we can use the `AIC` function included in base R to compare the AIC score of each model. \"AIC\" stands for Akaike (_AH-kuh-ee-kay_) Information Criterion and is one of several related information criteria for summarizing a model's explanatory power. Models with more parameters are penalized (formally, AIC = 2k - 2ln(L), where k is the number of estimated parameters and L is the model's maximized likelihood) to make it mathematically possible for a model with fewer explanatory variables to still do a better job capturing the variation in the data.\n\nThe model with the _lowest_ AIC best explains the data. Technically any difference in AIC favors the lower-scoring model but many scientists use a rule of thumb of a difference of 2. So, if two models have AIC scores that differ by less than 2, you can safely say that they have comparable explanatory power. That is definitely a semi-arbitrary threshold but so is the 0.05 threshold for _p_-value \"significance\".\n\n### AIC Case Study\n\nLet's check out an example using AIC to compare the strengths of several models. Rather than using simulated data--as we did earlier in the mixed-effect model section--we'll use some real penguin data included in the `palmerpenguins` package.\n\nThis dataset includes annual data on three penguin species spread across several islands. 
The sex of the penguins was also recorded in addition to the length of their flippers, body mass, and bill length and depth.\n\nFor the purposes of this example, our research question is as follows: **what factors best explain penguin body mass?**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the penguins data from the `palmerpenguins` package\ndata(penguins)\n\n# Make a version where no NAs are allowed\npeng_complete <- penguins[complete.cases(penguins), ] # <1>\n\n# Check the structure of it\ndplyr::glimpse(peng_complete)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nRows: 333\nColumns: 8\n$ species Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…\n$ island Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…\n$ bill_length_mm 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…\n$ bill_depth_mm 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…\n$ flipper_length_mm 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…\n$ body_mass_g 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…\n$ sex male, female, female, female, male, female, male, fe…\n$ year 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…\n```\n\n\n:::\n:::\n\n1. This is a base R way of keeping only rows that have _no_ `NA` values in any column. 
It is better to identify and handle `NA`s more carefully but for this context we just want to have the same number of observations in each model\n\nWith our data in hand and research question in mind, we can fit several candidate models that our scientific intuition and the published literature support as probable, then compare them with AIC.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Species and sex\nmod_spp <- lm(body_mass_g ~ species + sex, data = peng_complete)\n\n# Island alone\nmod_isl <- lm(body_mass_g ~ island, data = peng_complete)\n\n# Combination of species, island, and sex\nmod_eco <- lm(body_mass_g ~ island + species + sex, data = peng_complete)\n\n# Body characteristics alone\nmod_phys <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm,\n data = peng_complete)\n\n# Global model\nmod_sink <- lm(body_mass_g ~ island + species + sex + # <1>\n flipper_length_mm + bill_length_mm + bill_depth_mm,\n data = peng_complete)\n```\n:::\n\n1. We've named the global model \"sink\" because of the American idiom \"everything but the kitchen sink.\" It is used in cases where everything that can be included has been\n\nOnce we've fit all of these models, we can use the `AIC` function from base R (technically from the `stats` package).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Compare models\nAIC(mod_spp, mod_isl, mod_eco, mod_phys, mod_sink) %>% \n dplyr::arrange(AIC) # <1>\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n df AIC\nmod_sink 10 4727.242\nmod_spp 5 4785.594\nmod_eco 7 4789.480\nmod_phys 5 4929.554\nmod_isl 4 5244.224\n```\n\n\n:::\n:::\n\n1. Unfortunately, the `AIC` function doesn't sort by AIC score automatically so we're using the `arrange` function to make it easier for us to rank models by their AIC scores\n\nInterestingly, it looks like the best model (i.e., the one that best explains the data) is the global model that included most of the available variables. 
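To make the size of the gaps between models explicit, we could also calculate each model's distance from the lowest AIC score (a sketch building on the comparison above):

```r
# Add a 'delta AIC' column measuring each model's distance from the best score
AIC(mod_spp, mod_isl, mod_eco, mod_phys, mod_sink) %>% 
  dplyr::arrange(AIC) %>% 
  dplyr::mutate(delta_AIC = AIC - min(AIC))
```

Models with a `delta_AIC` of less than 2 would be considered comparable under the rule of thumb described above.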
As stated earlier, it is not always the case that the model with the most parameters has the lowest AIC, so we can be confident this is a \"real\" result. The difference between that one and the next (incidentally the model where only species and sex are included as explanatory variables) is much larger than 2, so the global model is clearly better than the next best.\n\nWith this result your interpretation would be that penguin body mass is better explained by a combination of species, sex, physical characteristics of the individual penguin, and the penguin's home island than it is by any of the other candidate models. In a publication you'd likely want to report these AIC comparisons (either parenthetically or in a table) so that reviewers could evaluate your logic.\n\n## Meta-Analysis\n\n\n\n## Additional Resources\n\n### Papers & Documents\n\n- [Understanding ‘It Depends’ in Ecology: A Guide to Hypothesising, Visualising and Interpreting Statistical Interactions](https://onlinelibrary.wiley.com/doi/10.1111/brv.12939). Spake _et al._, 2023. **Biological Reviews** \n- [Improving Quantitative Synthesis to Achieve Generality in Ecology](https://www.nature.com/articles/s41559-022-01891-z). Spake _et al._, 2022. **Nature Ecology and Evolution**\n- [Doing Meta-Analysis with R: A Hands-On Guide](https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/). Harrer _et al._, 2023.\n- [Mixed Effects Models and Extensions in Ecology with R](https://link.springer.com/book/10.1007/978-0-387-87458-6). 
Zuur _et al._, 2009.\n\n### Workshops & Courses\n\n- Matt Vuorre's [Bayesian Meta-Analysis with R, Stan, and brms](https://mvuorre.github.io/posts/2016-09-29-bayesian-meta-analysis/)\n\n### Websites\n\n- \n", "supporting": [ "mod_stats_files" ], diff --git a/_freeze/mod_stats/figure-html/mem-explore-graph-1.png b/_freeze/mod_stats/figure-html/mem-explore-graph-1.png index 38e60e6..e4d25af 100644 Binary files a/_freeze/mod_stats/figure-html/mem-explore-graph-1.png and b/_freeze/mod_stats/figure-html/mem-explore-graph-1.png differ diff --git a/_quarto.yml b/_quarto.yml index a6f0460..4dedfdc 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -52,7 +52,7 @@ website: href: mod_stats.qmd - section: "Phase IV -- Magnify" contents: - - text: "Next Steps & Proposals" + - text: "Next Steps" href: mod_next-steps.qmd - section: "Phase V -- Share" contents: diff --git a/mod_data-viz.qmd b/mod_data-viz.qmd index b78143b..9f7e52c 100644 --- a/mod_data-viz.qmd +++ b/mod_data-viz.qmd @@ -434,6 +434,16 @@ ggplot() + ::: +## Ordination + +If you are working with multivariate data (i.e., data where multiple columns are all response variables collectively) you may find ordination helpful. Ordination is the general term for many types of multivariate visualization but is typically used to refer to visualizing a distance or dissimilarity measure of the data. Such measures collapse all of those columns of response variables into fewer (typically two) index values that are easier to visualize. Common examples of this include Principal Components Analysis (PCA), Non-Metric Multidimensional Scaling (NMS / NMDS), or Principal Coordinates Analysis (PCoA / "metric multidimensional scaling"). + +## Maps + +You may find it valuable to create a map as an additional way of visualizing data. Many synthesis groups do this--particularly when there is a strong spatial component to the research questions and/or hypotheses. 
+ +Check out the [bonus spatial data module](https://lter.github.io/ssecr/mod_spatial.html) for more information on map-making if this is of interest! + ## Additional Resources ### Papers & Documents diff --git a/mod_findings.qmd b/mod_findings.qmd index 2369768..8d1eede 100644 --- a/mod_findings.qmd +++ b/mod_findings.qmd @@ -15,10 +15,18 @@ After completing this module you will be able to: - Determine audience motivations and interest - Translate communication into various formats based on efficacy with target group -## Module Content +## Effective Communication +## Publishing Code, Data, and Results + + + + +## Data Management Plans + + ## Additional Resources ### Papers & Documents diff --git a/mod_next-steps.qmd b/mod_next-steps.qmd index f8a420f..8f24266 100644 --- a/mod_next-steps.qmd +++ b/mod_next-steps.qmd @@ -1,5 +1,5 @@ --- -title: "Next Steps & Proposal Writing" +title: "Next Steps & Logic Models" --- ## Overview diff --git a/mod_reproducibility.qmd b/mod_reproducibility.qmd index 7663671..17f3794 100644 --- a/mod_reproducibility.qmd +++ b/mod_reproducibility.qmd @@ -63,6 +63,8 @@ Here are some rules to keep in mind as you decide how to organize your project: Keeping all inputs, outputs, and documentation in a single folder makes it easier to collaborate and share all project materials. Also, most programming applications (RStudio, VS Code, etc.) work best when all needed files are in the same folder. +Note that how you define "project" may affect the number of folders you need! Some synthesis projects may separate data harmonization into its own project while for others that same effort might not warrant being considered as a separate project. Similarly, you may want to make a separate folder for each manuscript your group plans on writing so that the code for each paper is kept separate. + 2. **Organize content with sub-folders** Putting files that share a purpose or source into logical sub-folders is a great idea! 
This makes it easy to figure out where to put new content and reduces the effort of documenting project organization. Don't feel like you need to use an intricate web of sub-folders either! Just one level of sub-folders is enough for many projects. @@ -256,7 +258,7 @@ library(dplyr); library(magrittr); library(ggplot2) . . . ``` -In R the semicolon allows you to put multiple code operations in the same line of the script. Listing the needed libraries in this way thus lets everyone reading the code know exactly which packages they will need to have installed. +In R the semicolon allows you to put multiple code operations in the same line of the script. Listing the needed libraries in this way cuts down on the number of lines while still being precise about which packages are needed in the script. If you are feeling generous you could use the [`librarian` R package](https://cran.r-project.org/web/packages/librarian/index.html) to install packages that are not yet installed and simultaneously load all needed libraries. Note that users would still need to install librarian itself but this at least limits possible errors to one location. This is done like so: diff --git a/mod_spatial.qmd b/mod_spatial.qmd index d0789d8..d9a6725 100644 --- a/mod_spatial.qmd +++ b/mod_spatial.qmd @@ -27,7 +27,19 @@ If you'd like to follow along with the code chunks included throughout this modu install.packages("tidyverse") ``` -## Module Content +## Raster versus Vector Data + + + +## Coordinate Reference Systems + + + +## Making Maps + + + +## Extracting Spatial Data diff --git a/mod_stats.qmd b/mod_stats.qmd index 49103e3..b1add29 100644 --- a/mod_stats.qmd +++ b/mod_stats.qmd @@ -5,7 +5,12 @@ code-annotations: hover ## Overview -Given the wide range in statistical training in graduate curricula (and corresponding breadth of experience among early career researchers), we'll be approaching this module in a different way than the others. 
One half of the module will use a "flipped approach" where project teams will share their proposed analyses with one another. The other half of the module will be dedicated to analyses that are more common in--or exclusive to--synthesis research. Content produced by project teams during the flipped half may be linked in the '[Additional Resources](https://lter.github.io/ssecr/mod_stats.html#additional-resources)' section at the bottom of this module at the discretion of each team. Otherwise the content of this module will focus only on the non-flipped content. +Given the wide range in statistical training in graduate curricula (and corresponding breadth of experience among early career researchers), we'll be approaching this module by splitting it into two halves. + +1. First half: a "flipped approach" where project teams will share their proposed analyses with one another +2. Second half: typical instructional module dedicated to **analyses that are more common in--or exclusive to--synthesis research**. + +Content produced by project teams during the flipped half may be linked in the '[Additional Resources](https://lter.github.io/ssecr/mod_stats.html#additional-resources)' section at the bottom of this module at the discretion of each team. Otherwise the content of this module will focus only on the non-flipped content. ## Learning Objectives diff --git a/mod_version-control.qmd b/mod_version-control.qmd index 296cfb9..aadb029 100644 --- a/mod_version-control.qmd +++ b/mod_version-control.qmd @@ -17,20 +17,70 @@ After completing this module you will be able to: - Sketch the RStudio-to-GitHub order of operations - Use RStudio, Git, and GitHub to collaborate with version control -## Module Content +## NCEAS SciComp Workshop Materials The workshop materials we will be working through live [here](https://nceas.github.io/scicomp-workshop-collaborative-coding/) but for convenience we have also embedded the workshop directly into the SSECR course website (see below). 
```{=html} - + ``` +## Collaborating with Git + +It is important to remember that while Git is a phenomenal tool for collaboration, it is _not_ Google Docs! You can work together but you cannot work simultaneously in the same files. Working at the same time is how merge conflicts happen, which can be a huge pain to untangle after the fact. Fortunately, avoiding merge conflicts is relatively simple! Here are a few strategies. + +:::{.panel-tabset} +## Separate Scripts + +At its simplest, you can make a separate script for each group member and have each of you work _exclusively_ in your own script. If no one ever works in your script you will never have a merge conflict even if you are working in your script at the same time as someone else is working in theirs. + +You can do this by all working on separate scripts that are trying to do the same thing or you can delegate a particular script in the workflow to a single person (e.g., one person is the only one allowed to edit the 'data wrangling' script, another is the only one allowed to edit the 'analysis' script, etc.). + +**Recommendation: Worth Discussing!** + +## Work in Shifts + +You might also decide to work together on the same scripts and just stagger when you are working so that all of your changes are made, committed, and pushed before the next person begins work. This is a particularly nice option if you have people in different time zones because someone in Maine can likely finish working on code before another team member living in Oregon has even woken up, much less started working on code. + +For this to work _you will need to communicate extensively with the rest of your team_ so that you are absolutely sure that you won't start working before someone else has finished their edits. + +**Recommendation: Worth Discussing!** + +## Work in Forks + +GitHub does offer a "fork" feature where people can make a copy of a given repository that they then 'own'. 
Forks are connected to the source repository and you can open a pull request to get the edits from one fork into the source repository. + +This may sound like a perfect fit for collaboration but in reality it introduces significant hurdles! Consider the following: + +1. It is difficult to know where the "best" version of the code lives + +It is equally likely for the primary code version to be in any group member's fork (or the original repository). So if you want to re-run a set of analyses you'll need to hunt down which fork the current script lives in rather than consulting a single repository in which you all work together. + +2. You essentially guarantee significant merge conflicts + +If everyone is working independently and submitting pull requests to merge back into the main repository, you all but ensure that people will make different edits that GitHub then doesn't know how to resolve. The pull request will tell you that there are merge conflicts but you still need to fix them yourself--and now that fixing effort must be done in someone else's fork of the repository. + +3. It's not the intended use of GitHub forks + +Forks are intended for when you want to take a set of code and then "go your own way" with that code base. While there is a mechanism for contributing those edits back to the main repository, forking is really better suited to cases where you never intend to open a pull request and thus don't have to worry about eventual merge conflicts. A good example: you might attend a workshop and decide to offer a similar workshop yourself. You could then fork the original workshop's repository to serve as a starting point for your version and save yourself from unnecessary labor. It would be bizarre for you to suggest that your workshop should _replace_ the original one even if it did begin with that content. + +**Recommendation: Don't Do This** + +## Single Code Author + +You may be tempted to just delegate all code editing to a single person in the group. 
While this strategy does guarantee that there will never be a merge conflict, it is also deeply inequitable as it places an unfair share of the labor of the project on one person. + +Practically speaking, this also encourages an atmosphere where only one person can even _read_ your group's code. This makes it difficult for other group members to contribute and ultimately may cause your group to 'miss out on' novel insights. + +**Recommendation: Don't Do This** +::: + ## Additional Resources ### Papers & Documents -- [Not Just for Programmers: How GitHub can Accelerate Collaborative and Reproducible Research in Ecology and Evolution](https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.14108). Pereira Braga _et al._, 2023. **Methods in Ecology and Evolution** -- [Git Cheat Sheet](https://education.github.com/git-cheat-sheet-education.pdf). GitHub +- Pereira Braga _et al._, [Not Just for Programmers: How GitHub can Accelerate Collaborative and Reproducible Research in Ecology and Evolution](https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.14108). **2023**. _Methods in Ecology and Evolution_ +- GitHub, [Git Cheat Sheet](https://education.github.com/git-cheat-sheet-education.pdf). **2023**. ### Workshops & Courses diff --git a/policy_ai.qmd b/policy_ai.qmd index abc05ad..9a85b68 100644 --- a/policy_ai.qmd +++ b/policy_ai.qmd @@ -1,5 +1,5 @@ -Artificial intelligence (AI) tools are increasingly well-known and widely discussed in the context of data science. AI products can increase the efficiency of code writing and are becoming a common part of the data science landscape. For the purposes of this course, we **strongly recommend that you _do not_ use AI tools**. There is an under-discussed ethical consideration to the use and training of these tools in addition to their known practical limitations. 
However, the main reason we suggest you not use them for this class though is that leaning too heavily upon AI tools is likely to negatively impact your learning and skill acquisition. +Artificial intelligence (AI) tools are increasingly well-known and widely discussed in the context of data science. AI products can increase the efficiency of code writing and are becoming a common part of the data science landscape. For the purposes of this course, we **strongly recommend that you _do not_ use AI tools to write code**. There is an under-discussed ethical consideration to the use and training of these tools in addition to their known practical limitations. However, the main reason we suggest you not use them for this class is that leaning too heavily upon AI tools is likely to negatively impact your learning and skill acquisition. You may have prior experience with some of the quantitative skills this course aims to teach but others are likely new to you. During the first steps of learning any new skill, it can be really helpful to struggle a bit in solving problems. Your efforts now will help refine your troubleshooting skills and will likely make it easier to remember how you solved a given problem the next time it arises. Over-use of AI tools can short circuit this pathway to mastery. Once you have become a proficient coder, you will be better able to identify and avoid any distortions or assumptions introduced by relying on AI.