From a3a99cf99a97d5f94733c9f6b2b01bcea96f592b Mon Sep 17 00:00:00 2001 From: njlyon0 Date: Fri, 19 Apr 2024 10:47:34 -0400 Subject: [PATCH] Fleshed out column uniting/separating topic (in data wrangling module) and moved custom function topic header to the end --- _freeze/mod_wrangle/execute-results/html.json | 4 +- mod_wrangle.qmd | 44 ++++++++++++++++--- 2 files changed, 40 insertions(+), 8 deletions(-) diff --git a/_freeze/mod_wrangle/execute-results/html.json b/_freeze/mod_wrangle/execute-results/html.json index 035359a..e24dd84 100644 --- a/_freeze/mod_wrangle/execute-results/html.json +++ b/_freeze/mod_wrangle/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "93863f4728e4e88bf466b3197979b7f1", + "hash": "7dc8173c5e2391b685571da40324c8cf", "result": { - "markdown": "---\ntitle: \"Data Harmonization & Wrangling\"\ncode-annotations: hover\n---\n\n\n## Overview\n\nNow that we have covered how to find data and use data visualization methods to explore it, we can move on to combining separate data files and preparing that combined data file for analysis. For the purposes of this module we're adopting a very narrow view of harmonization and a very broad view of wrangling but this distinction aligns well with two discrete philosophical/practical arenas. To make those definitions explicit:\n\n- \"Harmonization\" = process of combining separate primary data objects into one object. This includes things like synonymizing columns, or changing data format to support combination. This _excludes_ quality control steps--even those that are undertaken before harmonization begins.\n\n- \"Wrangling\" = all modifications to data meant to create an analysis-ready 'tidy' data object. This includes quality control, unit conversions, and data 'shape' changes to name a few. Note that attaching ancillary data to your primary data object (e.g., attaching temperature data to a dataset on plant species composition) _also falls into this category!_\n\n## Learning Objectives\n\nAfter completing this module you will be able to: \n\n- Identify typical steps in data harmonization and wrangling workflows\n- Create a harmonization workflow\n- Define quality control\n- Summarize typical operations in a quality control workflow\n- Use regular expressions to perform flexible text operations\n- Write custom functions to reduce code duplication\n- Identify value of and typical obstacles to data 'joining'\n- Explain benefits and drawbacks of using data shape to streamline code\n- Design a complete data wrangling workflow\n\n## Needed Packages\n\nIf you'd like to follow along with the code chunks included throughout this module, you'll need to install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Note that these lines only need to be run once per computer\n## So you can skip this step if you've installed these before\ninstall.packages(\"ltertools\")\ninstall.packages(\"lterdatasampler\")\ninstall.packages(\"psych\")\ninstall.packages(\"supportR\")\ninstall.packages(\"tidyverse\")\n```\n:::\n\n\nWe'll load the Tidyverse meta-package here to have access to many of its useful tools when we need them later.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load tidyverse\nlibrary(tidyverse)\n```\n:::\n\n\n\n## Harmonizing Data\n\nData harmonization is an interesting topic in that it is _vital_ for synthesis projects but only very rarely relevant for primary research. Synthesis projects must reckon with the data choices made by each team of original data collectors. 
These collectors may or may not have recorded their judgement calls (or indeed, any metadata) but before synthesis work can be meaningfully done these independent datasets must be made comparable to one another and combined.\n\nFor tabular data, we recommend using the [`ltertools` R package](https://lter.github.io/ltertools/) to perform any needed harmonization. This package relies on a \"column key\" to translate the original column names into equivalents that apply across all datasets. Users can generate this column key however they would like but Google Sheets is a strong option as it allows multiple synthesis team members to simultaneously work on filling in the needed bits of the key.\n\nThe column key requires three columns:\n\n1. \"source\" -- Name of the raw file\n2. \"raw_name\" -- Name of all raw columns in that file to be synonymized\n3. \"tidy_name\" -- New name for each raw column that should be carried to the harmonized data\n\nNote that any raw names either not included in the column key or that lack a tidy name equivalent will be excluded from the final data object. For more information, consult the `ltertools` [package vignette](https://lter.github.io/ltertools/articles/ltertools.html). For convenience, we're attaching the visual diagram of this method of harmonization from the package vignette.\n\n

[Image: harmonization workflow diagram from the ltertools package vignette (alt text begins "Four...")]
\n\n## Wrangling Data\n\nData wrangling is a _huge_ subject that covers a wide range of topics. In this part of the module, we'll attempt to touch on a wide range of tools that may prove valuable to your data wrangling efforts. This is certainly non-exhaustive and you'll likely find new tools that fit your coding style and professional intuition better. However, hopefully the topics covered below provide a nice 'jumping off' point to reproducibly prepare your data for analysis and visualization work later in the lifecycle of the project.\n\nThis module will use example data to demonstrate these tools but as we work through these topics you should feel free to substitute a dataset of your choosing! If you don't have one in mind, you can use the example dataset shown in the code chunks throughout this module.\n\nThis dataset comes from the [`lterdatasampler` R package](https://lter.github.io/lterdatasampler/) and the data are about fiddler crabs (_Minuca pugnax_) at the [Plum Island Ecosystems LTER](https://pie-lter.ecosystems.mbl.edu/welcome-plum-island-ecosystems-lter) site.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the lterdatasampler package\nlibrary(lterdatasampler)\n\n# Load the fiddler crab dataset\ndata(pie_crab)\n```\n:::\n\n\n### Exploring the Data\n\nBefore beginning any code operations, it's important to get a sense for the data. Characteristics like the dimensions of the dataset, the column names, and the type of information stored in each column are all crucial pre-requisites to knowing what tools can or should be used on the data.\n\nChecking the data structure is one way of getting a lot of this high-level information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Check dataset structure\nstr(pie_crab)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [392 × 9] (S3: tbl_df/tbl/data.frame)\n $ date : Date[1:392], format: \"2016-07-24\" \"2016-07-24\" ...\n $ latitude : num [1:392] 30 30 30 30 30 30 30 30 30 30 ...\n $ site : chr [1:392] \"GTM\" \"GTM\" \"GTM\" \"GTM\" ...\n $ size : num [1:392] 12.4 14.2 14.5 12.9 12.4 ...\n $ air_temp : num [1:392] 21.8 21.8 21.8 21.8 21.8 ...\n $ air_temp_sd : num [1:392] 6.39 6.39 6.39 6.39 6.39 ...\n $ water_temp : num [1:392] 24.5 24.5 24.5 24.5 24.5 ...\n $ water_temp_sd: num [1:392] 6.12 6.12 6.12 6.12 6.12 ...\n $ name : chr [1:392] \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" ...\n```\n:::\n:::\n\n\nFor data that are primarily numeric, you may find data summary functions to be valuable. Note that most functions of this type do not provide useful information on text columns so you'll need to find that information elsewhere.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Get a simple summary of the data\nsummary(pie_crab)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n date latitude site size \n Min. :2016-07-24 Min. :30.00 Length:392 Min. : 6.64 \n 1st Qu.:2016-07-28 1st Qu.:34.00 Class :character 1st Qu.:12.02 \n Median :2016-08-01 Median :39.10 Mode :character Median :14.44 \n Mean :2016-08-02 Mean :37.69 Mean :14.66 \n 3rd Qu.:2016-08-09 3rd Qu.:41.60 3rd Qu.:17.34 \n Max. :2016-08-13 Max. :42.70 Max. :23.43 \n air_temp air_temp_sd water_temp water_temp_sd \n Min. :10.29 Min. :6.391 Min. :13.98 Min. 
:4.838 \n 1st Qu.:12.05 1st Qu.:8.110 1st Qu.:14.33 1st Qu.:6.567 \n Median :13.93 Median :8.410 Median :17.50 Median :6.998 \n Mean :15.20 Mean :8.654 Mean :17.65 Mean :7.252 \n 3rd Qu.:18.63 3rd Qu.:9.483 3rd Qu.:20.54 3rd Qu.:7.865 \n Max. :21.79 Max. :9.965 Max. :24.50 Max. :9.121 \n name \n Length:392 \n Class :character \n Mode :character \n \n \n \n```\n:::\n:::\n\n\nFor text columns it can sometimes be useful to simply look at the unique entries in a given column and sort them alphabetically for ease of parsing.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Look at the sites included in the data\nsort(unique(pie_crab$site))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"BC\" \"CC\" \"CT\" \"DB\" \"GTM\" \"JC\" \"NB\" \"NIB\" \"PIE\" \"RC\" \"SI\" \"VCR\"\n[13] \"ZI\" \n```\n:::\n:::\n\n\nFor those of you who think more visually, a histogram can be a nice way of examining numeric data. There are simple histogram functions in the 'base' packages of most programming languages but it can sometimes be worth it to use those from special libraries because they can often convey additional detail.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Load the psych library\nlibrary(psych)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n\nAttaching package: 'psych'\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThe following objects are masked from 'package:ggplot2':\n\n %+%, alpha\n```\n:::\n\n```{.r .cell-code}\n# Get the histogram of crab \"size\" (carapace width in mm)\npsych::multi.hist(pie_crab$size)\n```\n\n::: {.cell-output-display}\n![](mod_wrangle_files/figure-html/multi-hist-1.png){fig-align='center' width=384}\n:::\n:::\n\n\n### Quality Control\n\nYou may have encountered the phrase \"QA/QC\" (Quality Assurance / Quality Control) in relation to data cleaning. Technically, quality assurance only encapsulates _preventative_ measures for reducing errors. One example of QA would be using a template for field datasheets because using standard fields reduces the risk that data are recorded inconsistently and/or incompletely. Quality control on the other hand refers to all steps taken to resolve errors _after_ data are collected. Any code that you write to fix typos or remove outliers from a dataset falls under the umbrella of QC.\n\nIn synthesis work, QA is only very rarely an option. You'll be working with datasets that have already been collected and attempting to handle any issues _post hoc_ which means the vast majority of data wrangling operations will be quality control methods. These QC efforts can be **incredibly** time-consuming so using a programming language (like {{< fa brands r-project >}} R or {{< fa brands python >}} Python) is a dramatic improvement over manually looking through the data using Microsoft Excel or other programs like it.\n\n#### Number Checking\n\nWhen you read in a dataset and a column that _should be_ numeric is instead read in as a character, it can be a sign that there are malformed numbers lurking in the background. 
Checking for and resolving these non-numbers is preferable to simply coercing the column into being numeric because the latter method typically changes those values to 'NA' where a human might be able to deduce the true number each value 'should be.'\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the supportR package\nlibrary(supportR)\n\n# Create an example dataset with non-numbers in ideally numeric columns\nfaux_df <- data.frame(\"species\" = c(\"salmon\", \"bass\", \"halibut\", \"eel\"),\n \"count\" = c(1, \"14x\", \"_23\", 12))\n\n# Check for malformed numbers in column(s) that should be numeric\nbad_nums <- supportR::num_check(data = faux_df, col = \"count\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nFor 'count', 2 non-numbers identified: '14x' | '_23'\n```\n:::\n:::\n\n\nIn the above example, \"14x\" would be coerced to NA if you simply force the column without checking but you could drop the \"x\" with text replacing methods once you use tools like this one to flag it for your attention.\n\n#### Text Replacement\n\nOne of the simpler ways of handling text issues is just to replace a string with another string. Most programming languages support this functionality.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Use pattern match/replace to simplify problem entries\nfaux_df$count <- gsub(pattern = \"x|_\", replacement = \"\", x = faux_df$count)\n\n# Check that they are fixed\nbad_nums <- supportR::num_check(data = faux_df, col = \"count\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nFor 'count', no non-numeric values identified.\n```\n:::\n:::\n\n\nThe vertical line in the `gsub` example above lets us search for (and replace) multiple patterns. Note however that while you can search for many patterns at once, only a single replacement value can be provided with this function.\n\n#### Regular Expressions\n\nYou may sometimes want to perform more generic string matching where you don't necessarily know--or want to list--all possible strings to find and replace. For instance, you may want remove any letter in a numeric column or find and replace numbers with some sort of text note. \"Regular expressions\" are how programmers specify these generic matches and using them can be a nice way of streamlining code.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Make a test vector\nregex_vec <- c(\"hello\", \"123\", \"goodbye\", \"456\")\n\n# Find all numbers and replace with the letter X\ngsub(pattern = \"[[:digit:]]\", replacement = \"x\", x = regex_vec)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"hello\" \"xxx\" \"goodbye\" \"xxx\" \n```\n:::\n\n```{.r .cell-code}\n# Replace any number of letters with only a single 0\ngsub(pattern = \"[[:alpha:]]+\", replacement = \"0\", x = regex_vec)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"0\" \"123\" \"0\" \"456\"\n```\n:::\n:::\n\n\n### Conditionals\n\nRather than finding and replacing content, you may want to create a new column based on the contents of a different column. In plain language you might phrase this as 'if column X has \\[some values\\] then column Y should have \\[other values\\]'. These operations are called conditionals and are an important part of data wrangling.\n\nIf you only want your conditional to support two outcomes (as in an either/or statement) there are useful functions that support this. 
Let's return to our Plum Island Ecosystems crab dataset for an example.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Make a new colum with an either/or conditional\npie_crab_v2 <- pie_crab %>% \n dplyr::mutate(size_category = ifelse(test = (size >= 15),\n yes = \"big\",\n no = \"small\")) # <1>\n\n# Check structure of the resulting data\nstr(pie_crab_v2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [392 × 10] (S3: tbl_df/tbl/data.frame)\n $ date : Date[1:392], format: \"2016-07-24\" \"2016-07-24\" ...\n $ latitude : num [1:392] 30 30 30 30 30 30 30 30 30 30 ...\n $ site : chr [1:392] \"GTM\" \"GTM\" \"GTM\" \"GTM\" ...\n $ size : num [1:392] 12.4 14.2 14.5 12.9 12.4 ...\n $ air_temp : num [1:392] 21.8 21.8 21.8 21.8 21.8 ...\n $ air_temp_sd : num [1:392] 6.39 6.39 6.39 6.39 6.39 ...\n $ water_temp : num [1:392] 24.5 24.5 24.5 24.5 24.5 ...\n $ water_temp_sd: num [1:392] 6.12 6.12 6.12 6.12 6.12 ...\n $ name : chr [1:392] \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" ...\n $ size_category: chr [1:392] \"small\" \"small\" \"small\" \"small\" ...\n```\n:::\n:::\n\n1. `mutate` makes a new column, `ifelse` is actually doing the conditional\n\nIf you have multiple different conditions you _can_ just stack these either/or conditionals together but this gets cumbersome quickly. It is preferable to instead use a function that supports as many alternates as you want!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Make a new column with several conditionals\npie_crab_v2 <- pie_crab %>% \n dplyr::mutate(size_category = dplyr::case_when( \n size <= 10 ~ \"tiny\", # <1>\n size > 10 & size <= 15 ~ \"small\",\n size > 15 & size <= 20 ~ \"big\",\n size > 20 ~ \"huge\",\n TRUE ~ \"uncategorized\")) # <2>\n\n# Check the results' structure\nstr(pie_crab_v2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [392 × 10] (S3: tbl_df/tbl/data.frame)\n $ date : Date[1:392], format: \"2016-07-24\" \"2016-07-24\" ...\n $ latitude : num [1:392] 30 30 30 30 30 30 30 30 30 30 ...\n $ site : chr [1:392] \"GTM\" \"GTM\" \"GTM\" \"GTM\" ...\n $ size : num [1:392] 12.4 14.2 14.5 12.9 12.4 ...\n $ air_temp : num [1:392] 21.8 21.8 21.8 21.8 21.8 ...\n $ air_temp_sd : num [1:392] 6.39 6.39 6.39 6.39 6.39 ...\n $ water_temp : num [1:392] 24.5 24.5 24.5 24.5 24.5 ...\n $ water_temp_sd: num [1:392] 6.12 6.12 6.12 6.12 6.12 ...\n $ name : chr [1:392] \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" ...\n $ size_category: chr [1:392] \"small\" \"small\" \"small\" \"small\" ...\n```\n:::\n:::\n\n1. Syntax is 'test ~ what to do when true'\n2. This line is a catch-all for any rows that _don't_ meet previous conditions\n\nNote that you can use functions like this one when you do have an either/or conditional if you prefer this format.\n\n### Custom Functions\n\n\n\n\n### Uniting / Separating Columns\n\n\n`tidyr::separate_wider_delim`\n\n### Joining Data\n\na.k.a. attaching data by columns\n\n`dplyr::left_join`\n\n`supportR::diff_check`\n\n\n### Leveraging Data Shape\n\n1. `tidyr::pivot_longer`\n2. operations on consolidated columns\n3. `tidyr::pivot_wider`\n\n\n\n\n\n\n\n\n\n\n\n\n## Additional Resources\n\n### Papers & Documents\n\n- \n\n### Workshops & Courses\n\n- Data Analysis and Visualization in R for Ecologists, [Episode 4: Manipulating, Analyzing, and Exporting Data with `tidyverse`](https://datacarpentry.org/R-ecology-lesson/03-dplyr.html). 
The Carpentries\n- [Coding in the Tidyverse](https://nceas.github.io/scicomp-workshop-tidyverse/). NCEAS Scientific Computing Team, 2023.\n- coreR Course, [Chapter 8: Cleaning & Wrangling Data](https://learning.nceas.ucsb.edu/2023-10-coreR/session_08.html). NCEAS Learning Hub, 2023.\n- coreR Course, [Chapter 16: Writing Functions & Packages](https://learning.nceas.ucsb.edu/2023-10-coreR/session_16.html). NCEAS Learning Hub, 2023.\n\n### Websites\n\n- \n", + "markdown": "---\ntitle: \"Data Harmonization & Wrangling\"\ncode-annotations: hover\n---\n\n\n## Overview\n\nNow that we have covered how to find data and use data visualization methods to explore it, we can move on to combining separate data files and preparing that combined data file for analysis. For the purposes of this module we're adopting a very narrow view of harmonization and a very broad view of wrangling but this distinction aligns well with two discrete philosophical/practical arenas. To make those definitions explicit:\n\n- \"Harmonization\" = process of combining separate primary data objects into one object. This includes things like synonymizing columns, or changing data format to support combination. This _excludes_ quality control steps--even those that are undertaken before harmonization begins.\n\n- \"Wrangling\" = all modifications to data meant to create an analysis-ready 'tidy' data object. This includes quality control, unit conversions, and data 'shape' changes to name a few. Note that attaching ancillary data to your primary data object (e.g., attaching temperature data to a dataset on plant species composition) _also falls into this category!_\n\n## Learning Objectives\n\nAfter completing this module you will be able to: \n\n- Identify typical steps in data harmonization and wrangling workflows\n- Create a harmonization workflow\n- Define quality control\n- Summarize typical operations in a quality control workflow\n- Use regular expressions to perform flexible text operations\n- Write custom functions to reduce code duplication\n- Identify value of and typical obstacles to data 'joining'\n- Explain benefits and drawbacks of using data shape to streamline code\n- Design a complete data wrangling workflow\n\n## Needed Packages\n\nIf you'd like to follow along with the code chunks included throughout this module, you'll need to install the following packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Note that these lines only need to be run once per computer\n## So you can skip this step if you've installed these before\ninstall.packages(\"ltertools\")\ninstall.packages(\"lterdatasampler\")\ninstall.packages(\"psych\")\ninstall.packages(\"supportR\")\ninstall.packages(\"tidyverse\")\n```\n:::\n\n\nWe'll load the Tidyverse meta-package here to have access to many of its useful tools when we need them later.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load tidyverse\nlibrary(tidyverse)\n```\n:::\n\n\n\n## Harmonizing Data\n\nData harmonization is an interesting topic in that it is _vital_ for synthesis projects but only very rarely relevant for primary research. Synthesis projects must reckon with the data choices made by each team of original data collectors. 
These collectors may or may not have recorded their judgement calls (or indeed, any metadata) but before synthesis work can be meaningfully done these independent datasets must be made comparable to one another and combined.\n\nFor tabular data, we recommend using the [`ltertools` R package](https://lter.github.io/ltertools/) to perform any needed harmonization. This package relies on a \"column key\" to translate the original column names into equivalents that apply across all datasets. Users can generate this column key however they would like but Google Sheets is a strong option as it allows multiple synthesis team members to simultaneously work on filling in the needed bits of the key.\n\nThe column key requires three columns:\n\n1. \"source\" -- Name of the raw file\n2. \"raw_name\" -- Name of all raw columns in that file to be synonymized\n3. \"tidy_name\" -- New name for each raw column that should be carried to the harmonized data\n\nNote that any raw names either not included in the column key or that lack a tidy name equivalent will be excluded from the final data object. For more information, consult the `ltertools` [package vignette](https://lter.github.io/ltertools/articles/ltertools.html). For convenience, we're attaching the visual diagram of this method of harmonization from the package vignette.\n\n

[Image: harmonization workflow diagram from the ltertools package vignette (alt text begins "Four...")]
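To make the column key idea concrete, here is a minimal sketch of how such a key could be assembled and used in code. The file names and raw column names below are invented for illustration, and the `harmonize()` call assumes the interface described in the package vignette.

```r
# Load ltertools
library(ltertools)

# Assemble a column key by hand (a Google Sheet read into R works equally well)
## File names and raw column names here are hypothetical
column_key <- data.frame(
  "source" = c("site_a.csv", "site_a.csv", "site_b.csv", "site_b.csv"),
  "raw_name" = c("SiteName", "CrabSize", "station", "size_mm"),
  "tidy_name" = c("site", "size", "site", "size")
)

# Use the key to combine all raw files found in a folder
## Raw columns not in the key (or lacking a 'tidy_name') are dropped
harmonized_df <- ltertools::harmonize(key = column_key, raw_folder = "raw_data",
                                      data_format = "csv")
```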
\n\n## Wrangling Data\n\nData wrangling is a _huge_ subject that covers a wide range of topics. In this part of the module, we'll attempt to touch on a wide range of tools that may prove valuable to your data wrangling efforts. This is certainly non-exhaustive and you'll likely find new tools that fit your coding style and professional intuition better. However, hopefully the topics covered below provide a nice 'jumping off' point to reproducibly prepare your data for analysis and visualization work later in the lifecycle of the project.\n\nThis module will use example data to demonstrate these tools but as we work through these topics you should feel free to substitute a dataset of your choosing! If you don't have one in mind, you can use the example dataset shown in the code chunks throughout this module.\n\nThis dataset comes from the [`lterdatasampler` R package](https://lter.github.io/lterdatasampler/) and the data are about fiddler crabs (_Minuca pugnax_) at the [Plum Island Ecosystems LTER](https://pie-lter.ecosystems.mbl.edu/welcome-plum-island-ecosystems-lter) site.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the lterdatasampler package\nlibrary(lterdatasampler)\n\n# Load the fiddler crab dataset\ndata(pie_crab)\n```\n:::\n\n\n### Exploring the Data\n\nBefore beginning any code operations, it's important to get a sense for the data. Characteristics like the dimensions of the dataset, the column names, and the type of information stored in each column are all crucial pre-requisites to knowing what tools can or should be used on the data.\n\nChecking the data structure is one way of getting a lot of this high-level information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Check dataset structure\nstr(pie_crab)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [392 × 9] (S3: tbl_df/tbl/data.frame)\n $ date : Date[1:392], format: \"2016-07-24\" \"2016-07-24\" ...\n $ latitude : num [1:392] 30 30 30 30 30 30 30 30 30 30 ...\n $ site : chr [1:392] \"GTM\" \"GTM\" \"GTM\" \"GTM\" ...\n $ size : num [1:392] 12.4 14.2 14.5 12.9 12.4 ...\n $ air_temp : num [1:392] 21.8 21.8 21.8 21.8 21.8 ...\n $ air_temp_sd : num [1:392] 6.39 6.39 6.39 6.39 6.39 ...\n $ water_temp : num [1:392] 24.5 24.5 24.5 24.5 24.5 ...\n $ water_temp_sd: num [1:392] 6.12 6.12 6.12 6.12 6.12 ...\n $ name : chr [1:392] \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" ...\n```\n:::\n:::\n\n\nFor data that are primarily numeric, you may find data summary functions to be valuable. Note that most functions of this type do not provide useful information on text columns so you'll need to find that information elsewhere.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Get a simple summary of the data\nsummary(pie_crab)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n date latitude site size \n Min. :2016-07-24 Min. :30.00 Length:392 Min. : 6.64 \n 1st Qu.:2016-07-28 1st Qu.:34.00 Class :character 1st Qu.:12.02 \n Median :2016-08-01 Median :39.10 Mode :character Median :14.44 \n Mean :2016-08-02 Mean :37.69 Mean :14.66 \n 3rd Qu.:2016-08-09 3rd Qu.:41.60 3rd Qu.:17.34 \n Max. :2016-08-13 Max. :42.70 Max. :23.43 \n air_temp air_temp_sd water_temp water_temp_sd \n Min. :10.29 Min. :6.391 Min. :13.98 Min. 
:4.838 \n 1st Qu.:12.05 1st Qu.:8.110 1st Qu.:14.33 1st Qu.:6.567 \n Median :13.93 Median :8.410 Median :17.50 Median :6.998 \n Mean :15.20 Mean :8.654 Mean :17.65 Mean :7.252 \n 3rd Qu.:18.63 3rd Qu.:9.483 3rd Qu.:20.54 3rd Qu.:7.865 \n Max. :21.79 Max. :9.965 Max. :24.50 Max. :9.121 \n name \n Length:392 \n Class :character \n Mode :character \n \n \n \n```\n:::\n:::\n\n\nFor text columns it can sometimes be useful to simply look at the unique entries in a given column and sort them alphabetically for ease of parsing.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Look at the sites included in the data\nsort(unique(pie_crab$site))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"BC\" \"CC\" \"CT\" \"DB\" \"GTM\" \"JC\" \"NB\" \"NIB\" \"PIE\" \"RC\" \"SI\" \"VCR\"\n[13] \"ZI\" \n```\n:::\n:::\n\n\nFor those of you who think more visually, a histogram can be a nice way of examining numeric data. There are simple histogram functions in the 'base' packages of most programming languages but it can sometimes be worth it to use those from special libraries because they can often convey additional detail.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Load the psych library\nlibrary(psych)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n\nAttaching package: 'psych'\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThe following objects are masked from 'package:ggplot2':\n\n %+%, alpha\n```\n:::\n\n```{.r .cell-code}\n# Get the histogram of crab \"size\" (carapace width in mm)\npsych::multi.hist(pie_crab$size)\n```\n\n::: {.cell-output-display}\n![](mod_wrangle_files/figure-html/multi-hist-1.png){fig-align='center' width=384}\n:::\n:::\n\n\n### Quality Control\n\nYou may have encountered the phrase \"QA/QC\" (Quality Assurance / Quality Control) in relation to data cleaning. Technically, quality assurance only encapsulates _preventative_ measures for reducing errors. One example of QA would be using a template for field datasheets because using standard fields reduces the risk that data are recorded inconsistently and/or incompletely. Quality control on the other hand refers to all steps taken to resolve errors _after_ data are collected. Any code that you write to fix typos or remove outliers from a dataset falls under the umbrella of QC.\n\nIn synthesis work, QA is only very rarely an option. You'll be working with datasets that have already been collected and attempting to handle any issues _post hoc_ which means the vast majority of data wrangling operations will be quality control methods. These QC efforts can be **incredibly** time-consuming so using a programming language (like {{< fa brands r-project >}} R or {{< fa brands python >}} Python) is a dramatic improvement over manually looking through the data using Microsoft Excel or other programs like it.\n\n#### Number Checking\n\nWhen you read in a dataset and a column that _should be_ numeric is instead read in as a character, it can be a sign that there are malformed numbers lurking in the background. 
Checking for and resolving these non-numbers is preferable to simply coercing the column into being numeric because the latter method typically changes those values to 'NA' where a human might be able to deduce the true number each value 'should be.'\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Load the supportR package\nlibrary(supportR)\n\n# Create an example dataset with non-numbers in ideally numeric columns\nfaux_df <- data.frame(\"species\" = c(\"salmon\", \"bass\", \"halibut\", \"eel\"),\n \"count\" = c(1, \"14x\", \"_23\", 12))\n\n# Check for malformed numbers in column(s) that should be numeric\nbad_nums <- supportR::num_check(data = faux_df, col = \"count\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nFor 'count', 2 non-numbers identified: '14x' | '_23'\n```\n:::\n:::\n\n\nIn the above example, \"14x\" would be coerced to NA if you simply force the column without checking but you could drop the \"x\" with text replacing methods once you use tools like this one to flag it for your attention.\n\n#### Text Replacement\n\nOne of the simpler ways of handling text issues is just to replace a string with another string. Most programming languages support this functionality.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Use pattern match/replace to simplify problem entries\nfaux_df$count <- gsub(pattern = \"x|_\", replacement = \"\", x = faux_df$count)\n\n# Check that they are fixed\nbad_nums <- supportR::num_check(data = faux_df, col = \"count\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nFor 'count', no non-numeric values identified.\n```\n:::\n:::\n\n\nThe vertical line in the `gsub` example above lets us search for (and replace) multiple patterns. Note however that while you can search for many patterns at once, only a single replacement value can be provided with this function.\n\n#### Regular Expressions\n\nYou may sometimes want to perform more generic string matching where you don't necessarily know--or want to list--all possible strings to find and replace. For instance, you may want remove any letter in a numeric column or find and replace numbers with some sort of text note. \"Regular expressions\" are how programmers specify these generic matches and using them can be a nice way of streamlining code.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Make a test vector\nregex_vec <- c(\"hello\", \"123\", \"goodbye\", \"456\")\n\n# Find all numbers and replace with the letter X\ngsub(pattern = \"[[:digit:]]\", replacement = \"x\", x = regex_vec)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"hello\" \"xxx\" \"goodbye\" \"xxx\" \n```\n:::\n\n```{.r .cell-code}\n# Replace any number of letters with only a single 0\ngsub(pattern = \"[[:alpha:]]+\", replacement = \"0\", x = regex_vec)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"0\" \"123\" \"0\" \"456\"\n```\n:::\n:::\n\n\n### Conditionals\n\nRather than finding and replacing content, you may want to create a new column based on the contents of a different column. In plain language you might phrase this as 'if column X has \\[some values\\] then column Y should have \\[other values\\]'. These operations are called conditionals and are an important part of data wrangling.\n\nIf you only want your conditional to support two outcomes (as in an either/or statement) there are useful functions that support this. 
Let's return to our Plum Island Ecosystems crab dataset for an example.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Make a new colum with an either/or conditional\npie_crab_v2 <- pie_crab %>% \n dplyr::mutate(size_category = ifelse(test = (size >= 15), # <1>\n yes = \"big\",\n no = \"small\"),\n .after = size) \n\n# Check structure of the resulting data\nstr(pie_crab_v2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [392 × 10] (S3: tbl_df/tbl/data.frame)\n $ date : Date[1:392], format: \"2016-07-24\" \"2016-07-24\" ...\n $ latitude : num [1:392] 30 30 30 30 30 30 30 30 30 30 ...\n $ site : chr [1:392] \"GTM\" \"GTM\" \"GTM\" \"GTM\" ...\n $ size : num [1:392] 12.4 14.2 14.5 12.9 12.4 ...\n $ size_category: chr [1:392] \"small\" \"small\" \"small\" \"small\" ...\n $ air_temp : num [1:392] 21.8 21.8 21.8 21.8 21.8 ...\n $ air_temp_sd : num [1:392] 6.39 6.39 6.39 6.39 6.39 ...\n $ water_temp : num [1:392] 24.5 24.5 24.5 24.5 24.5 ...\n $ water_temp_sd: num [1:392] 6.12 6.12 6.12 6.12 6.12 ...\n $ name : chr [1:392] \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" ...\n```\n:::\n:::\n\n1. `mutate` makes a new column, `ifelse` is actually doing the conditional\n\nIf you have multiple different conditions you _can_ just stack these either/or conditionals together but this gets cumbersome quickly. It is preferable to instead use a function that supports as many alternates as you want!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Make a new column with several conditionals\npie_crab_v2 <- pie_crab %>% \n dplyr::mutate(size_category = dplyr::case_when( \n size <= 10 ~ \"tiny\", # <1>\n size > 10 & size <= 15 ~ \"small\",\n size > 15 & size <= 20 ~ \"big\",\n size > 20 ~ \"huge\",\n TRUE ~ \"uncategorized\"), # <2>\n .after = size)\n\n# Check the results' structure\nstr(pie_crab_v2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [392 × 10] (S3: tbl_df/tbl/data.frame)\n $ date : Date[1:392], format: \"2016-07-24\" \"2016-07-24\" ...\n $ latitude : num [1:392] 30 30 30 30 30 30 30 30 30 30 ...\n $ site : chr [1:392] \"GTM\" \"GTM\" \"GTM\" \"GTM\" ...\n $ size : num [1:392] 12.4 14.2 14.5 12.9 12.4 ...\n $ size_category: chr [1:392] \"small\" \"small\" \"small\" \"small\" ...\n $ air_temp : num [1:392] 21.8 21.8 21.8 21.8 21.8 ...\n $ air_temp_sd : num [1:392] 6.39 6.39 6.39 6.39 6.39 ...\n $ water_temp : num [1:392] 24.5 24.5 24.5 24.5 24.5 ...\n $ water_temp_sd: num [1:392] 6.12 6.12 6.12 6.12 6.12 ...\n $ name : chr [1:392] \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" ...\n```\n:::\n:::\n\n1. Syntax is 'test ~ what to do when true'\n2. This line is a catch-all for any rows that _don't_ meet previous conditions\n\nNote that you can use functions like this one when you do have an either/or conditional if you prefer this format.\n\n### Uniting / Separating Columns\n\nSometimes one column has multiple pieces of information that you'd like to consider separately. A date column is a common example of this because particular months are always in a given season regardless of the specific day or year. 
So, it can be useful to break a complete date (i.e., year/month/day) into its component bits to be better able to access those pieces of information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Split date into each piece of temporal info\npie_crab_v3 <- pie_crab_v2 %>% \n tidyr::separate_wider_delim(cols = date, \n delim = \"-\", # <1>\n names = c(\"year\", \"month\", \"day\"),\n cols_remove = TRUE) # <2>\n\n# Check that out\nstr(pie_crab_v3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [392 × 12] (S3: tbl_df/tbl/data.frame)\n $ year : chr [1:392] \"2016\" \"2016\" \"2016\" \"2016\" ...\n $ month : chr [1:392] \"07\" \"07\" \"07\" \"07\" ...\n $ day : chr [1:392] \"24\" \"24\" \"24\" \"24\" ...\n $ latitude : num [1:392] 30 30 30 30 30 30 30 30 30 30 ...\n $ site : chr [1:392] \"GTM\" \"GTM\" \"GTM\" \"GTM\" ...\n $ size : num [1:392] 12.4 14.2 14.5 12.9 12.4 ...\n $ size_category: chr [1:392] \"small\" \"small\" \"small\" \"small\" ...\n $ air_temp : num [1:392] 21.8 21.8 21.8 21.8 21.8 ...\n $ air_temp_sd : num [1:392] 6.39 6.39 6.39 6.39 6.39 ...\n $ water_temp : num [1:392] 24.5 24.5 24.5 24.5 24.5 ...\n $ water_temp_sd: num [1:392] 6.12 6.12 6.12 6.12 6.12 ...\n $ name : chr [1:392] \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" ...\n```\n:::\n:::\n\n1. 'delim' is short for \"delimiter\" which we covered in the Reproducibility module\n2. This argument specifies whether to remove the original column when making the new columns\n\nWhile breaking apart a column's contents can be useful, it can also be helpful to combine the contents of several columns!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Re-combine data information back into date\npie_crab_v4 <- pie_crab_v3 %>% \n tidyr::unite(col = \"date\",\n sep = \"/\", # <1>\n year:day, \n remove = FALSE) # <2>\n\n# Structure check\nstr(pie_crab_v4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ntibble [392 × 13] (S3: tbl_df/tbl/data.frame)\n $ date : chr [1:392] \"2016/07/24\" \"2016/07/24\" \"2016/07/24\" \"2016/07/24\" ...\n $ year : chr [1:392] \"2016\" \"2016\" \"2016\" \"2016\" ...\n $ month : chr [1:392] \"07\" \"07\" \"07\" \"07\" ...\n $ day : chr [1:392] \"24\" \"24\" \"24\" \"24\" ...\n $ latitude : num [1:392] 30 30 30 30 30 30 30 30 30 30 ...\n $ site : chr [1:392] \"GTM\" \"GTM\" \"GTM\" \"GTM\" ...\n $ size : num [1:392] 12.4 14.2 14.5 12.9 12.4 ...\n $ size_category: chr [1:392] \"small\" \"small\" \"small\" \"small\" ...\n $ air_temp : num [1:392] 21.8 21.8 21.8 21.8 21.8 ...\n $ air_temp_sd : num [1:392] 6.39 6.39 6.39 6.39 6.39 ...\n $ water_temp : num [1:392] 24.5 24.5 24.5 24.5 24.5 ...\n $ water_temp_sd: num [1:392] 6.12 6.12 6.12 6.12 6.12 ...\n $ name : chr [1:392] \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" \"Guana Tolomoto Matanzas NERR\" ...\n```\n:::\n:::\n\n1. This is equivalent to the 'delim' argument in the previous function\n2. Comparable to the 'cols_remove' argument in the previous function\n\nNote in this output how despite re-combining data information the column is listed as a character column! Simply combining or separating data is not always enough so you need to really lean into frequent data structure checks to be sure that your data are structured in the way that you want.\n\n### Joining Data\n\na.k.a. attaching data by columns\n\n`dplyr::left_join`\n\n`supportR::diff_check`\n\n\n### Leveraging Data Shape\n\n1. `tidyr::pivot_longer`\n2. 
operations on consolidated columns\n3. `tidyr::pivot_wider`\n\n\n\n### Custom Functions\n\n\n\n\n\n\n\n\n\n\n\n\n## Additional Resources\n\n### Papers & Documents\n\n- \n\n### Workshops & Courses\n\n- Data Analysis and Visualization in R for Ecologists, [Episode 4: Manipulating, Analyzing, and Exporting Data with `tidyverse`](https://datacarpentry.org/R-ecology-lesson/03-dplyr.html). The Carpentries\n- [Coding in the Tidyverse](https://nceas.github.io/scicomp-workshop-tidyverse/). NCEAS Scientific Computing Team, 2023.\n- coreR Course, [Chapter 8: Cleaning & Wrangling Data](https://learning.nceas.ucsb.edu/2023-10-coreR/session_08.html). NCEAS Learning Hub, 2023.\n- coreR Course, [Chapter 16: Writing Functions & Packages](https://learning.nceas.ucsb.edu/2023-10-coreR/session_16.html). NCEAS Learning Hub, 2023.\n\n### Websites\n\n- \n", "supporting": [ "mod_wrangle_files" ], diff --git a/mod_wrangle.qmd b/mod_wrangle.qmd index 8329ee5..0a0d647 100644 --- a/mod_wrangle.qmd +++ b/mod_wrangle.qmd @@ -190,9 +190,10 @@ If you only want your conditional to support two outcomes (as in an either/or st ```{r ifelse} # Make a new colum with an either/or conditional pie_crab_v2 <- pie_crab %>% - dplyr::mutate(size_category = ifelse(test = (size >= 15), + dplyr::mutate(size_category = ifelse(test = (size >= 15), # <1> yes = "big", - no = "small")) # <1> + no = "small"), + .after = size) # Check structure of the resulting data str(pie_crab_v2) @@ -209,7 +210,8 @@ pie_crab_v2 <- pie_crab %>% size > 10 & size <= 15 ~ "small", size > 15 & size <= 20 ~ "big", size > 20 ~ "huge", - TRUE ~ "uncategorized")) # <2> + TRUE ~ "uncategorized"), # <2> + .after = size) # Check the results' structure str(pie_crab_v2) @@ -219,15 +221,41 @@ str(pie_crab_v2) Note that you can use functions like this one when you do have an either/or conditional if you prefer this format. -### Custom Functions +### Uniting / Separating Columns +Sometimes one column has multiple pieces of information that you'd like to consider separately. A date column is a common example of this because particular months are always in a given season regardless of the specific day or year. So, it can be useful to break a complete date (i.e., year/month/day) into its component bits to be better able to access those pieces of information. +```{r separate-wider-delim} +# Split date into each piece of temporal info +pie_crab_v3 <- pie_crab_v2 %>% + tidyr::separate_wider_delim(cols = date, + delim = "-", # <1> + names = c("year", "month", "day"), + cols_remove = TRUE) # <2> +# Check that out +str(pie_crab_v3) +``` +1. 'delim' is short for "delimiter" which we covered in the Reproducibility module +2. This argument specifies whether to remove the original column when making the new columns -### Uniting / Separating Columns +While breaking apart a column's contents can be useful, it can also be helpful to combine the contents of several columns! + +```{r unite} +# Re-combine data information back into date +pie_crab_v4 <- pie_crab_v3 %>% + tidyr::unite(col = "date", + sep = "/", # <1> + year:day, + remove = FALSE) # <2> +# Structure check +str(pie_crab_v4) +``` +1. This is equivalent to the 'delim' argument in the previous function +2. Comparable to the 'cols_remove' argument in the previous function -`tidyr::separate_wider_delim` +Note in this output how despite re-combining data information the column is listed as a character column! 
Simply combining or separating data is not always enough so you need to really lean into frequent data structure checks to be sure that your data are structured in the way that you want. ### Joining Data @@ -246,6 +274,10 @@ a.k.a. attaching data by columns +### Custom Functions + + +
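Picking up that last point about the re-united `date` column coming back as a character: a quick sketch of coercing it to a true date class and re-checking structure, assuming the `pie_crab_v4` object created in the chunks above.

```r
# Coerce the re-united character column back into a true Date class
pie_crab_v5 <- pie_crab_v4 %>% 
  dplyr::mutate(date = as.Date(date, format = "%Y/%m/%d"))

# Re-check the structure to confirm the coercion worked
str(pie_crab_v5$date)
```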
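The "Joining Data" stub names `dplyr::left_join` and `supportR::diff_check`, which suggests a workflow of checking key columns before attaching ancillary data. A hedged sketch of that pattern follows; the `site_info` table is invented for illustration and `diff_check`'s `old`/`new` arguments are assumed.

```r
# Invent a small ancillary table to attach to the crab data (hypothetical)
site_info <- data.frame(
  "site" = c("GTM", "PIE", "VCR"),
  "region" = c("Florida", "Massachusetts", "Virginia")
)

# Compare key columns first to spot sites that will not match in the join
supportR::diff_check(old = unique(pie_crab$site), new = unique(site_info$site))

# Attach the ancillary columns; crabs from unmatched sites get NA for 'region'
pie_crab_joined <- dplyr::left_join(x = pie_crab, y = site_info, by = "site")
```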
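Likewise, the numbered list under "Leveraging Data Shape" outlines a pivot-longer / operate / pivot-wider pattern. One possible rendering of those three steps with the crab data (the column choices are illustrative):

```r
# 1. Reshape the two temperature columns into one consolidated column
## A row ID guarantees each value can be uniquely identified when re-widening
crab_long <- pie_crab %>% 
  dplyr::mutate(crab_id = dplyr::row_number()) %>% 
  tidyr::pivot_longer(cols = c(air_temp, water_temp),
                      names_to = "temp_type", values_to = "temp_c")

# 2. Operate on the consolidated column once rather than once per column
crab_long <- crab_long %>% 
  dplyr::mutate(temp_f = (temp_c * 9/5) + 32)

# 3. Return to the original wide shape, now with Fahrenheit columns as well
crab_wide <- crab_long %>% 
  tidyr::pivot_wider(names_from = temp_type, values_from = c(temp_c, temp_f))
```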
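Finally, the relocated "Custom Functions" header is still empty; purely as a hedged placeholder for the sort of example it might eventually hold, here is a small function that removes the duplication of writing the same conversion twice.

```r
# Define a function once instead of repeating the same formula (illustrative)
c_to_f <- function(temp_c) {
  # Guard against malformed inputs before doing any math
  if (!is.numeric(temp_c)) stop("'temp_c' must be numeric")
  (temp_c * 9/5) + 32
}

# Apply it to both temperature columns without duplicating code
pie_crab$air_temp_f <- c_to_f(pie_crab$air_temp)
pie_crab$water_temp_f <- c_to_f(pie_crab$water_temp)
```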