diff --git a/images/image_data-stages.png b/images/image_data-stages.png
new file mode 100644
index 0000000..3a2739c
Binary files /dev/null and b/images/image_data-stages.png differ
diff --git a/mod_reproducibility.qmd b/mod_reproducibility.qmd
index 218b84a..71b5ba0 100644
--- a/mod_reproducibility.qmd
+++ b/mod_reproducibility.qmd
@@ -240,6 +240,19 @@ When your scripts are clear and reproducibly-written you will reap the following
 3. Sharing methods for external result validation is more straightforward
 4. In cases where you're developing a novel method or workflow, structuring your code in this way will increase the odds that someone outside of your team will adopt your strategy
 
+### Code and the Stages of Data
+
+You'll likely need a number of scripts to accomplish the different stages of preparing a synthesized dataset. All of these scripts together are often called a "workflow." Each script will meet a specific need and its outputs will be the inputs of the next script. These intermediary data products are sometimes useful in and of themselves and tend to occur at predictable points that exist in most code workflows.
+
+Raw data will be parsed into cleaned data--often using idiosyncratic or dataset-specific scripts--which is then processed into standardized data that can be further parsed into published data products. Because this process results in potentially _many_ scripts, **coding reproducibly is vital to making this workflow intuitive and easy to maintain.**
+
+You don't necessarily need to follow all of the guidelines described below, but in general the more of these guidelines you follow, the easier it will be to make needed edits, onboard new team members, maintain the workflow in the long term, and generally maximize the value of your work to yourself and others!
+

+<figure>
+<img src="images/image_data-stages.png" alt="Diagram depicting how raw data is transformed to cleaned data, then standardized data, and finally to published data products by a set of scripts between each 'type' of data">
+<figcaption>Diagram of data stages from raw data to published products. Credit: Li Kui</figcaption>
+</figure>
+
 ### Packages, Namespacing, and Software Versions
 
 An under-appreciated facet of reproducible coding is a record of what code packages are used in a particular script _and_ the version number of those packages. Packages evolve over time and code that worked when using one version of a given package may not work for future versions of that same package. Perpetually updating your code to work with the latest package versions **is not sustainable** but recording key information can help users set up the code environment that does work for your project.
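
To make "recording key information" concrete, here is a minimal sketch of saving package versions alongside a script. It is written in Python purely for illustration (the module itself may use a different language), and the function name `record_versions` and the example package list are hypothetical rather than part of this change. It relies only on the standard library's `importlib.metadata`.

```python
import json
import platform
from importlib.metadata import PackageNotFoundError, version


def record_versions(packages):
    """Collect the interpreter version and the installed version of each package."""
    record = {"python": platform.python_version(), "packages": {}}
    for name in packages:
        try:
            record["packages"][name] = version(name)
        except PackageNotFoundError:
            # Package is not installed here; keep the gap visible instead of hiding it
            record["packages"][name] = None
    return record


# Hypothetical package list; swap in whatever your script actually imports
print(json.dumps(record_versions(["numpy", "pandas"]), indent=2))
```

Writing this record to a file next to the script's outputs gives future users a starting point for rebuilding an environment that works, without forcing the code itself to track the newest releases.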