feat: embedding Li's data diagram with some small added context
njlyon0 committed Sep 12, 2024
1 parent d89636d commit 59e84af
Showing 2 changed files with 13 additions and 0 deletions.
Binary file added images/image_data-stages.png
13 changes: 13 additions & 0 deletions mod_reproducibility.qmd
@@ -240,6 +240,19 @@ When your scripts are clear and reproducibly-written you will reap the following
3. Sharing methods for external result validation is more straightforward
4. In cases where you're developing a novel method or workflow, structuring your code in this way will increase the odds that someone outside of your team will adopt your strategy

### Code and the Stages of Data

You'll likely need a number of scripts to accomplish the different stages of preparing a synthesized dataset. All of these scripts together are often called a "workflow." Each script meets a specific need, and its outputs become the inputs of the next script. These intermediary data products are sometimes useful in and of themselves and tend to occur at predictable points that exist in most code workflows.

Raw data will be parsed into cleaned data--often using idiosyncratic or dataset-specific scripts--which is then processed into standardized data that can be further parsed into published data products. Because this process results in potentially _many_ scripts, **coding reproducibly is vital to making this workflow intuitive and easy to maintain.**

You don't necessarily need to follow all of the guidelines described below, but in general, the more of these guidelines you follow, the easier it will be to make needed edits, onboard new team members, maintain the workflow in the long term, and generally maximize the value of your work to yourself and others!

<p align="center">
<img src="images/image_data-stages.png" alt="Diagram depicting how raw data is transformed to cleaned data, then standardized data, and finally to published data products by a set of scripts between each 'type' of data" width="90%"/>
<figcaption>Diagram of data stages from raw data to published products. Credit: Li Kui</figcaption>
</p>
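
To make the "outputs become inputs" idea concrete, here is a minimal Python sketch of the four data stages. It is an illustration under stated assumptions, not a prescribed implementation: the column names (`site`, `temp_f`), the sample values, and the functions `clean()`, `standardize()`, and `publish()` are all hypothetical, and in a real workflow each stage would typically live in its own script that reads and writes files.

```python
# Hypothetical raw data: inconsistent site labels and a missing value
raw = [
    {"site": " A ", "temp_f": "50"},
    {"site": "a", "temp_f": "59"},
    {"site": "B", "temp_f": ""},
    {"site": "B", "temp_f": "68"},
]


def clean(raw_rows):
    """Dataset-specific cleaning: trim whitespace and drop rows with no value."""
    cleaned = []
    for row in raw_rows:
        value = row["temp_f"].strip()
        if value:
            cleaned.append({"site": row["site"].strip(), "temp_f": float(value)})
    return cleaned


def standardize(cleaned_rows):
    """Harmonize labels and units so this dataset can be combined with others."""
    return [
        {"site": row["site"].lower(), "temp_c": (row["temp_f"] - 32) * 5 / 9}
        for row in cleaned_rows
    ]


def publish(standard_rows):
    """Summarize standardized data into a product: mean temperature per site."""
    by_site = {}
    for row in standard_rows:
        by_site.setdefault(row["site"], []).append(row["temp_c"])
    return {site: sum(vals) / len(vals) for site, vals in by_site.items()}


# Each stage's output is the next stage's input
cleaned = clean(raw)
standardized = standardize(cleaned)
product = publish(standardized)
```

Keeping each intermediate object (`cleaned`, `standardized`) around, rather than chaining everything into one step, mirrors the intermediary data products shown in the diagram.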

### Packages, Namespacing, and Software Versions

An under-appreciated facet of reproducible coding is a record of what code packages are used in a particular script _and_ the version number of those packages. Packages evolve over time and code that worked when using one version of a given package may not work for future versions of that same package. Perpetually updating your code to work with the latest package versions **is not sustainable** but recording key information can help users set up the code environment that does work for your project.
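
As one possible way to record that information, the sketch below shows the idea in Python using the standard library's `importlib.metadata`. The helper name `record_versions()` and the package names in the usage line are hypothetical; the point is only that a script can capture its own language and package versions alongside its outputs.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def record_versions(package_names):
    """Return a dict of the Python version plus each package's installed version.

    Packages that are not installed are recorded as None rather than raising.
    """
    versions = {"python": sys.version.split()[0]}
    for name in package_names:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


# Hypothetical usage: list whichever packages your script actually imports
snapshot = record_versions(["numpy", "pandas"])
```

Saving a snapshot like this next to your data products gives future users a starting point for rebuilding an environment in which the code is known to run.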