diff --git a/.nojekyll b/.nojekyll index 1852650..1e7b34d 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -fd1fedfc \ No newline at end of file +fe2effe4 \ No newline at end of file diff --git a/mod_reproducibility.html b/mod_reproducibility.html index 2a108e6..28eb5bb 100644 --- a/mod_reproducibility.html +++ b/mod_reproducibility.html @@ -378,13 +378,20 @@

Lego Activity

Project Organization & Documentation

-

Much of the popular conversation around reproducibility centers on reproducibility as it pertains to code. That is definitely an important facet but before we write even a single line it is critically important that we first discuss what factors go into project-wide reproducibility. “Perfect” code in a project that isn’t structured thoughtfully can still result in a synthesis project that is not reproducible while even “bad” code can be made more intelligible when it is placed in a well-documented/organized project!

+

Much of the popular conversation around reproducibility centers on reproducibility as it pertains to code. That is definitely an important facet but before we write even a single line it is vital to consider project-wide reproducibility. “Perfect” code in a project that isn’t structured thoughtfully can still result in a project that isn’t reproducible. On the other hand, “bad” code can be made more intelligible when it is placed in a well-documented/organized project!

Folder Structure

-

The simplest and often most effective way of beginning a reproducible project is adopting (and sticking to) a good file organization system. There is no single “best” way of organizing your projects’ files as long as you are consistent. Consistency allows those navigating your system to deduce where particular files are likely to be without having in-depth knowledge of the entire suite of materials.

One stick figure looks in despair at another's computer where many badly named files are present. At the bottom text reads 'protip: never look in someone else's documents folder'

-

To begin, it is best to have a single folder for each project. This makes it simple to find the project’s inputs and outputs and also makes collaboration and documentation much cleaner. Later in your project’s life cycle, this ‘one folder’ approach will also make it easier to share your project with external reviewers or new team members. For researchers used to working alone there can be a temptation to think about your leadership of a project as the fundamental unit rather than the individual projects’ scopes. This method works fine when working alone but greatly increases the difficulty of communication and co-working in projects led by teams. RStudio (the primary Integrated Development Environment for R) and most version control systems assume that each project’s materials will be placed in a single folder and either of these systems can confer significant benefits to your work (well worth any potential reorganization difficulty).

-

Within your project folder, it is valuable to structure your folders and files hierarchically. Having a folder with dozens of mixed file types of various purposes that may be either inputs or outputs is cumbersome to document and difficult to navigate. If instead you adopt a system of sub-folders that group files based on purpose and/or source, engagement becomes much simpler. You need not use an intricate web of sub-folders either; often just a single layer of these sub-folders provides sufficient structure to meet your project’s organizational needs.

+

The simplest way of beginning a reproducible project is adopting a good file organization system. There is no single “best” way of organizing your projects’ files as long as you are consistent. Consistency will make your system–whatever that consists of–understandable to others.

+

Here are some rules to keep in mind as you decide how to organize your project:

+

1. Use one folder per project

Keeping all inputs, outputs, and documentation in a single folder makes it easier to collaborate and share all project materials. Also, most programming applications (RStudio, VS Code, etc.) work best when all needed files are in the same folder.

+

2. Organize content with sub-folders

Putting files that share a purpose or source into logical sub-folders is a great idea! This makes it easy to figure out where to put new content and reduces the effort of documenting project organization. Don’t feel like you need to use an intricate web of sub-folders either! Just one level of sub-folders is enough for many projects.
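
The two rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sub-folder names (`data_raw`, `data_processed`, `scripts`, `outputs`) and the function name `create_project_skeleton` are hypothetical choices, not a layout prescribed by this module.

```python
from pathlib import Path
import tempfile

# Hypothetical sub-folder names; pick whatever grouping suits your project.
SUBFOLDERS = ("data_raw", "data_processed", "scripts", "outputs")


def create_project_skeleton(parent: Path, project_name: str) -> Path:
    """Create one project folder containing a single level of sub-folders."""
    project_dir = parent / project_name
    for name in SUBFOLDERS:
        # parents=True also creates the project folder on the first pass;
        # exist_ok=True makes re-running the function harmless.
        (project_dir / name).mkdir(parents=True, exist_ok=True)
    return project_dir


if __name__ == "__main__":
    # Build the skeleton in a temporary location so the demo leaves no trace.
    with tempfile.TemporaryDirectory() as tmp:
        project = create_project_skeleton(Path(tmp), "my_project")
        for folder in sorted(p.name for p in project.iterdir()):
            print(folder)
```

Everything lives under a single project folder, and one level of sub-folders separates files by purpose. Any consistent naming would serve equally well.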

diff --git a/search.json b/search.json index f04b120..96151a6 100644 --- a/search.json +++ b/search.json @@ -381,7 +381,7 @@ "href": "mod_reproducibility.html#project-organization-documentation", "title": "Reproducibility Best Practices", "section": "Project Organization & Documentation", - "text": "Project Organization & Documentation\nMuch of the popular conversation around reproducibility centers on reproducibility as it pertains to code. That is definitely an important facet but before we write even a single line it is critically important that we first discuss what factors go into project-wide reproducibility. “Perfect” code in a project that isn’t structured thoughtfully can still result in a synthesis project that is not reproducible while even “bad” code can be made more intelligible when it is placed in a well-documented/organized project!\n\nFolder Structure\nThe simplest and often most effective way of beginning a reproducible project is adopting (and sticking to) a good file organization system. There is no single “best” way of organizing your projects’ files as long as you are consistent. Consistency allows those navigating your system to deduce where particular files are likely to be without having in-depth knowledge of the entire suite of materials.\n\nTo begin, it is best to have a single folder for each project. This makes it simple to find the project’s inputs and outputs and also makes collaboration and documentation much cleaner. Later in your project’s life cycle, this ‘one folder’ approach will also make it easier to share your project with external reviewers or new team members. For researchers used to working alone there can be a temptation to think about your leadership of a project as the fundamental unit rather than the individual projects’ scopes. This method works fine when working alone but greatly increases the difficulty of communication and co-working in projects led by teams. 
RStudio (the primary Integrated Development Environment for R) and most version control systems assume that each project’s materials will be placed in a single folder and either of these systems can confer significant benefits to your work (well worth any potential reorganization difficulty).\nWithin your project folder, it is valuable to structure your folders and files hierarchically. Having a folder with dozens of mixed file types of various purposes that may be either inputs or outputs is cumbersome to document and difficult to navigate. If instead you adopt a system of sub-folders that group files based on purpose and/or source, engagement becomes much simpler. You need not use an intricate web of sub-folders either; often just a single layer of these sub-folders provides sufficient structure to meet your project’s organizational needs.\n\n\n\n\n\n\nDiscussion: Folder Structure\n\n\n\nWith a partner discuss (some of) the following questions:\n\nHow do you typically organize your projects’ files?\nWhat benefits do you see of your current approach?\nWhat–if any–limitations to your system have you experienced?\nDo you think your structure would work well in a team environment?\n\nIf not, what changes might you make to better fit that context?\n\n\n\n\n\n\nFile Names\nBeyond the structure and degree of nestedness you adopt for your folders, your files can (and should) include a lot of helpful contextual information about themselves. An ideal file name should be very informative about that file’s contents, purpose, and relation to other project files. Some or all of that information may be reinforced by the folder(s) in which the file is placed, but the file name itself should also confer that information. 
This may feel redundant but if late in your project’s lifecycle you decide a different folder system is needed, information-dense file names will allow you to change file locations without excessive difficulty.\nYou should also consider how ‘machine readable’ your file names are. One fundamental way in which this changes users’ experience is how file management applications (e.g., Apple’s Finder) visually display files. By default files are typically sorted alphabetically and numerically. So, even if the script “wrangle.R” should be run first in your workflow, most file explorers would put that script last or at the bottom. If instead you changed its name to “01_wrangle.R” now it would likely be sorted towards the top and encountered earlier by those interested in your workflow. Notice too in that example that we have “zero padded” the script so that if we eventually had a tenth script file explorers would correctly sort it (“10…” would be before “1…” in most file sorting systems).\nYou should also avoid spaces and accented characters (e.g., é, ü, etc.) as some computers will not be able to recognize these characters. Windows operating systems in particular have a very difficult time parsing folder names with spaces (e.g., “raw data” versus “raw_data”). Using a mix of upper and lowercase letters can be effective when done carefully but also requires a lot of attention to detail on the part of those creating new files. It may be simplest to stick with all lowercase or all uppercase for your file names.\nBe consistent with any delimiters you use in file names! Two common ones are the hyphen (-) and underscore (_). If you use one instead of spaces, be sure to only use that one for that use-case rather than using them interchangeably. You may find it useful to use one delimiter to separate a type of information and then the other in lieu of spaces. 
For example, “fxn_calc-diversity.R” uses the prefix “fxn_” to indicate that the script contains a function while the words to the right of the underscore briefly describe the purpose of that function.\nIn that same vein, you may want to consider using “slugs” in your file names. Slugs are human-readable, unique pieces of file names that are shared between files and the outputs that they create. For example, the files created by “01_wrangle.R” could all begin with “01_” (the slug in this case). The benefit of this approach is that diagnosing strange outputs–or simply finding the source of a given file or graph–is a straightforward matter of looking for the matching slug.\n\n\nDocumentation\nDocumenting a project can feel like a Sisyphean task but it is often not as hard as one might imagine and well worth the effort! One simple practice you can adopt to dramatically improve the reproducibility of your project is to create a “README” file in the top-level of your project’s folder system. This file can be formatted however you’d like but generally READMEs should include (1) a project overview written in plain language, (2) a basic table of contents for the primary folders in your project folder, and (3) a brief description of the file naming scheme you’ve adopted for this project.\nYour project’s README becomes the ‘landing page’ for those navigating your repository and makes it easy for team members to know where documentation should go (in the README!). You may also choose to create a README file for some of the sub-folders of your project. This can be particularly valuable for your “data” folder(s) as it is an easy place to store data source/provenance information that might be overwhelming to include in the project-level README file.\nFinally, you should choose a place to keep track of ideas, conversations, and decisions about the project. 
While you can take notes on these topics on a piece of paper, adopting a digital equivalent is often helpful because you can much more easily search a lengthy document when it is machine readable. We will discuss GitHub during the Version Control module but GitHub offers something called Issues that can be a really effective place to record some of this information.\n\n\nOrganization Recommendations\nIf you integrate any of the concepts we’ve covered above you will find the reproducibility and transparency of your project will greatly increase. However, if you’d like additional recommendations we’ve assembled a non-exhaustive set of additional “best practices” that you may find helpful.\n\nNever Edit Raw Data\nFirst and foremost, it is critical that you never edit the raw data directly. If you do need to edit the raw data, use a script to make all needed edits and save the output of that script as a separate file. Editing the raw data directly without a script or using a script but overwriting the raw data are both incredibly risky operations because you create a file that “looks” like the raw data (and is likely documented as such) but differs from what others would have if they downloaded the ‘real’ raw data personally.\n\n\nSeparate Raw and Processed Data\nIn the same vein as the previous best practice, we recommend that you separate the raw and processed data into separate folders. This will make it easier to avoid accidental edits to the raw data and will make it clear what data are created by your project’s scripts; even if you choose not to adopt a file naming convention that would make this clear.\n\n\nQuarantine External Outputs\nThis can sound harsh, but it is often a good idea to “quarantine” outputs received from others until they can be carefully vetted. This is not at all to suggest that such contributions might be malicious! 
As you embrace more of the project organization recommendations we’ve described above outputs from others have more and more opportunities to diverge from the framework you establish. Quarantining inputs from others gives you a chance to rename files to be consistent with the rest of your project as well as make sure that the style and content of the code also match (e.g., use or exclusion of particular packages, comment frequency and content, etc.)", + "text": "Project Organization & Documentation\nMuch of the popular conversation around reproducibility centers on reproducibility as it pertains to code. That is definitely an important facet but before we write even a single line it is vital to consider project-wide reproducibility. “Perfect” code in a project that isn’t structured thoughtfully can still result in a project that isn’t reproducible. On the other hand, “bad” code can be made more intelligible when it is placed in a well-documented/organized project!\n\nFolder Structure\n\nThe simplest way of beginning a reproducible project is adopting a good file organization system. There is no single “best” way of organizing your projects’ files as long as you are consistent. Consistency will make your system–whatever that consists of–understandable to others.\nHere are some rules to keep in mind as you decide how to organize your project:\n\nUse one folder per project\n\nKeeping all inputs, outputs, and documentation in a single folder makes it easier to collaborate and share all project materials. Also, most programming applications (RStudio, VS Code, etc.) work best when all needed files are in the same folder.\n\nOrganize content with sub-folders\n\nPutting files that share a purpose or source into logical sub-folders is a great idea! This makes it easy to figure out where to put new content and reduces the effort of documenting project organization. Don’t feel like you need to use an intricate web of sub-folders either! 
Just one level of sub-folders is enough for many projects.\n\n\n\n\n\n\nDiscussion: Folder Structure\n\n\n\nWith a partner discuss (some of) the following questions:\n\nHow do you typically organize your projects’ files?\nWhat benefits do you see of your current approach?\nWhat–if any–limitations to your system have you experienced?\nDo you think your structure would work well in a team environment?\n\nIf not, what changes might you make to better fit that context?\n\n\n\n\n\n\nFile Names\nBeyond the structure and degree of nestedness you adopt for your folders, your files can (and should) include a lot of helpful contextual information about themselves. An ideal file name should be very informative about that file’s contents, purpose, and relation to other project files. Some or all of that information may be reinforced by the folder(s) in which the file is placed, but the file name itself should also confer that information. This may feel redundant but if late in your project’s lifecycle you decide a different folder system is needed, information-dense file names will allow you to change file locations without excessive difficulty.\nYou should also consider how ‘machine readable’ your file names are. One fundamental way in which this changes users’ experience is how file management applications (e.g., Apple’s Finder) visually display files. By default files are typically sorted alphabetically and numerically. So, even if the script “wrangle.R” should be run first in your workflow, most file explorers would put that script last or at the bottom. If instead you changed its name to “01_wrangle.R” now it would likely be sorted towards the top and encountered earlier by those interested in your workflow. 
Notice too in that example that we have “zero padded” the script so that if we eventually had a tenth script file explorers would correctly sort it (“10…” would be before “1…” in most file sorting systems).\nYou should also avoid spaces and accented characters (e.g., é, ü, etc.) as some computers will not be able to recognize these characters. Windows operating systems in particular have a very difficult time parsing folder names with spaces (e.g., “raw data” versus “raw_data”). Using a mix of upper and lowercase letters can be effective when done carefully but also requires a lot of attention to detail on the part of those creating new files. It may be simplest to stick with all lowercase or all uppercase for your file names.\nBe consistent with any delimiters you use in file names! Two common ones are the hyphen (-) and underscore (_). If you use one instead of spaces, be sure to only use that one for that use-case rather than using them interchangeably. You may find it useful to use one delimiter to separate a type of information and then the other in lieu of spaces. For example, “fxn_calc-diversity.R” uses the prefix “fxn_” to indicate that the script contains a function while the words to the right of the underscore briefly describe the purpose of that function.\nIn that same vein, you may want to consider using “slugs” in your file names. Slugs are human-readable, unique pieces of file names that are shared between files and the outputs that they create. For example, the files created by “01_wrangle.R” could all begin with “01_” (the slug in this case). The benefit of this approach is that diagnosing strange outputs–or simply finding the source of a given file or graph–is a straightforward matter of looking for the matching slug.\n\n\nDocumentation\nDocumenting a project can feel like a Sisyphean task but it is often not as hard as one might imagine and well worth the effort! 
One simple practice you can adopt to dramatically improve the reproducibility of your project is to create a “README” file in the top-level of your project’s folder system. This file can be formatted however you’d like but generally READMEs should include (1) a project overview written in plain language, (2) a basic table of contents for the primary folders in your project folder, and (3) a brief description of the file naming scheme you’ve adopted for this project.\nYour project’s README becomes the ‘landing page’ for those navigating your repository and makes it easy for team members to know where documentation should go (in the README!). You may also choose to create a README file for some of the sub-folders of your project. This can be particularly valuable for your “data” folder(s) as it is an easy place to store data source/provenance information that might be overwhelming to include in the project-level README file.\nFinally, you should choose a place to keep track of ideas, conversations, and decisions about the project. While you can take notes on these topics on a piece of paper, adopting a digital equivalent is often helpful because you can much more easily search a lengthy document when it is machine readable. We will discuss GitHub during the Version Control module but GitHub offers something called Issues that can be a really effective place to record some of this information.\n\n\nOrganization Recommendations\nIf you integrate any of the concepts we’ve covered above you will find the reproducibility and transparency of your project will greatly increase. However, if you’d like additional recommendations we’ve assembled a non-exhaustive set of additional “best practices” that you may find helpful.\n\nNever Edit Raw Data\nFirst and foremost, it is critical that you never edit the raw data directly. If you do need to edit the raw data, use a script to make all needed edits and save the output of that script as a separate file. 
Editing the raw data directly without a script or using a script but overwriting the raw data are both incredibly risky operations because you create a file that “looks” like the raw data (and is likely documented as such) but differs from what others would have if they downloaded the ‘real’ raw data personally.\n\n\nSeparate Raw and Processed Data\nIn the same vein as the previous best practice, we recommend that you separate the raw and processed data into separate folders. This will make it easier to avoid accidental edits to the raw data and will make it clear what data are created by your project’s scripts; even if you choose not to adopt a file naming convention that would make this clear.\n\n\nQuarantine External Outputs\nThis can sound harsh, but it is often a good idea to “quarantine” outputs received from others until they can be carefully vetted. This is not at all to suggest that such contributions might be malicious! As you embrace more of the project organization recommendations we’ve described above outputs from others have more and more opportunities to diverge from the framework you establish. 
Quarantining inputs from others gives you a chance to rename files to be consistent with the rest of your project as well as make sure that the style and content of the code also match (e.g., use or exclusion of particular packages, comment frequency and content, etc.)", "crumbs": [ "Quantitative Modules", "Reproducibility" diff --git a/sitemap.xml b/sitemap.xml index 92fef06..f8958f5 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,78 +2,78 @@ https://lter.github.io/ssecr/mod_version-control.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/mod_thinking.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/mod_wrangle.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/mod_credit.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/mod_findings.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/mod_data-disc.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/topic_interactivity.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/mod_facilitation.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/mod_data-viz.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/mod_team-sci.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/index.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/mod_reports.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/topic_spatial.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/mod_stats.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/mod_reproducibility.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z 
https://lter.github.io/ssecr/mod_spatial.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/mod_logic-models.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z https://lter.github.io/ssecr/CONTRIBUTING.html - 2024-02-22T16:58:34.666Z + 2024-02-22T17:59:03.586Z https://lter.github.io/ssecr/mod_project-mgmt.html - 2024-02-22T16:58:34.686Z + 2024-02-22T17:59:03.606Z