diff --git a/.nojekyll b/.nojekyll
index 95dc03c..382f55d 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-8c0d72f4
\ No newline at end of file
+27a79fc9
\ No newline at end of file
diff --git a/mod_reproducibility.html b/mod_reproducibility.html
index 0510067..c5d1ddc 100644
--- a/mod_reproducibility.html
+++ b/mod_reproducibility.html
@@ -278,7 +278,14 @@
The simplest and often most effective way of beginning a reproducible project is adopting (and sticking to) a good file organization system. There is no single “best” way of organizing your projects’ files as long as you are consistent. Consistency allows those navigating your system to deduce where particular files are likely to be without having in-depth knowledge of the entire suite of materials.
- +To begin, it is best to have a single folder for each project. This makes it simple to find the project’s inputs and outputs and also makes collaboration and documentation much cleaner. Later in your project’s life cycle, this ‘one folder’ approach will also make it easier to share your project with external reviewers or new team members. Researchers used to working alone may be tempted to treat everything they lead as the fundamental unit of organization, rather than the scope of each individual project. That approach works fine when working alone but greatly increases the difficulty of communication and co-working in projects led by teams. RStudio (the primary Integrated Development Environment for R) and most version control systems assume that each project’s materials will be placed in a single folder, and either of these systems can confer significant benefits to your work (well worth any potential reorganization difficulty).
Within your project folder, it is valuable to structure your folders and files hierarchically. Having a folder with dozens of mixed file types of various purposes that may be either inputs or outputs is cumbersome to document and difficult to navigate. If instead you adopt a system of sub-folders that group files based on purpose and/or source, navigating and documenting the project becomes much simpler. You need not use an intricate web of sub-folders either; often just a single layer of these sub-folders provides sufficient structure to meet your project’s organizational needs.
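As a minimal sketch of the single-layer structure described above, the R code below creates one project folder containing a few purpose-based sub-folders. The project name and the sub-folder names (data_raw, data_tidy, code, results) are hypothetical, and the project is built inside a temporary directory so the sketch is safe to run anywhere.

```r
# Hypothetical project location; a temporary directory keeps this sketch harmless
project_dir <- file.path(tempdir(), "my_project")

# One layer of purpose-based sub-folders (names are illustrative assumptions)
sub_folders <- c("data_raw", "data_tidy", "code", "results")

# Create the project folder and each sub-folder inside it
for (folder in sub_folders) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List the sub-folders that now exist directly inside the project folder
created <- list.dirs(project_dir, full.names = FALSE, recursive = FALSE)
print(created)
```

Any naming scheme would do equally well here; what matters is that every project on the team uses the same one.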
One of the first things every script should do is explicitly load all of the libraries that the script needs (these are called “dependencies”). Scripts that don’t specify which libraries are needed are unlikely to run on anyone else’s computer. Unfortunately, many R packages need to be installed by each user before they can be loaded with the library function. You may find it simpler to use the librarian package, which automatically detects and installs needed packages if they are not already present. Note that users would still need to install librarian itself!
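The top of a script that follows this advice might look something like the sketch below. Only stats and utils (which ship with R) are actually loaded so that the example runs without installing anything. The dplyr and ggplot2 names in the commented lines are purely illustrative, and the librarian line assumes librarian has already been installed.

```r
# Load every package this script depends on, at the very top of the script.
# stats and utils ship with R, so these calls work without any installation.
library(stats)
library(utils)

# Packages that do not ship with R must be installed once by each user
# before library() can load them (dplyr is only an illustration):
# install.packages("dplyr")
# library(dplyr)

# Alternative (assumes librarian itself is already installed):
# shelf() installs any missing packages and then loads them.
# librarian::shelf(dplyr, ggplot2)

# Confirm that the packages loaded above are attached to the session
attached <- c("package:stats", "package:utils") %in% search()
print(attached)
```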
It is also strongly recommended to “namespace” functions everywhere you use them. In R this is technically optional (in Python it is the default for imported modules) but it is a really good practice to adopt, particularly for functions that appear in multiple packages under the same name but perform very different operations depending on their source. Namespacing in R is done by adding the package name and two colons before the function name (e.g., dplyr::mutate). This prevents accidental use of functions from the ‘wrong’ package for a given context.
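A short sketch of namespaced calls follows. It uses only functions from packages that ship with R so it runs as written. The rainfall_mm vector is made-up illustration data, and the dplyr::filter function is mentioned only in a comment.

```r
# A small numeric vector to work with (made-up values)
rainfall_mm <- c(12, 0, 7, 31, 5)

# Namespaced calls: the package name, two colons, then the function name
mean_rain <- base::mean(rainfall_mm)
median_rain <- stats::median(rainfall_mm)
first_two <- utils::head(rainfall_mm, n = 2)

print(mean_rain)
print(median_rain)
print(first_two)

# Why this matters: stats::filter() (linear filtering of a series) and
# dplyr::filter() (keeping rows of a data frame) share a name but do
# different things. Writing the prefix removes any ambiguity about which
# one runs, regardless of which packages happen to be loaded.
moving_avg <- stats::filter(rainfall_mm, filter = rep(1 / 3, 3))
print(moving_avg)
```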
You may also need to consider the version of R and the versions of the packages that you’re using. The sessionInfo function (from the utils package, loaded into R by default) is a good way of capturing some of this information but it is relatively high level and lacks sufficient detail for many contexts. For a more complete record, consider using the renv or packrat packages.
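One possible way to keep this information alongside a project is sketched below. The output file name session_info.txt is a hypothetical choice, and the file is written to a temporary directory so the sketch is harmless. The renv calls are left as comments because they assume renv is installed and would modify the project they are run in.

```r
# Capture the R version and the versions of loaded packages
session <- utils::sessionInfo()

# The R version string is one of the recorded details
r_version <- session$R.version$version.string
print(r_version)

# Save the full report so others can compare their setup against yours.
# The file name is a hypothetical choice; tempdir() keeps this sketch harmless.
report_path <- file.path(tempdir(), "session_info.txt")
writeLines(utils::capture.output(print(session)), report_path)

# For a fuller, restorable record, renv can be used instead (not run here;
# assumes the renv package is installed):
# renv::init()      # set up a project-specific package library
# renv::snapshot()  # record exact package versions in a lockfile
# renv::restore()   # reinstall those versions on another machine
```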
Every change to the data between the initial raw data and the finished product should be scripted. The ideal would be that you could hand someone your code and the starting data and have them be able to perfectly retrace your steps. This is not possible if you make unscripted modifications to the data at any point!
+You may wish to break your scripted workflow into separate, modular files for ease of maintenance and/or revision. This is a good practice so long as each file fits clearly into a logical/thematic group (e.g., data cleaning versus analysis).
+Finally, your code should never use absolute file paths. Absolute file paths are those that begin at the root of your entire computer (“C:…” on Windows and “/…” on Mac; paths using the home folder shortcut “~…” have the same problem). Such paths are inherently not reproducible as the odds of anyone else having the exact same absolute file path are extremely slim. Instead, use relative file paths that begin at the project folder. These are transferable among users. You can even use R’s file.path function to join folder names with a separator that works on every operating system, making it easier to collaborate across platforms! Note in the above figure from Trisovic et al. (2022) that many scripts that set the working directory manually had errors until that bit was removed. Avoid setting the working directory explicitly and instead structure your project such that relative paths within the project folder will always succeed.
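The contrast between the two kinds of path can be sketched as follows. The folder and file names (data_raw, site_counts.csv) are hypothetical, and the absolute paths appear only in comments as examples of what to avoid.

```r
# Build a relative path from folder names; file.path() inserts the separators
# (the folder and file names here are hypothetical)
data_path <- file.path("data_raw", "site_counts.csv")
print(data_path)

# A relative path has no drive letter, leading slash, or leading tilde
is_relative <- !grepl("^([A-Za-z]:|/|~)", data_path)
print(is_relative)

# Avoid: paths that only exist on one person's computer
# setwd("C:/Users/...")                     # absolute, and sets the working directory
# read.csv("C:/Users/.../site_counts.csv")  # absolute
```

When the script is run from the project folder (for example by opening the project in RStudio), a path like the one stored in data_path resolves the same way for every collaborator.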
When it comes to code style, the same ‘rule of thumb’ that applied to project organization applies here: virtually any system will work so long as you (and your team) are consistent! That said, there are a few principles worth adopting if you have not already done so.
+1. Use concise and descriptive object names
+It can be difficult to balance these two imperatives but short object names are easier to re-type and visually track through a script. Descriptive object names on the other hand are useful because they help orient people reading the script to what the object contains.
+2. Don’t be afraid of space!
+Scripts are free to write regardless of the number of lines, so do not feel as though there is a strict character limit you need to keep in mind. Cramped code is difficult to read and thus can be challenging to share with others or debug on your own. Inserting an empty line between lines of code can help break up sections, and putting spaces before and after operators can make reading single lines much simpler.
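To illustrate the two principles above, the sketch below writes the same small computation twice: once cramped with uninformative names, and once with short descriptive names and generous spacing. The object names and values are made up for illustration.

```r
# Cramped, with uninformative names (works, but hard to read):
x<-c(4.1,3.8,5.0);y<-mean(x)*2.54

# Same computation with concise, descriptive names, spaces around operators,
# and an empty line between logical steps
leaf_in <- c(4.1, 3.8, 5.0)

mean_leaf_cm <- mean(leaf_in) * 2.54

print(mean_leaf_cm)
```

Both versions give the same result; only the second tells a reader that leaf lengths in inches are being averaged and converted to centimeters.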