Commit b479e77

one sentence per line, minimal changes
e.g. - consistency
1 parent 5100a3d commit b479e77

1 file changed

+13
-7
lines changed

ten-simple-rules-dockerfiles.Rmd

Lines changed: 13 additions & 7 deletions
@@ -41,14 +41,16 @@ author:
 abstract: |
 Computational science has been greatly improved by the use of containers for packaging software and data dependencies.
 In a scholarly context, the main drivers for using these containers are transparency and support of reproducibility; in turn, a workflow's reproducibility can be greatly affected by the choices that are made with respect to building containers.
-In many cases, the build process for the container's image is created from instructions provided in a `Dockerfile` format. In support of this approach, we present a set of rules to help researchers write understandable `Dockerfile`s for typical data science workflows.
+In many cases, the build process for the container's image is created from instructions provided in a `Dockerfile` format.
+In support of this approach, we present a set of rules to help researchers write understandable `Dockerfile`s for typical data science workflows.
 By following the rules in this article, researchers can create containers suitable for sharing with fellow scientists, for including in scholarly communication such as education or scientific papers, and for effective and sustainable personal workflows.
 author_summary: |
 Computers and algorithms are ubiquitous in research.
 Therefore, defining the computing environment, i.e., the body of all software used directly or indirectly by a researcher, is important, because it allows other researchers to recreate the environment to understand, inspect, and reproduce an analysis.
 A helpful abstraction for capturing the computing environment is a _container_, whereby a container is created from a set of instructions in a recipe.
 For the most common containerisation software, Docker, this recipe is called a Dockerfile.
-We believe that in a scientific context, researchers should follow specific practices for writing a Dockerfile. These practices might be somewhat different from the practices of generic software developers in that researchers often need to focus on transparency and understandability rather than performance considerations.
+We believe that in a scientific context, researchers should follow specific practices for writing a Dockerfile.
+These practices might be somewhat different from the practices of generic software developers in that researchers often need to focus on transparency and understandability rather than performance considerations.
 The rules presented here are intended to help researchers, especially newcomers to containerisation, leverage containers for open and effective scholarly communication and collaboration while avoiding the pitfalls that are especially irksome in a research lifecycle.
 The recommendations cover a deliberate approach to Dockerfile creation, formatting and style, documentation, and habits for using containers.
 bibliography: bibliography.bib
@@ -87,7 +89,9 @@ Approaches such as containerisation are needed to support computational research
 
 Containerisation helps provide instructions for packaging the building blocks of computer-based research (i.e., code, data, documentation, and the computing environment).
 Specifically, containers are built from plain text files that represent a human- _and_ machine-readable recipe for creating the computing environment and interacting with data.
-By providing this recipe, authors of scientific articles greatly improve their work's level of documentation, transparency, and reusability. This is an important part of common practice for scientific computing [@wilson_best_2014; @wilson_good_2017]. An overall goal of these practices is to ensure that both the author and others are able to reproduce and extend an analysis workflow.
+By providing this recipe, authors of scientific articles greatly improve their work's level of documentation, transparency, and reusability.
+This is an important part of common practice for scientific computing [@wilson_best_2014; @wilson_good_2017].
+An overall goal of these practices is to ensure that both the author and others are able to reproduce and extend an analysis workflow.
 The containers built from these recipes are portable encapsulated snapshots of a specific computing environment that are both more lightweight and transparent than virtual machines.
 Such containers have been demonstrated for capturing scientific notebooks [@rule_ten_2019] and reproducible workflows [@sandve_ten_2013].
 
@@ -101,7 +105,7 @@ knitr::include_graphics("summary.png")
 # Prerequisites & scope
 
 To start with, we assume the existence of a scripted scientific workflow, i.e. you can, at least at a certain point in time, execute the full process with a fixed set of commands, for example `make prepare_data` followed by `Rscript analysis.R`, or only `python3 my-workflow.py`.
-To maximise reach, we assume that containers that you eventually share with others can only run open source software; tools like Mathematica and Matlab are out of scope for this example.
+To maximise reach, we assume that containers, which you eventually share with others, can only run open source software; tools like Mathematica and Matlab are out of scope for this example.
 A workflow that does not support scripted execution is also out of scope for reproducible research, as it does not fit well with containerisation.
 Furthermore, workflows interacting with many petabytes of data and executed in high-performance computing (HPC) infrastructures are out of scope.
 Using such HPC job managers or cloud infrastructures would require a collection of "Ten Simple Rules" articles in their own right.
@@ -132,7 +136,8 @@ Docker [@wikipedia_contributors_docker_2019] is a container technology that has
 Containers are distinct from virtual machines or hypervisors, as they do not emulate hardware or operating system kernels and hence do not require the same system resources.
 Several solutions for facilitating reproducible research are built on top of containers [@brinckman_computing_2018; @code_ocean_2019; @simko_reana_2019; @jupyter_binder_2018; @nust_opening_2017], but these solutions intentionally hide most of the complexity from the researcher.
 
-To create Docker containers for specific workflows, we write text files that follow a particular format called `Dockerfile` [@docker_inc_dockerfile_2019]. A `Dockerfile` is a machine- _and_ human-readable recipe for building images, comparable to a `Makefile` [@wikipedia_contributors_make_2019].
+To create Docker containers for specific workflows, we write text files that follow a particular format called `Dockerfile` [@docker_inc_dockerfile_2019].
+A `Dockerfile` is a machine- _and_ human-readable recipe for building images, comparable to a `Makefile` [@wikipedia_contributors_make_2019].
 Here, container images include the application, e.g., the programming language interpreter needed to run a workflow, and the system libraries required by an application to run.
 Thus, a `Dockerfile` consists of a sequence of instructions to copy files and install software.
 Each instruction adds a layer to the image, which can be cached across image builds for minimizing build and download times.
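
Editorial illustration (not part of the commit): the description of a `Dockerfile` as a sequence of instructions that copy files and install software can be sketched as follows. This is a minimal sketch under stated assumptions: the base image tag, the `requirements.txt` file, and the `/workflow` directory are hypothetical; only the script name `my-workflow.py` is borrowed from the manuscript's own example.

```dockerfile
# Hypothetical sketch; the image tag, paths, and file names are assumptions.

# Start from a base image that provides the language interpreter.
FROM python:3.11-slim

# Set the working directory for the instructions that follow.
WORKDIR /workflow

# Copy only the dependency list first, so that the installation step below
# can be reused from the build cache as long as this file is unchanged.
COPY requirements.txt .

# Install the software the workflow depends on.
RUN pip install --no-cache-dir -r requirements.txt

# Copy the workflow script last, because it tends to change most often.
COPY my-workflow.py .

# Default command executed when a container is started from the image.
CMD ["python3", "my-workflow.py"]
```

Placing rarely changing instructions before frequently changing ones is one way to benefit from layer caching: editing `my-workflow.py` would then only invalidate the final `COPY` step, not the dependency installation.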
@@ -554,7 +559,8 @@ Mounting these files is preferable to using the `ADD`/`COPY` instructions in the
 If you want to add local files to the container, (and do not need [`ADD`'s extra features](https://docs.docker.com/engine/reference/builder/#add)) we recommend `COPY` because it is simpler and explicit.
 Volumes are useful for persisting changes across runs of a container and offer faster file I/O compared to other mounting methods (particularly useful with databases for example).
 However they are less suitable for reproducibility, since these changes exist within the image (making them less in line with treating containers as ephemeral see \ruleref{rule:usage}) and are not so easy to access or place under version control.
-Unless specific features are needed, bind mounts are preferable to [storage volumes](https://docs.docker.com/storage/volumes/) since the contents are directly accessible from both the container and the host. The files can also be more easily included in the same repository.
+Unless specific features are needed, bind mounts are preferable to [storage volumes](https://docs.docker.com/storage/volumes/) since the contents are directly accessible from both the container and the host.
+The files can also be more easily included in the same repository.
 
 Storing _data files_ outside of the container allows handling of very large or sensitive datasets, e.g., proprietary data or private information.
 Do not include such data in an image!
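
Editorial illustration (not part of the commit): the recommendation to prefer `COPY` for local files and to keep data outside the image might look like the sketch below. The image name `my-workflow`, the `data` directory, and the base image tag are hypothetical; the bind mount itself is requested at run time, so it appears here only as a comment.

```dockerfile
# Hypothetical sketch; image name, tag, and paths are assumptions.
FROM python:3.11-slim
WORKDIR /workflow

# COPY is sufficient for plain local files; none of ADD's extra features
# are needed here.
COPY my-workflow.py .

# The data files are deliberately NOT copied into the image.
# They can be bind-mounted from the host when the container is started,
# for example (assuming the image was built and tagged as "my-workflow"):
#
#   docker run --rm -v "$(pwd)/data:/workflow/data" my-workflow
#
# The contents of ./data then stay accessible from both host and container.
CMD ["python3", "my-workflow.py"]
```

With this layout, large or sensitive datasets never become part of an image layer, and the same image can be run against different data directories.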
@@ -753,7 +759,7 @@ Third, you can export the image to file and deposit it in a public data reposito
 You should include instructions for how to import and run the workflow based on the image archive and add your own image tags using semantic versioning (see \ruleref{rule:base}) for clarity.
 Depositing the image next to other project files, i.e., data, code, and the used `Dockerfile`, in a public repository makes them likely to be preserved, but it is highly unlikely that over time you will be able to recreate it precisely from the accompanying `Dockerfile`.
 Publishing the image and the contained metadata therein (e.g., the Docker version used) may even allow future science historians to emulate the Docker environment.
-Sharing the actual image via a registry and a version-controlled `Dockerfile` together allows you to freely experiment and continue developing your workflow and keep the image up to date, e.g. updating versions of pinned dependencies (see \ruleref{rule:pinning}) and regular image building (see above).
+Sharing the actual image via a registry and a version-controlled `Dockerfile` together allows you to freely experiment and continue developing your workflow and keep the image up to date, e.g., updating versions of pinned dependencies (see \ruleref{rule:pinning}) and regular image building (see above).
 
 Finally, for a sanity check and to foster even higher trust in the stability and documentation of your project, you can ask a colleague or community member to be your code copilot (see [https://twitter.com/Code_Copilot](https://twitter.com/Code_Copilot)) to interact with your workflow container on a machine of their own.
 You can do this shortly before submitting your reproducible workflow for peer-review, so you are well positioned for the future of scholarly communication and open science, where these may be standard practices required for publication [@eglen_codecheck_2019; @chen_open_2019; @schonbrodt_training_2019; @eglen_recent_2018].
