This repository is an example of genomics data analysis with open science in mind. It can be used as a starting point. The sections below present best practices to make your data analysis more reproducible.
Write a Bash script that downloads any resource files, such as reference genomes and databases, to document file versions and locations. The script also automates the preparation of the analysis environment.
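As a rough illustration, the core of such a script could look like the sketch below. It is not the actual `download_files.sh` from this repository: the `fetch` helper, the manifest layout, and the placeholder URL are all assumptions, and it relies on `wget` and `sha256sum` being installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: download URL into DEST_DIR unless the file already
# exists, then append "<sha256><TAB><path><TAB><url>" to MANIFEST so that
# file versions and locations are recorded.
fetch() {
  local url="$1" dest_dir="$2" manifest="$3"
  local file
  file="${dest_dir}/$(basename "$url")"
  mkdir -p "$dest_dir"
  if [ ! -f "$file" ]; then
    # Download to a temporary name so a failed transfer is not mistaken
    # for a complete file on the next run.
    wget --quiet --output-document="${file}.part" "$url"
    mv "${file}.part" "$file"
  fi
  printf '%s\t%s\t%s\n' "$(sha256sum "$file" | cut -d' ' -f1)" "$file" "$url" >> "$manifest"
}

# Example call; the URL is a placeholder, not a real resource:
# fetch "https://example.org/reference/genome-v1.fa.gz" resources resources/MANIFEST.tsv
```

Committing the manifest alongside the script gives readers a record of exactly which files the analysis used.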
Run the script:

    bash download_files.sh

Provide files describing data processing workflows (WDL, CWL, Nextflow) and their input files (normally JSON). This example uses the QA workflow, version 1.1.0, available at https://github.com/labbcb/workflows. Provide the command that executes the workflow.

    wget https://github.com/broadinstitute/cromwell/releases/download/54/cromwell-54.jar
    java -jar cromwell-54.jar run --options options.json --inputs inputs/qc.inputs.json workflows/qc.wdl

Use a .gitignore file to skip downloaded or generated files that you do not want to keep in the repository. Examples: input data, Cromwell, and temporary files.
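One way to populate the .gitignore is sketched below. The helper names are hypothetical, `data/` is an assumed name for the input data directory, and `cromwell-executions/` and `cromwell-workflow-logs/` are the directories Cromwell writes by default, which your `options.json` may change.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Append PATTERN to FILE only if that exact line is not already present.
ignore() {
  local pattern="$1" file="${2:-.gitignore}"
  touch "$file"
  grep -qxF -- "$pattern" "$file" || printf '%s\n' "$pattern" >> "$file"
}

# Add the example patterns to FILE (defaults to ./.gitignore).
write_gitignore() {
  local file="${1:-.gitignore}"
  ignore 'data/' "$file"                   # input data (assumed directory name)
  ignore 'cromwell-*.jar' "$file"          # downloaded Cromwell release
  ignore 'cromwell-executions/' "$file"    # Cromwell run directories (default name)
  ignore 'cromwell-workflow-logs/' "$file" # Cromwell logs (default name)
}
```

Calling `write_gitignore` more than once is safe, because existing patterns are not duplicated.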
Use rocker or Bioconductor images as the base for your own Docker image containing all packages and software libraries required to run the data analysis. Keep Docker-related files separate from other files to reduce the Docker build context. Do not add data files to the Docker image. Tag your Docker image according to your analysis version.
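To keep the image tag in sync with the analysis version, the tag can be derived from a single source instead of being typed by hand. The sketch below assumes a one-line `VERSION` file at the repository root; both that file and the `image_tag` helper are hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print "<image>:<version>", reading the version from a one-line file.
image_tag() {
  local image="$1" version_file="${2:-VERSION}"
  local version
  version="$(tr -d '[:space:]' < "$version_file")"
  printf '%s:%s\n' "$image" "$version"
}

# Example call, assuming VERSION contains the analysis version:
# docker build -t "$(image_tag reproducible-analysis)" docker
```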
Build the Docker image:

    docker build -t reproducible-analysis:1.1.0 docker

Run RStudio:

    docker container run \
      --rm \
      --detach \
      -e DISABLE_AUTH=true \
      --volume "$PWD":/home/rstudio \
      --publish 8787:8787 \
      -e USERID="$(id -u)" \
      -e GROUPID="$(id -g)" \
      reproducible-analysis:1.1.0

RStudio Server will be available at http://localhost:8787.
Replace -e DISABLE_AUTH=true with -e PASSWORD=secret to set a password.
Compile the R Markdown file without running RStudio:

    docker container run \
      --rm \
      --volume "$PWD":/home/rstudio \
      --user "$(id -u)":"$(id -g)" \
      -w /home/rstudio \
      reproducible-analysis:1.1.0 \
      R -e "rmarkdown::render('data-analysis.Rmd')"

Alternatively, use Docker Compose:

    docker-compose up

RStudio Server will be available at http://localhost:8787.
Add --build to force Docker to rebuild the image.
Add -d to detach the process from the terminal. Run docker-compose down to stop the service.
Use semantic versioning and GitHub releases. For example, x.y.z, where:
- x: major version of the data analysis. Change only when some software is replaced, such as the sequence aligner.
- y: minor version. Change when some software is updated to a newer version.
- z: patch version. Change when you fix a bug or typo.
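The versioning rule above can be sketched as a small helper. `bump_version` is a hypothetical function, not something this repository provides, and it assumes plain x.y.z versions without pre-release or build suffixes.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print VERSION (x.y.z) with PART (major, minor, or patch) incremented;
# lower-order parts are reset to zero.
bump_version() {
  local version="$1" part="$2"
  local major minor patch
  IFS=. read -r major minor patch <<< "$version"
  case "$part" in
    major) major=$((major + 1)); minor=0; patch=0 ;;  # software replaced
    minor) minor=$((minor + 1)); patch=0 ;;           # software updated
    patch) patch=$((patch + 1)) ;;                    # bug or typo fixed
    *) echo "unknown part: $part" >&2; return 1 ;;
  esac
  printf '%s.%s.%s\n' "$major" "$minor" "$patch"
}
```

For instance, replacing the sequence aligner in analysis 1.1.0 corresponds to `bump_version 1.1.0 major`, which prints `2.0.0`.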
Data should be published to a public repository, such as NCBI GEO or another NCBI database, according to the type of data.
The repository is meant to stay private until the paper describing the analysis is published.