Welcome to BCEM!
We’ve compiled this tutorial to share BCEM's reproducibility standards so that we can better document what we do, for the sake of our future selves, our collaborators, and ultimately, a world with better science. Reproducibility standards in the field of bioinformatics go well beyond the requirements established here (see a fully reproducible paper as an example), but we have decided to take things slowly and adopt steps that are manageable and realistic for researchers in our group. As we grow and develop our skills, we will move in that direction, keeping those examples as a North Star. For now, let's look at the areas we're currently covering.
This tutorial covers the following aspects of research data management:
- Project management
- Data storage
- File structure
- Naming conventions
- Shared resource usage
- Scripts conventions
- Version control
Every member of the lab needs to set up a project folder in a Cloud service of choice (Google Drive, OneDrive, Dropbox), in agreement with all project collaborators. This folder serves an essential purpose: to store the most important documents related to the project, so that all parties can follow the project's developments. A suggested file structure and naming conventions apply to all files in this folder (see sections XXX and YYY, below).
At first glance, you may wonder why this step is necessary, and why it is one of the first in the guide. After all, you may have joined us thinking about getting your hands dirty with data. But sometimes it is a good idea to slow down and consider where to look and what to look for. At first, it may seem that nothing is happening. In time, you'll realize this is a helpful roadmap, one that is meant to develop with your understanding of your project and the hints you receive from your results.
The most important aspect of the mind map is that you identify the key components that will help you meet your objectives. Here are some guiding questions to aid that process:
- Data acquisition: How? Where? Is it an experimental project, or are you downloading the data from open databases? In either case, be very explicit about the data sources.
- Data processing: How? Which tools?
- Data analysis: What are the expected results? Which techniques could help you reach your goals?
- Visualization: Which results are relevant? What do the resulting analyses show?
- Final results: Where are these stored?
We suggest using a service such as diagrams.net to produce the actual map. That's not binding, though: you may use any tool that gets you there. Ideally, at every step of the way, an up-to-date version of the mind map should be in your project folder in the chosen Cloud service.
Here's an example for a project based on experimental data collection:
Here's an example for a project based on data acquired from public databases:
This is a mandatory piece of documentation accompanying all data sets used in the project that details the source and the process of data acquisition and processing. We abide by the standards on Minimum Information about a Genome Sequence (MIGS), which are already adopted by specific repositories of genome sequence data such as the European Nucleotide Archive (ENA).
Raw data must be stored under our lab's ENA account immediately upon receipt. The guidelines for submission are as follows:
Our lab requires that any process of data acquisition and/or processing be properly documented so that the work is as transparent and reproducible as possible. There are two options for this documentation process: using an Electronic Lab Notebook (ELN), specifically RSpace (our lab has a centralized account with this provider to manage multiple projects by all members), or keeping detailed digital logs in a Markdown (.md) document. Suggested tools for the latter are Jupyter notebooks, Zettlr, and Typora. The platform does not matter as long as the result is a Markdown document.
These documents must:
- Have one document per flowchart (mind map) component
- Contain the following sections for each entry:
  - Date
  - Aim
  - Protocol followed
  - Command lines or methodology in the lab
  - Third-party software (description of how it was used, under what parameters, include a link to the tutorial(s))
  - Results
    - Must include relevant tables, graphs, etc. (or links to where these are stored, in case of large files)
    - Must be commented (interpretations of what has been found)
  - Indication of where the (intermediate) data was deposited (path, link)
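To make the required sections easier to apply, here is a minimal sketch of a helper that prints an empty Markdown log entry. The function name `new_log_entry`, the heading levels, and the `TODO` placeholders are illustrative assumptions, not a lab standard; the section titles are paraphrased from the list above.

```python
from datetime import date
from typing import Optional

# Section titles paraphrased from the requirements above.
# The Markdown layout (heading levels, TODO markers) is an assumption.
SECTIONS = [
    "Aim",
    "Protocol followed",
    "Command lines or methodology in the lab",
    "Third-party software",
    "Results",
    "Interpretation of results",
    "Location of intermediate data",
]


def new_log_entry(component: str, entry_date: Optional[date] = None) -> str:
    """Return an empty Markdown log entry for one mind-map component."""
    entry_date = entry_date or date.today()
    # The entry title carries the date and the component it belongs to.
    lines = [f"## {entry_date.isoformat()} - {component}", ""]
    for section in SECTIONS:
        lines += [f"### {section}", "", "TODO", ""]
    return "\n".join(lines)
```

Calling `print(new_log_entry("Data processing"))` produces a skeleton that can be pasted into the component's `.md` log and filled in.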
This is the suggested (required?) file structure for the folder ...
- 01_Quality
- 02_Trimming
- 03_Quality_Trimming
- 04_Assembly
- 05_Results_Figures
- 06_Results_Tables
- 07_Manuscript
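The folder layout above can be created in one step. The following is a small sketch using only the Python standard library; the function name `create_project_structure` and the example project path are hypothetical.

```python
from pathlib import Path

# Folder names copied from the list above.
PROJECT_FOLDERS = [
    "01_Quality",
    "02_Trimming",
    "03_Quality_Trimming",
    "04_Assembly",
    "05_Results_Figures",
    "06_Results_Tables",
    "07_Manuscript",
]


def create_project_structure(project_root):
    """Create the numbered folders under project_root and return their paths."""
    root = Path(project_root)
    created = []
    for name in PROJECT_FOLDERS:
        folder = root / name
        # exist_ok=True makes the call safe to repeat on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

For example, `create_project_structure("MyProject")` creates the seven folders inside `MyProject` without touching any that already exist.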
These are the conventions adopted by our lab to ensure as much as possible an understanding of what is contained in a file: …
The minimum requirements for a script include:
- The name of the file must be consistent with the function implemented.
- Follow a standard naming notation (camel case):
- For example: NotationCamel
- Name
- Description
- Author
- Institution
- Contact email
- Date: When was it implemented
- Help (input, output) - how to run
- Requirements (dependencies and their versions)
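As an illustration of these fields, here is a sketch of a script header written as a Python docstring. The script name `CountFastaRecords.py`, its task, and the stated Python version are assumptions made up for this example; the bracketed values are placeholders to fill in.

```python
"""
Name:         CountFastaRecords.py
Description:  Count the sequence records in a FASTA file.
Author:       <your name>
Institution:  BCEM
Contact:      <your email>
Date:         <YYYY-MM-DD, when it was implemented>
Requirements: Python >= 3.8 (standard library only)

Help:
    Input:  path to a FASTA file
    Output: number of records, printed to standard output
    Run:    python CountFastaRecords.py reads.fasta
"""
import argparse


def count_fasta_records(lines):
    """Count header lines (those starting with '>') in an iterable of lines."""
    return sum(1 for line in lines if line.startswith(">"))


def main(argv=None):
    """Parse the command line, read the FASTA file, and print the record count."""
    parser = argparse.ArgumentParser(
        description="Count the sequence records in a FASTA file."
    )
    parser.add_argument("fasta", help="path to the input FASTA file")
    args = parser.parse_args(argv)
    with open(args.fasta) as handle:
        print(count_fasta_records(handle))

# When saved as a script, end the file with the usual guard:
#     if __name__ == "__main__":
#         main()
```

Keeping the file name, the description, and the function name aligned (all refer to counting FASTA records) follows the first requirement in this section, and the `argparse` help text doubles as the "how to run" entry.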
There must be one README per project module.
- Version
- Parameters
- Information needed before
- Order in which the script should be run
- Data structure (input and output)
- Dependencies (versions)
- TYPORA
- Results graphs (if applicable)
Here's a repository containing an example of an ideal script and its associated README file:
Note that this is not a requirement yet; it is intended for advanced users only. Link to a tutorial?
- Each commit must be adequately described: consistent, without omitting information
- DO NOT commit incomplete or unstable versions of the script
- Teamwork: create n work branches, work on the branch that corresponds to your task, then merge and push
- Execute push only on the main work branch.
- Only push to the master branch once ...(?)
- Main folder will be the project with a README and a workflow
- Within each project there are modules and each module folder must contain a README file
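The README rule for module folders can be checked automatically. The sketch below assumes that every direct subfolder of the project is a module and that a README is any file whose name starts with "readme" (case-insensitive); both are assumptions, and the function name `modules_missing_readme` is hypothetical.

```python
from pathlib import Path


def modules_missing_readme(project_root):
    """Return the names of module folders that do not contain a README file."""
    root = Path(project_root)
    missing = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        if folder.name.startswith("."):
            continue  # skip hidden folders such as .git
        has_readme = any(
            f.is_file() and f.name.lower().startswith("readme")
            for f in folder.iterdir()
        )
        if not has_readme:
            missing.append(folder.name)
    return missing
```

Running `modules_missing_readme(".")` from the project's main folder before pushing gives a quick list of modules that still need documentation.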