General information:
- Introduction
- List of pipeline steps
- Pipeline and database initialization
- Taxa, genomes, ontologies (e.g., Uberon)
- Raw expression data analyses and insertion: RNA-Seq, Affymetrix, in situ hybridization, EST analyses.
- Bgee post-processing steps
Shortcut note: for the RNA-Seq analysis pipeline, see RNA_Seq/.
Developer guidelines
Throughout the documentation, RELEASE denotes the current Bgee version (e.g., if the current release number is 15, bgee_vRELEASE means bgee_v15).
Each step in the Bgee pipeline is represented by a specific folder, containing a Makefile, and related scripts. Variables common to several steps are defined in the file pipeline/Makefile.common. Sensitive variables are stored in the file pipeline/Makefile.Config.
Each Makefile ultimately generates an output file, called step_verification_RELEASE.txt, in the corresponding output folder.
This file is generated for the Makefile to determine whether a step should be re-run, and for developers to control that the step was correctly executed.
These files are committed to git, so that results can be compared between releases.
They are not meant to be the actual output of the Makefiles but rather small files to be added to git, serving as a control of the procedures.
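The verification-file idea can be sketched outside of make as follows. This is only an illustration under stated assumptions: `run_step_once` is a hypothetical helper, not part of the pipeline, and the real Makefiles may decide differently when a step must be re-run.

```shell
# Hypothetical helper illustrating the verification-file mechanism.
# Usage: run_step_once OUTPUT_DIR RELEASE COMMAND [ARGS...]
# Runs COMMAND only when the verification file is absent, and stores
# the command's standard output as the content of that file.
run_step_once() {
    out_dir=$1
    release=$2
    shift 2
    verif="$out_dir/step_verification_${release}.txt"

    # An existing verification file means the step is considered done.
    if [ -e "$verif" ]; then
        echo "up to date: $verif"
        return 0
    fi

    mkdir -p "$out_dir"
    # Write to a temporary file first, so that a failed step
    # does not leave a verification file behind.
    if "$@" > "$verif.tmp"; then
        mv "$verif.tmp" "$verif"
    else
        rm -f "$verif.tmp"
        return 1
    fi
}
```

Calling the helper twice with the same output folder and release runs the command only the first time. Deleting the verification file makes the next call run it again.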
- Pipeline initialization: see init/.
- Database creation: see db_creation/.
- Species and taxon information: see species/.
- Genomes and gene-related information: see genes/.
- Anatomical ontology (Uberon) and developmental stage ontologies: see uberon/.
- RNA-Seq data analyses: see RNA_Seq/.
- Affymetrix data analyses: see Affymetrix/.
- In situ hybridization data analyses: see In_situ/.
- EST data analyses: see ESTs/.
- Differential expression analyses: see Differential_expression/.
- Annotation sanity checks: see post_processing/.
- Propagation/reconciliation of present/absent expression calls: see post_processing/.
- Computations of expression rank scores: see post_processing/.
- Generation of files containing data available for download: see download_files/.
- Generation of XRefs to UniProt: see download_files/.
- Insertion of information about versions of the data sources used: see db_creation/Makefile, target update_data_sources.sql.
At each step of the pipeline, you will need to update the file db_creation/update_data_sources.sql, which keeps track of the versions of the data sources used for the current release. This file is used at the end of the pipeline run to insert this information into the database. This information is not managed by the Makefiles because the ways of obtaining it differ too much between data sources (sometimes you have to look at the home page of the website, sometimes at a specific file; sometimes you cannot use the modification date of the file and need to look for a release date inside the file, etc.).
Before running the pipeline on a specific machine, you need to configure it:
- In Makefile.Config: edit this file with the correct values of logins and passwords. The correct values should not be versioned! (This is easier than encrypting the file.)
- In Makefile.common, edit the following variables as needed:
  - RELEASE: version of Bgee for which the pipeline is being run
  - ENSRELEASE: version of Ensembl used
  - TMP DIR: where to store (potentially large) temporary files
  - Servers and ports configuration:
    - DBHOST and DBPORT, for the MySQL database
    - ANNOTATORHOST, denoting the server storing the Affymetrix raw data and the local Ensembl version
    - DATAHOST, an additional backup machine
    - PIPEHOST, the name of the machine on which the pipeline is run
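A quick pre-flight check can catch a forgotten setting before a long run. The sketch below is an assumption-laden illustration: `check_config` is a hypothetical helper, and it assumes that the variables have been exported into the shell environment, whereas the pipeline itself reads them from Makefile.common. The temporary-directory variable is left out because its exact name is not given above.

```shell
# Hypothetical helper: report which of the configuration variables
# listed above are unset or empty in the current environment.
check_config() {
    missing=0
    for name in RELEASE ENSRELEASE DBHOST DBPORT ANNOTATORHOST DATAHOST PIPEHOST; do
        # Indirect lookup: read the value of the variable named by $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    # Exit status 0 when everything is set, 1 otherwise.
    return "$missing"
}
```

The function prints one line per missing variable on standard error, so it can be used as a guard, for example `check_config && make`.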
To re-run the last operation performed by a pipeline step, remove its step_verification_RELEASE.txt file.
To re-run a step entirely from scratch, use the command make clean.
In that case, data inserted into the database are not cleaned automatically, for safety; you need to remove the inserted data yourself.
This documentation often explains how to do it.
The command make clean only takes care of the generated files.
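From the shell, the two re-run modes could look like the sketch below. Both function names are hypothetical, the path of the verification file must be supplied by the caller because it depends on the step's output folder, and it is an assumption that invoking make without a target in a step folder runs that step.

```shell
# Re-run only the last operation of a step:
# delete its verification file, then invoke make in the step folder.
rerun_last_operation() {
    step_dir=$1      # e.g. one of the step folders listed above
    verif_file=$2    # path to that step's step_verification_RELEASE.txt
    rm -f "$verif_file"
    make -C "$step_dir"
}

# Re-run a step from scratch. "make clean" removes generated files only;
# data already inserted into the database must be removed by hand first.
rerun_from_scratch() {
    step_dir=$1
    make -C "$step_dir" clean && make -C "$step_dir"
}
```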