This is the main genome analytics workflow powering the production analysis of whole genome samples for the Singapore National Precision Medicine (NPM) Program Phase 1A, sometimes also referred to as SG10K Health. It processes samples from FASTQ to lossless CRAM, computes multiple QC metrics, and produces Freebayes variant calls and GATK4 gVCFs.
To ensure reproducibility, scalability and portability, the workflow is implemented as a Nextflow recipe and uses containers (Singularity on NSCC's Aspire 1 and Docker on AWS Batch). Container building is simplified by the use of Bioconda.
All results can be found in the results folder of a pipeline execution. Results there are grouped per sample, with the exception of Goleft indexcov, which summarises over the sample set.
- GATK4 gVCF (indexed): {sample}/{sample}.g.vcf.gz
- Freebayes VCF (Q>=20; indexed): {sample}/{sample}.fb.vcf.gz
- CRAM (lossless, with OQ, indexed): {sample}/{sample}.bqsr.cram
- Goleft indexcov: indexcov/all/ (main file: indexcov/all/all.html)
- Samtools stats: {sample}/stats/ (main files: {sample}/stats/{sample}.stats and {sample}/stats/{sample}.html)
- Verifybamid for the three ethnicities: {sample}/verifybamid/ (main files: {sample}/verifybamid/{sample}.SGVP_MAF0.01.{ethnicity}.selfSM)
- Coverage as per SOP: {sample}/{sample}.cov-062017.txt
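
The output layout listed above can be checked programmatically after a run. The following is a minimal sketch, not part of the workflow itself: the function names `expected_outputs` and `cohort_outputs` are hypothetical, the results folder is assumed to be called `results`, and the ethnicity labels must be supplied by the caller because they are not spelled out here. Index files are left out since their exact names are not listed.

```python
from pathlib import PurePosixPath


def expected_outputs(sample, ethnicities, results_dir="results"):
    """Return the main per-sample output paths for one sample.

    `ethnicities` holds the labels used in the verifybamid file names;
    these are an assumption and must be provided by the caller.
    """
    sample_dir = PurePosixPath(results_dir) / sample
    paths = [
        sample_dir / f"{sample}.g.vcf.gz",         # GATK4 gVCF
        sample_dir / f"{sample}.fb.vcf.gz",        # Freebayes VCF
        sample_dir / f"{sample}.bqsr.cram",        # lossless CRAM
        sample_dir / "stats" / f"{sample}.stats",  # Samtools stats
        sample_dir / "stats" / f"{sample}.html",
        sample_dir / f"{sample}.cov-062017.txt",   # coverage as per SOP
    ]
    # One verifybamid result per ethnicity label
    paths += [
        sample_dir / "verifybamid" / f"{sample}.SGVP_MAF0.01.{eth}.selfSM"
        for eth in ethnicities
    ]
    return [str(p) for p in paths]


def cohort_outputs(results_dir="results"):
    """Return the main output that summarises over the whole sample set."""
    return [str(PurePosixPath(results_dir) / "indexcov" / "all" / "all.html")]
```

Comparing the returned paths against the files actually present in the results folder gives a quick completeness check for a finished execution.
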
- We share this code for transparency. This is not meant to be a generic whole genome workflow for wider use, but rather specific to the program's needs. For the same reason this documentation is rudimentary.
- See this file for the execution DAG
- GATK commandline parameters are based on the official WDL implementation
- Developers: work on devel or feature branches. Only merge to master if tests/run.sh completes successfully.
The workflow was implemented at the Genome Institute of Singapore (GIS) by:
- Lavanya VEERAVALLI veeravallil@gis.a-star.edu.sg
- Andreas WILM wilma@gis.a-star.edu.sg