Parabricks-Genomics-nf is a GPU-enabled pipeline for alignment and germline short variant calling for short read sequencing data. The pipeline utilises NVIDIA's Clara Parabricks toolkit to dramatically speed up the execution of best practice bioinformatics tools. Currently, this pipeline is configured specifically for NCI's Gadi HPC.
NVIDIA's Clara Parabricks can deliver a significant speed improvement over traditional CPU-based methods and is designed for use only with NVIDIA GPUs. This pipeline is suitable for population screening projects: it executes Parabricks' implementations of BWA-MEM for short read alignment and Google's DeepVariant for short variant calling. Additionally, it uses standard CPU implementations of the data quality evaluation tools FastQC and MultiQC, and DNAnexus' GLnexus for scalable gVCF merging and joint variant calling. Optionally, Variant Effect Predictor (VEP) can be run for variant annotation.
DeepVariant and GLnexus have been shown to outperform GATK's HaplotypeCaller and GenomicsDB in terms of accuracy and scalability (Yun et al. 2020, Barbitoff et al. 2022, Lin et al. 2022, Abdelwahab et al. 2023). DeepVariant employs a deep learning technique to call short variants (SNVs and indels) which leads to higher accuracy and sensitivity compared with traditional variant callers. However, it has only been trained on human data.
Users intending to use this pipeline for non-human species should consider the findings of this post regarding Mendelian violation rates to determine whether this pipeline meets their needs; please submit a feature request issue on this repository to request support for non-human species. GLnexus is a scalable joint genotyper that can merge gVCFs from multiple samples and call variants in a single step.
NCI has installed a slightly customised version of Nextflow on Gadi that simplifies the use of the system-specific `clusterOptions` directive. See NCI's Nextflow guide for details and the `gadi.config` for how this has been implemented in this pipeline.
Download the code contained in this repository with:
```
git clone https://github.com/Sydney-Informatics-Hub/Parabricks-Genomics-nf.git
```
This will create a directory with the following structure:
```
Parabricks-Genomics-nf/
├── assets/
├── bin/
├── config/
├── LICENSE
├── main.nf
├── modules/
├── nextflow.config
└── README.md
```
The important features are:
- `main.nf` contains the main Nextflow script that calls all the processes in the workflow.
- `nextflow.config` contains default parameters to use in the pipeline.
- `config/` contains infrastructure-specific configuration files.
- `modules/` contains individual process files for each step in the workflow.
- `bin/` contains scripts executed by the pipeline.
Keep in mind that this pipeline assumes all your data will be contained within one project code's `/scratch` and `/g/data` directories, as specified by the `--gadi_account` parameter. If you are working across multiple project codes, you will need to manually edit this line in the `gadi.config` to reflect this:

```
storage = "scratch/${params.gadi_account}+gdata/${params.gadi_account}"
```

As with all Gadi jobs, Nextflow processes run by this pipeline will only be able to access project directories specified by the `#PBS -l storage` directive.
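For example, to grant the pipeline access to a second project's storage, the line might be edited as follows (the project codes `aa00` and `bb11` here are placeholders, not part of the pipeline's defaults):

```groovy
// gadi.config: extend storage access to a second project's directories.
// 'aa00' and 'bb11' are placeholder project codes - substitute your own.
storage = "scratch/aa00+gdata/aa00+scratch/bb11+gdata/bb11"
```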
To run this pipeline you will need the following sample inputs:
- Paired-end fastq files
- Input sample sheet
This pipeline processes paired-end fastq files and is capable of processing multiple samples in parallel. Before running the pipeline, you will need to create a sample sheet containing all metadata for the cohort you are processing. This file must be comma-separated and contain a header and one row per sample. Columns should correspond to `sample`, `fq1`, `fq2`, `platform`, `library`, `center`, where:

- The header must be written as `sample,fq1,fq2,platform,library,center`
- `sample` is the unique identifier for the sample
- `fq1` is the path to the first fastq file
- `fq2` is the path to the second fastq file
- `platform` is the sequencing platform used (e.g. illumina)
- `library` is the library identifier (e.g. 1)
- `center` is the sequencing center (e.g. nfcoreraredisease)
For samples with more than one fastq pair, create a new line for each pair; the pipeline is designed to handle multiple fastq pairs per sample if required.
Here is an example of what your input file should look like. Note `sample1` has multiple fastq pairs across two lanes, while `sample2` and `sample3` have one fastq pair each:
```
sample,fq1,fq2,platform,library,center
sample1,/scratch/aa00/sample1_L001_1_1k.fastq.gz,/scratch/aa00/sample1_L001_2_1k.fastq.gz,illumina,1,Ramaciotti
sample1,/scratch/aa00/sample1_L002_1_1k.fastq.gz,/scratch/aa00/sample1_L002_2_1k.fastq.gz,illumina,1,Ramaciotti
sample2,/scratch/aa00/sample2_1_1k.fastq.gz,/scratch/aa00/sample2_2_1k.fastq.gz,illumina,1,Ramaciotti
sample3,/scratch/aa00/sample3_1_1k.fastq.gz,/scratch/aa00/sample3_2_1k.fastq.gz,illumina,1,Ramaciotti
```
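Before launching the pipeline, it can be worth sanity-checking your samplesheet. The following is a minimal sketch (not part of the pipeline) that verifies the header matches what is expected and that every row has exactly six comma-separated fields; the demo samplesheet and its paths are placeholders:

```shell
# Write a small demo samplesheet (paths are placeholders)
cat > samples.csv <<'EOF'
sample,fq1,fq2,platform,library,center
sample2,/scratch/aa00/sample2_1_1k.fastq.gz,/scratch/aa00/sample2_2_1k.fastq.gz,illumina,1,Ramaciotti
EOF

# Check the header matches exactly what the pipeline expects
head -n 1 samples.csv | grep -qx 'sample,fq1,fq2,platform,library,center' \
  && echo "header OK" || echo "header FAIL"

# Check every row has exactly 6 comma-separated fields
awk -F',' 'NF != 6 { print "FAIL: line " NR " has " NF " fields"; bad = 1 }
           END { if (!bad) print "fields OK" }' samples.csv
```

Running this on the example samplesheet above prints `header OK` and `fields OK`.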
When you run the pipeline, you will use the mandatory `--input` parameter to specify the location and name of the input file:

```
--input /path/to/samples.csv
```
To run this pipeline you will need the following reference files:
- A reference genome in FASTA format
- [Optional] A VEP cache for variant annotation with VEP
You will need to download a copy of the reference genome you would like to use. If you are working with a species that has a public reference genome, you can download FASTA files from the Ensembl, UCSC, or NCBI ftp sites.
If your reference fasta is not already indexed, the `bwa_index` process in this pipeline will index the provided fasta file using `bwa index`.
When you run the pipeline, you will use the mandatory `--ref` parameter to specify the location and name of the reference fasta file:

```
--ref /path/to/reference.fasta
```
An optional feature of this workflow is to annotate output variants with Ensembl's Variant Effect Predictor. You can opt to download the required cache using the `--download_vep_cache` flag, additionally specifying the species and assembly you require:

```
--download_vep_cache --vep_species homo_sapiens --vep_assembly GRCh38
```
For available species caches, please see the VEP cache download site to determine which species and assembly you need to specify. Please keep in mind that VEP requires you to specify species and assembly as they are named in these files.
The most basic run command for this pipeline is:
```
nextflow run main.nf --input samples.csv --ref /path/to/reference.fasta --gadi_account <account-code> -profile gadi
```
This will run the following processes:
- Check validity of provided inputs
- FastQC for summary of raw reads
- BWA index if bwa indexes for your provided reference fasta are not detected
- Parabricks' fq2bam for performing read alignment and generating bam files for individuals
- Parabricks' deepvariant for performing SNV/indel genotyping for individuals
- Parabricks' bammetrics for generating alignment metrics
- GLnexus for joint genotyping of all individuals and creation of a cohort-level BCF file
- BCFtools convert to convert the cohort BCF to a VCF
- BCFtools stats for summary statistics of the cohort VCF
- MultiQC for generating a summary html report of all processes
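On Gadi, the Nextflow head job is commonly launched from inside a PBS job of its own. The wrapper below is a sketch under assumptions, not part of the pipeline: the queue, walltime, resource requests, module name, and the `aa00` project code are all placeholders to be checked against NCI's documentation for your project.

```shell
# Generate a PBS wrapper script for the Nextflow head job.
# All resource requests and the 'aa00' project code are placeholders.
cat > run_pipeline.pbs <<'EOF'
#!/bin/bash
#PBS -P aa00
#PBS -q normal
#PBS -l walltime=48:00:00
#PBS -l ncpus=1
#PBS -l mem=8GB
#PBS -l storage=scratch/aa00+gdata/aa00
#PBS -l wd

module load nextflow
nextflow run main.nf --input samples.csv --ref reference.fasta \
    --gadi_account aa00 -profile gadi
EOF

echo "submit with: qsub run_pipeline.pbs"
```

Note the `#PBS -l storage` directive must cover every project directory the pipeline reads from or writes to, matching the `storage` setting in `gadi.config`.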
Additionally, you can run variant annotation with Variant Effect Predictor by adding the `--download_vep_cache`, `--vep_species`, and `--vep_assembly` flags as specified in section 2.2 above:

```
nextflow run main.nf --input <samplesheet.csv> --ref <ref.fasta> --download_vep_cache --vep_species <species> --vep_assembly <genome assembly> --gadi_account <account-code> -profile gadi
```
In addition to all steps above, this will run:
- VEP download to collect the cache required for variant annotation
- VEP annotate to annotate the cohort VCF with the downloaded cache data
You will need to identify which species and assembly you require for the VEP cache download. See section 2.2 above for instructions on how to do this. For example, running on a human dataset, aligned to the GRCh38 reference genome, you would use the following command:
```
nextflow run main.nf --input samplesheet.csv --ref path/to/GRCh38.fasta --download_vep_cache --vep_species homo_sapiens --vep_assembly GRCh38 --gadi_account <account-code> -profile gadi
```
If you are working across multiple project spaces on Gadi and need access to storage beyond that of `--gadi_account <account-code>`, you can use `--storage_account <account-code>` to specify access to storage in addition to the `/g/data` and `/scratch` directories connected to the `--gadi_account` project.
You can see all flags supported by this pipeline by running:
```
nextflow run main.nf --help
```
By default, all outputs of this pipeline will be saved to a directory called `results`. You can customise the name of this directory using the `--outdir` flag. The structure of the saved outputs is:
```
results/
├── annotations/
├── bams/
├── fastqc/
├── multiqc/
├── variants/
└── VEP_cache/
```
- `fastqc/` contains fastq-level summary reports organised into sample-level subdirectories
- `bams/` contains alignment files and QC reports organised into sample-level subdirectories
- `variants/` contains sample-level g.vcf files organised into sample-level subdirectories, a cohort-level bcf and vcf file, and a cohort-level variant summary file
- `annotations/` contains an annotated cohort-level VEP file and a cohort-level summary report
- `multiqc/` contains a summary report of all processes run by the pipeline
- `VEP_cache/` contains the downloaded VEP cache files used for variant annotation
As a first port of call, we recommend looking through the MultiQC report to confirm there are no unexpected results or quality issues with any samples.
To run this pipeline on infrastructures other than NCI Gadi, you will need to create a custom configuration file and include it within the `profiles {}` section of `nextflow.config`.
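As a sketch, a custom profile for a Slurm-based cluster might look like the following; the profile name, executor, queue, and container settings are assumptions to be replaced with your own system's details:

```groovy
// nextflow.config: hypothetical profile for a non-Gadi cluster.
// Executor, queue, and container settings are placeholders.
profiles {
    my_cluster {
        process.executor       = 'slurm'
        process.queue          = 'gpu'
        singularity.enabled    = true
        singularity.autoMounts = true
    }
}
```

The pipeline would then be launched with `-profile my_cluster` in place of `-profile gadi`.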
| Metadata field | Parabricks-Genomics-nf / v1.0.0 |
|---|---|
| Version | 1.0.0 |
| Maturity | First release |
| Creators | Georgie Samaha |
| Source | NA |
| License | GNU General Public License v3.0 |
| Workflow manager | Nextflow |
| Container | See component tools |
| Install method | NA |
| GitHub | https://github.com/Sydney-Informatics-Hub/Parabricks-Genomics-nf |
| bio.tools | NA |
| BioContainers | NA |
| bioconda | NA |
To run this pipeline you must have Nextflow and Singularity installed. All other tools are run as containers using Singularity.
| Tool | Version |
|---|---|
| Nextflow | >=20.07.1 |
| Singularity | |
| Parabricks | 4.2.0 |
| BCFtools | 1.17 |
| BWA | 0.7.17 |
| FastQC | 0.12.1 |
| GLnexus | 0.2.7 |
| MultiQC | 1.21 |
| VEP | 110.1 |
Parabricks error reporting can be vague at times. If you receive the following error:

```
cudaSafeCall() failed at ParaBricks/src/samGenerator.cu/819: an illegal memory access was encountered
```

check that your fasta is formatted correctly and that your input files are all correctly captured by the `storage` variable in the `nextflow.config`.
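A couple of quick checks can rule out the most common fasta formatting problems. This is a sketch under assumptions (the tiny demo file below stands in for your real reference):

```shell
# Minimal FASTA sanity checks: the first byte must be '>', and the
# file should not contain Windows (CRLF) line endings.
check_fasta() {
    local fasta="$1"
    if [ "$(head -c 1 "$fasta")" != ">" ]; then
        echo "FAIL: $fasta does not start with a FASTA header"
        return 1
    fi
    if grep -q $'\r' "$fasta"; then
        echo "FAIL: $fasta contains CRLF line endings"
        return 1
    fi
    echo "OK: $fasta passes basic checks"
}

# Demo on a tiny stand-in reference
printf '>chr_demo\nACGTACGT\n' > demo.fasta
check_fasta demo.fasta
```

Running the demo prints `OK: demo.fasta passes basic checks`; pointing `check_fasta` at your real reference works the same way.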
The download speed for this process can be inconsistent and is dependent on Ensembl's ftp server: sometimes it takes a couple of minutes, other times it can take hours. If you are having trouble downloading the cache, delete the specific work directory corresponding to the download process and rerun the pipeline using the `-resume` flag.
Submit an issue on this repository to get in contact with us.
Acknowledgements (and co-authorship, where appropriate) are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub and national compute facilities.
- Georgina Samaha (Sydney Informatics Hub, University of Sydney)
- Cali Willet (Sydney Informatics Hub, University of Sydney)
The authors acknowledge the support provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney. This research/project was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government, and the Australian BioCommons which is enabled by NCRIS via Bioplatforms Australia funding.
Samaha, G., Willet, C. (2024). Parabricks-Genomics-nf [Computer software]. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.836.1