Variant calling pipeline for whole Exome and whole Genome sequencing cANcer data
This pipeline was built for Whole Exome Sequencing and Whole Genome Sequencing analysis. It provides detailed quality controls of both frozen and FFPE samples as well as a first downstream analysis including mutation calling, structural variant and copy number analysis. Most of the pipeline steps can work for tumor/normal paired samples and tumor-only samples. VEGAN can run from raw fastq files or from intermediate results such as BAM/CRAM aligned files or VCF files.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner.
It comes with conda environments and singularity containers, making installation easier and results highly reproducible.
The first version of VEGAN was inspired by the nf-core Sarek pipeline, with several common processes, additional modifications and new analysis steps.
- Run quality control of raw sequencing reads (fastqc)
- Align reads on reference genome (bwa-mem, bwa-mem2, dragmap)
- Filtering and quality controls of aligned reads
  - Report mapping metrics (picard)
  - Mark and remove duplicates (markdup)
  - Library complexity analysis (Preseq)
  - Filtering aligned BAM files (SAMTools)
  - Insert size distribution (picard)
  - Identity monitoring (bcftools/R)
- GATK preprocessing (GATK)
- Germline Variants calling (haplotypecaller/bcftools)
  - HaplotypeCaller
  - Mutect2 (including learnReadOrientationModel, GetPileupSummaries, CalculateContamination)
  - FilterMutectCalls
- Technical filters for somatic variants (DP, VAF, MAF) (SnpSift, bcftools)
- Variants annotation (SnpEff/SnpSift)
- Copy-number analysis (ASCAT, FACETS)
- Structural variants analysis (MANTA)
- Biomarkers analysis
  - Microsatellite instability analysis (MSIsensor-pro)
  - Tumor Mutational Burden (pyTMB)
- Gather all QC results in a final report (MultiQC)
nextflow run main.nf --help
N E X T F L O W ~ version 21.10.6
Launching `main.nf` [lethal_torricelli] - revision: 4d570988d2
------------------------------------------------------------------------
_ _ _____ __ __ _____ ____ _ _ _
| \ | | | ___| \ \ / / | ____| / ___| / \ | \ | |
| \| | | |_ _____ \ \ / / | _| | | _ / _ \ | \| |
| |\ | | _| |_____| \ V / | |___ | |_| | / ___ \ | |\ |
|_| \_| |_| \_/ |_____| \____| /_/ \_\ |_| \_|
VEGAN v2.3.0
------------------------------------------------------------------------
Usage:
The typical command for running the pipeline is as follows:
nextflow run main.nf --profile STRING --samplePlan PATH --design PATH --step STRING --genome STRING --genomeAnnotationPath PATH
MANDATORY ARGUMENTS:
--design PATH Path to design file specifying the metadata associated with the samples
--genome STRING [hg19, hg19_base, hg38, hg38_base, mm10, mm39,...] Name of the reference genome.
--genomeAnnotationPath PATH PATH to the reference genome folder.
--profile STRING [test, multiconda, singularity, cluster, docker, conda, path, multipath] Configuration profile to use. Can use multiple (comma separated).
--step STRING [mapping, markduplicates, filtering, calling, annotate] Specify starting step
--outDir PATH The output directory where the results will be saved
--tools STRING [haplotypecaller, mutect2, manta, snpeff, facets, ascat, tmb, msisensor] Specify tools to use for variant calling
INPUTS:
--reads PATH Path to input data (must be surrounded with quotes)
--samplePlan PATH Path to sample plan (csv format) with raw reads (if `--reads` is not specified), or intermediate files according to the `--step` parameter
--singleEnd For single-end input data
--splitFastq Split fastq files in chunks
--fastqChunksSize INTEGER Reads chunks size
ALIGNMENT:
--aligner STRING [bwa-mem, bwa-mem2, dragmap] Specify tools to use for mapping
--cram Generate CRAM alignment files
--mapQual INTEGER Minimum mapping quality to consider for an alignment
--saveAlignedIntermediates Save intermediates alignment files
--splitFastq Split fastq files in chunks
FILTERING:
--keepDups Specify to keep duplicate reads when filtering the alignment
--keepMultiHits Specify to keep multi hit reads when filtering the alignment
--keepSingleton Specify to keep singleton reads when filtering the alignment
--targetBed PATH Target Bed file for targeted or whole exome sequencing
VARIANT CALLING:
--baseQual INTEGER Minimum base quality used by Facets for CNV calling
--saveVcfIntermediates Save intermediate vcf files
--saveVcfMetrics Save complementary vcf metrics files
--skipMutectContamination Do not apply the Contamination step for Mutect2 calls filtering
--skipMutectOrientationModel Do not apply the LearnOrientationModel step for Mutect2 calls filtering
TUMOR ONLY:
--msiBaselineConfig PATH PATH to Msisensor-pro baseline config file for tumor-only mode
--pon PATH PATH to panels of normals (.vcf.gz)
--ponIndex PATH PATH to panels of normals index file (.tbi)
VCF FILTERS:
--filterSomaticDP INTEGER Minimum sequencing depth to consider a somatic variant
--filterSomaticMAF INTEGER Maximum variant frequency in the general population to consider a somatic variant
--filterSomaticVAF INTEGER Minimum variant allele frequency to consider a somatic variant
ANNOTATION:
--annotDb STRING [cosmic, icgc, cancerhotspots, gnomad, dbnsfp] Annotation databases to use with SnpEff and SnpSift
--ffpe Specify to use the ffpe parameters and filters for TMB computation
SKIP OPTIONS:
--skipBQSR Disable BQSR
--skipBamQC Disable QCs on BAM files
--skipFastqc Disable Fastqc
--skipIdentito Disable Identito
--skipMultiqc Disable MultiQC
--skipSaturation Disable Preseq
OTHER OPTIONS:
--disableAutoClean Disable cleaning of work directory
--multiqcConfig PATH Specify a custom config file for MultiQC
--name STRING Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic
--sequencingCenter STRING Name of sequencing center to be displayed in BAM file
=======================================================
Available Profiles
-profile test Run the test dataset
-profile conda Build a new conda environment before running the pipeline. Use `--condaCacheDir` to define the conda cache path
-profile multiconda Build a new conda environment per process before running the pipeline. Use `--condaCacheDir` to define the conda cache path
-profile path Use the installation path defined for all tools. Use `--globalPath` to define the installation path
-profile multipath Use the installation paths defined for each tool. Use `--globalPath` to define the installation path
-profile docker Use the Docker images for each process
-profile singularity Use the Singularity images for each process. Use `--singularityPath` to define the installation path
-profile cluster Run the workflow on the cluster, instead of locally
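Because `--step` and `--tools` only accept the values listed in the help message, a small pre-flight check can catch a typo before a run is submitted. The sketch below is illustrative only: `all_allowed` is a hypothetical helper, not part of the pipeline, and the allowed values are simply copied from the help output above.

```shell
# Hypothetical helper: succeed if every item of a comma-separated list ($1)
# belongs to the space-separated set of allowed values ($2).
all_allowed() {
    for item in $(printf '%s' "$1" | tr ',' ' '); do
        case " $2 " in
            *" $item "*) ;;
            *) echo "unsupported value: $item"; return 1 ;;
        esac
    done
}

# Allowed values copied from the help message.
STEPS="mapping markduplicates filtering calling annotate"
TOOLS="haplotypecaller mutect2 manta snpeff facets ascat tmb msisensor"

# Check the values intended for --step and --tools before launching.
all_allowed "mapping" "$STEPS" \
    && all_allowed "manta,mutect2,snpeff" "$TOOLS" \
    && echo "options look valid"
```

If a value is rejected, compare it with the lists printed by `nextflow run main.nf --help`, which remain the reference.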
The pipeline can be run on any infrastructure from a list of input files or from a sample plan, as follows.
The test dataset is a downsampled Whole Exome Sequencing dataset. It can be launched with the following command:
# --step can also be set to filtering, calling or annotate
nextflow run main.nf -profile test,multiconda \
--step mapping \
--condaCacheDir /bioinfo/local/curie/ngs-data-analysis/centos/tools/containers/conda/vegan-2.0.0/ \
--genomeAnnotationPath /data/annotations/pipelines/
Run the pipeline for WES analysis from a sample plan, with specified tools and genome, on the cluster using singularity containers:
nextflow run main.nf -profile singularity,cluster \
--samplePlan samples-WES.csv \
--design samples.design.csv \
--step mapping \
--singularityImagePath /bioinfo/local/curie/ngs-data-analysis/centos/tools/containers/singularity/vegan-2.0.0/images/ \
--targetBed capture.bed \
--tools manta,mutect2,snpeff,facets,tmb,haplotypecaller,msisensor \
--genome hg38 --genomeAnnotationPath /data/annotations/pipelines/ \
-resume
Run the same WES analysis on the cluster, building new conda environments:
nextflow run main.nf -profile multiconda,cluster \
--samplePlan samples-WES.csv \
--design samples.design.csv \
--step mapping \
--targetBed capture.bed \
--tools manta,mutect2,snpeff,facets,tmb,haplotypecaller,msisensor,ascat \
--genome hg38 \
--genomeAnnotationPath /data/annotations/pipelines/ \
--condaCacheDir /bioinfo/local/curie/ngs-data-analysis/centos/tools/containers/conda/vegan-1.2.0/
To build new conda environments, point the --condaCacheDir parameter to an empty folder.
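One way to make sure the folder given to `--condaCacheDir` really is empty is sketched below. `prepare_conda_cache` is a hypothetical helper introduced for illustration, and the cache location is a placeholder.

```shell
# Hypothetical helper: create the folder if needed and succeed only if it is empty.
prepare_conda_cache() {
    mkdir -p "$1" || return 1
    if [ -n "$(ls -A "$1")" ]; then
        echo "$1 is not empty" >&2
        return 1
    fi
}

# Placeholder location; replace it with the cache path used on your system.
cache_dir=$(mktemp -d)/conda-cache
prepare_conda_cache "$cache_dir" && echo "use: --condaCacheDir $cache_dir"
```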
By default (without any profile), Nextflow will execute the pipeline locally, expecting that all tools are available from your PATH variable.
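Since this default mode relies on every tool being found on `PATH`, a quick check with `command -v` before launching can avoid a failed run. This is a minimal sketch: `check_tools` is a hypothetical helper, and the executable names are assumptions derived from the tool names mentioned in this README, so adapt them to the steps you actually run.

```shell
# Hypothetical helper: print each tool that cannot be found on PATH and
# return non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names below are assumptions, not an official requirement list.
check_tools nextflow fastqc bwa samtools bcftools \
    || echo "install the missing tools, or use a conda/container profile"
```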
In addition, we set up a few profiles that should allow you:
- to use containers instead of a local installation,
- to run the pipeline on a cluster instead of on a local architecture.
The description of each profile is available in the help message (see above).
Here are a few examples of how to set the profile option. See the full documentation for details.
## Run the pipeline locally, using a global environment where all tools are installed (built by conda for instance)
-profile path --globalPath INSTALLATION_PATH
## Run the pipeline on the cluster, using the Singularity containers
-profile cluster,singularity --singularityImagePath SINGULARITY_PATH
## Run the pipeline on the cluster, building new conda environments
-profile cluster,multiconda --condaCacheDir CONDA_CACHE
A sample plan is a csv file (comma separated) that lists all samples with their biological IDs, with no header.
The sample plan is expected to be created as below:
SAMPLE_ID,SAMPLE_NAME,PATH_TO_R1_FASTQ,[PATH_TO_R2_FASTQ]
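As an illustration, the sketch below writes a small paired-end sample plan and checks that every row has three or four comma-separated fields, as the layout above requires. The sample IDs, names and FASTQ paths are placeholders, and `check_sample_plan` is a hypothetical helper, not part of the pipeline.

```shell
# Hypothetical helper: succeed only if every row has 3 fields (single-end)
# or 4 fields (paired-end) and a non-empty ID, name and R1 path.
check_sample_plan() {
    awk -F',' '
        NF < 3 || NF > 4 || $1 == "" || $2 == "" || $3 == "" {
            printf "line %d: malformed row: %s\n", NR, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example sample plan: one tumor/normal pair, paired-end reads, no header.
plan=$(mktemp)
cat > "$plan" <<'EOF'
S1,patient1_normal,/path/to/patient1_normal_R1.fastq.gz,/path/to/patient1_normal_R2.fastq.gz
S2,patient1_tumor,/path/to/patient1_tumor_R1.fastq.gz,/path/to/patient1_tumor_R2.fastq.gz
EOF

check_sample_plan "$plan" && echo "sample plan looks well-formed"
```

This only checks the shape of the file. The pipeline performs its own validation when it starts.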
A design file is a csv file that lists all experimental samples, their IDs, the associated germline sample, the sex of the patient and the status (tumor / normal). The design file is expected to have the following header:
GERMLINE_ID,TUMOR_ID,PAIR_ID,SEX
Both files are checked by the pipeline and must be rigorously defined for the pipeline to work. Note that the control sample is optional when not available, but it is highly recommended. If the design file is not specified, the pipeline stops after the alignment: variant calling and annotation are skipped.
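The consistency between the two files can be sketched as follows. This example assumes that `GERMLINE_ID` and `TUMOR_ID` refer to the `SAMPLE_ID` column of the sample plan, and it accepts an empty `GERMLINE_ID` as one possible way to describe a tumor-only sample. Both points are assumptions to confirm in the full documentation. The IDs, the paths and the `XX` value for `SEX` are placeholders, and `check_design` is a hypothetical helper.

```shell
# Hypothetical helper: check the design header ($1 = design file) and that
# each GERMLINE_ID / TUMOR_ID is declared in the sample plan ($2).
check_design() {
    awk -F',' '
        NR == FNR { known[$1] = 1; next }      # first file read: sample plan
        FNR == 1 {                             # second file: design header
            if ($0 != "GERMLINE_ID,TUMOR_ID,PAIR_ID,SEX") {
                print "unexpected header: " $0
                bad = 1
            }
            next
        }
        ($1 != "" && !($1 in known)) || !($2 in known) {
            print "design line " FNR ": unknown sample ID"
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$2" "$1"
}

workdir=$(mktemp -d)
cat > "$workdir/samples.csv" <<'EOF'
S1,patient1_normal,/path/to/normal_R1.fastq.gz,/path/to/normal_R2.fastq.gz
S2,patient1_tumor,/path/to/tumor_R1.fastq.gz,/path/to/tumor_R2.fastq.gz
EOF
cat > "$workdir/design.csv" <<'EOF'
GERMLINE_ID,TUMOR_ID,PAIR_ID,SEX
S1,S2,pair1,XX
EOF

check_design "$workdir/design.csv" "$workdir/samples.csv" \
    && echo "design matches sample plan"
```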
- Installation
- Geniac
- Reference genomes
- Running the pipeline
- Profiles
- Output and how to interpret the results
- Troubleshooting
This pipeline has been written by the Institut Curie bioinformatics platform (PA. Nicolas, T. Gutman, F. Jarlier, F. Allain, P. La Rosa, P. Hupe, N. Servant). The project was funded by the European Union’s Horizon 2020 research and innovation programme and the Canadian Institutes of Health Research under the grant agreement No 825835 in the framework of the European-Canadian Cancer Network, as well as the Canceropole Ile de France (GENOPROFILE - RIC2021) project.
For any question, bug or suggestion, please open an issue or contact the bioinformatics core facility.