Nextflow workflow to process whole genome sequencing (WGS) data, starting from aligned .bam files. In order to run it, two files have to be filled in:
samplesheet.csv
: must contain at least 2 columns: sample (sample name) and bam (path to the bam file). Optionally, one can also provide sex (M for male or F for female, default is F if not provided), ploidy (default: 2; only used if CNAs are called with Control-FREEC), bam_control (path to the control bam, in which case only somatic variants will be reported), and bam_RNA (for allele-specific expression).nextflow.config
: see "Configuration" below.
The pipeline can then be run with: ./run_WGS.sh nextflow.config
(the script run_WGS.sh might have to be adapted depending on your computing environment).
The pipeline will automatically download the required data files. If this fails, you can manually download the data files from https://drive.google.com/drive/folders/104TKSc-yaJ2VhO47wcaBr4QKF2tjm_Ug?usp=drive_link and put them in a data
subdirectory of your project.
Different subworkflows can be run depending on what is specified in the configuration file.
Two workflows can call copy number alterations (CNAs) and structural variants (SVs), either with only a tumor bam file, or with matched tumor + normal bams.
Two alternative workflows are provided:
- manta + ControlFreec. This will be run if
run_manta_freec=true
is specified in the config file. - HMF pipeline. This will be run if
run_HMF=true
is specified in the config file. CNAs will be detected with amber and cobalt, SVs with GRIDSS2, and purple will be used to combine the SNV and CNA calls.
Both of these workflows give similar output. manta_freec is much faster, but the HMF pipeline has the advantage of integrating SV and CNA calls.
- plots: a directory containing circos plots as well as per-chromosome plot, generated with figeno.
- breakpoints.tsv: a tsv file containing breakpoints for all samples analyzed. Columns: sample, chr1, pos1, orientation1, chr2, pos2, orientation2.
- CNAs.tsv: a tsv file containing copy number alterations for all samples analyzed. Columns: sample, chr, start, end, cn.
- freec: a directory containing per-sample CNA information produced by Control-FREEC. 3 files per sample: {sample}_CNVs (called CNAs), {sample}_ratio.txt (ratios per 10kb bin: -1 if the bin was excluded, otherwise the copy number is ploidy*ratio), {sample}_info.txt (various information for the sample, including the inferred tumor purity). [if run_manta_freec is true]
- manta: a directory containing per sample SV information produced by manta. The important file is {sample}_SVfiltered.vcf, but unfiltered files are also provided and can be helpful to investigate why some SVs were filtered out. [if run_manta_freec is true]
- purple: SVs and copy numbers inferred by purple. [if run_HMF is true]
Two workflows can be used to detect heterozygous SNVs in a DNA bam file, and count allelic read counts for those variants in an RNA bam file. This can then be used to detect genes which are only expressed from one allele.
- fast_ase. It is the recommended method, since it is much faster. It will be run if
run_ASE=true
is specified in the config file. - GATK HaplotypeCaller + ASEReadCounter. This is much slower, but might give slightly more accurate results. It will be run if
run_ASE_GATK=true
is specified in the config file.
- Somatic SNVs can be called using mutect2. For now, this workflow only supports the hg19 reference. If
run_mutect2=true
is specified, it will detect SNVs present in the list of regions provided in themutect2_target_intervals
bed file. By default, this is a list of regions corresponding to genes commonly mutated in AML, but you can also provide a different list of regions. This workflow will only use a tumor bam file (no control), so in order to select somatic variants, it will filter out SNPs present in the gnomAD database, and only select nonsynonymous variants.
- if
run_pyjacker=true
is specified in the config file, pyjacker will be run to detect genes activated by enhancer hijacking in some samples of the cohort. To be able to run this option, you must:- provide a RNA_TPM_file in the config file. This is the gene expression matrix, where rows are genes and columns are samples (with the same names as in the samplesheet). There must be a column gene_id, and optionally a column gene_name.
- either set run_manta_freec to true OR set run_HMF to true OR provide breakpoints and CNAs in the config file (corresponding to breakpoints.tsv and CNAs.tsv of the pipeline outputs).
- either set run_ASE to true OR set run_ASE_GATK to true OR provide ase_dir in the config file (corresponding to ASE/ of the pipeline outputs).
Optionally, also provide fusions and enhancers.
See nextflow.config
the complete list of parameters that need to be provided.
Most parameters can be left to their default values. The most important ones are:
- samplesheet: path to the samplesheet (.csv)
- outDir: output directory.
- reference_fa: path to the fasta file of the reference genome used for alignment.
- reference_version: "hg19" or "hg38". These are the only two reference versions supported by the pipeline.
- use_control: whether or not to use the control samples, if they are provided. This will only be used for the CNA and SV calls.
- run_manta_freec, run_HMF, run_ASE, run_ASE_GATK, run_mutect2, run_pyjacker : booleans to specify which workflows to run (see above).