Skip to content

Initial tests of SV callers

Arnold Kuzniar edited this page Jan 26, 2018 · 1 revision

Note: The following descriptions are for calling structural variants on a Tumor-Normal sample with a single Tumor sample. VCF files output by the different callers all vary somewhat and require filtering on the calls that passed and that had acceptable quality. MaxVmem, Wallclock, CPU time, and total data output are based on looking at one or two runs of each caller. These are approximate values and should be exceeded to err on the safe side.

Input files (BAM): /hpc/cog_bioinf/cuppen/processed_data/external/HMFreg0002/

Output files (VCF):

base dir: /hpc/cog_bioinf/ridder/users/cshneider/Breast_Cancer_Pilot_outside_IAP/

Manta: Z424_Z424/results/variants/somaticSV.vcf.gz

Manta: Z424_manta/results/variants/diploidSV.vcf.gz (germline)

Delly: Delly_17patient_somatic/Z424_Delly_filtered2_SVs_copy/Z424_ALL_SVs.vcf (merged)

Delly: Delly_17patient_germline/Z424_germline_Delly_filtered2_SVs_copy/Z424_ALL_SVs.vcf

Lumpy: Z424/Lumpy/somatic_SVs/Z424/lumpy_Z424.vcf

GRIDSS: GRIDSS_output/Z424_GRIDSS_1pt4_SVs/Z424_GRIDSS_somatic.sv.vcf

Manta-1.1.0

Supports CRAM.

MaxVmem: 2 GB
Wallclock: 1.5 hrs
CPU time: 10 hrs

Total data output: 16 MB in .vcf.gz format.
To unzip: for m in 'ls *.vcf.gz'; do gunzip $m; done

First creates config file and then execute runWorkflow.py Python script slightly modified from Mark.
Used 8 threads.

Delly-0.7.7

Accepts sorted and indexed bam file for every sample and the reference genome to identify split-reads. Probably does not yet support CRAM. Delly performance improves (as found by Mark) when an exclusion list .excl is used which makes use of the -x flag to exclude all chromosomes except the ones that want to look at. Thus, exclude all but a single chromosome per each of the INS,DUP,INV,DEL SVs and two chromosomes unique pairwise for BND (translocations). Hence, for the whole human genome, including sex chromosomes X,Y and the Mitochondrial DNA, have 25 (22 + X + Y + MT) x 4 (INS,DUP,INV,DEL) + (25*24/2) (BND) = 100 + 300 = 400 scripts generated. Also need a .tsv file (tab separated values) with two rows and two columns to designate which samples are the tumour and which are the normal for each of these runs. Outputs are 400 .bcf files which first need to be converted to .vcf format and then concatenated together. Possible to run tumour-normal analysis with a call-filter step or with a double call-filter repeat step as described under Somatic SV calling.
Used 8 threads.

Python script slightly modified from Mark.

chrs =['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','X','Y','MT']

svtypes = [ "DEL", "INS", "INV", "DUP", "BND" ]

Example from Delly github:

delly call -t DEL -x hg19.excl -o t1.bcf -g hg19.fa tumor1.bam control1.bam

delly filter -t DEL -f somatic -o t1.pre.bcf -s samples.tsv t1.bcf

The .tsv file consists of two tab delimited columns:

Sample-ID     control
Sample-ID     tumor

where Sample-ID is the same as in the VCF file for the tumor and normal (control) sample.

Call step:

MaxVmem: 375 MB
Wallclock: 1 hr
CPU time: 8 hrs

Filter step:

MaxVmem: 4 MB
Wallclock: 1 s
CPU time: 4 s

Total data output: 24 MB

Converting and concatenating:

module load vcfbcf/bcftools
module load vcfbcf/vcftools 
for m in `ls *.pre.bcf`; do n=${m/\.pre\.bcf/_pre_bcf\.vcf}; bcftools view $m > $n; done
vcf-concat *.vcf > ALL_SVs.vcf

Lumpy-0.2.13

Does not suppot CRAM.

MaxVmem: 6 GB
Wallclock: 8 hrs
CPU time: 10.5 hrs

Total data output: 5 GB. Used 1 thread. Appears that Lumpy does not support multi-threading.

Courtesy of Robert.

#!/bin/bash

module load python

# run lumpy

mkdir -p Lumpy/somatic_SVs

run_dir=$PWD

tumor_bam=(*_T.bam) #(*/mapping/*T_dedup.realigned.bam)
echo "$tumor_bam"
tumor_bam=${tumor_bam[0]}
echo "$tumor_bam"
ref_bam=(*_B.bam) #(*/mapping/*R_dedup.realigned.bam)
echo "$ref_bam"
ref_bam=${ref_bam[0]}
echo "$ref_bam"

tumor_bam_name="${tumor_bam}" #"${tumor_bam#*/mapping/}"
echo "$tumor_bam_name"
ref_bam_name="${ref_bam}" #"${ref_bam#*/mapping/}"
echo "$ref_bam_name"
sample="${tumor_bam_name%%_*}"

echo "$sample"

mkdir $run_dir/Lumpy/somatic_SVs/$sample
cd $run_dir/Lumpy/somatic_SVs/$sample

# Step 1: Extract discordant pairs and split reads
echo "Processing $tumor_bam"
tumor_discordants_bam=${tumor_bam_name/\.bam/\.discordants\.bam}
/hpc/local/CentOS7/cog_bioinf/samtools-1.3/samtools view -b -F 1294 $run_dir/$tumor_bam > $tumor_discordants_bam

tumor_splitters_bam=${tumor_bam_name/\.bam/\.splitters\.bam}
/hpc/local/CentOS7/cog_bioinf/samtools-1.3/samtools view -h $run_dir/$tumor_bam \
| /hpc/local/CentOS7/cog_bioinf/lumpy-sv_v0.2.13/scripts/extractSplitReads_BwaMem -i stdin \
| /hpc/local/CentOS7/cog_bioinf/samtools-1.3/samtools view -Sb - > $tumor_splitters_bam

/hpc/local/CentOS7/cog_bioinf/sambamba_v0.6.0/sambamba index -t 2 $tumor_splitters_bam

# Step 1: Extract discordant pairs and split reads
echo "Processing $ref_bam"
ref_discordants_bam=${ref_bam_name/\.bam/\.discordants\.bam}
/hpc/local/CentOS7/cog_bioinf/samtools-1.3/samtools view -b -F 1294 $run_dir/$ref_bam > $ref_discordants_bam

ref_splitters_bam=${ref_bam_name/\.bam/\.splitters\.bam}
/hpc/local/CentOS7/cog_bioinf/samtools-1.3/samtools view -h $run_dir/$ref_bam \
| /hpc/local/CentOS7/cog_bioinf/lumpy-sv_v0.2.13/scripts/extractSplitReads_BwaMem -i stdin \
| /hpc/local/CentOS7/cog_bioinf/samtools-1.3/samtools view -Sb - > $ref_splitters_bam

/hpc/local/CentOS7/cog_bioinf/sambamba_v0.6.0/sambamba index -t 2 $ref_splitters_bam

# Step 2: Run lumpy
/hpc/local/CentOS7/cog_bioinf/lumpy-sv_v0.2.13/bin/lumpyexpress \
-K /hpc/local/CentOS7/cog_bioinf/lumpy-sv_v0.2.13/bin/lumpyexpress.config \
-B $run_dir/$tumor_bam,$run_dir/$ref_bam \
-S $tumor_splitters_bam,$ref_splitters_bam \
-D $tumor_discordants_bam,$ref_discordants_bam \
-o lumpy_$sample.vcf \

GRIDSS-1.4.0

Accepts coordinate sorted SAM/BAM/CRAM files.
Uses htsjdk as a SAM/BAM/CRAM/VCF parsing library.
Version 1.3.4 and version 1.4 tested on same data set.
Following dependencies: Java, R, BWA, (Bedtools I think optional).
Used 12 threads.

1.3.4: 
MaxVmem: 38 GB
Wallclock: 20 hrs
CPU time: 148 hrs
Total data output: 1 GB
1.4.0:
MaxVmem: 27 GB
Wallclock: 11 hrs
CPU time: 76 hrs
Total data output: 1 GB

Do not specify between 32-48 GB for memory. Use 50 GB (AK: _qsub -l h_vmem=50GB). Specify wall-clock time of 14 hours.

For running on HPC:

module load Java/1.8.0_60
module load aligners/bwa/0.7.13
module load R/3.3.0
module load bed/bedtools/2.25.0
This script is a slightly modified version of the somatic.sh from    
https://github.com/PapenfussLab/gridss/blob/master/example/somatic.sh

work_dir=/PATH/TO/WORKING/DIRECTORY

NORMAL=/PATH/TO/NORMAL/SAMPLE
TUMOUR=/PATH/TO/TUMOUR/SAMPLE
REFERENCE=/PATH/TO/REFERENCE/GENOME

OUTPUT=PATH/TO/NAME/FOR/EXAMPLE/GRIDSS_somatic.sv.vcf
ASSEMBLY=${OUTPUT/.sv.vcf/.gridss.assembly.bam}
GRIDSS_JAR=PATH/TO/GRIDSS/JAR/FILE

if [[ ! -f "$NORMAL" ]] ; then
	echo "Missing $NORMAL input file."
	exit 1
fi
if [[ ! -f "$TUMOUR" ]] ; then
	echo "Missing $TUMOUR input file."
	exit 1
fi
if ! which bwa >/dev/null 2>&1 ; then
	echo "Missing bwa. Please add to PATH"
	exit 1
fi
if [[ ! -f "$REFERENCE" ]] ; then
	echo "Missing reference genome $REFERENCE. Update the REFERENCE variable in the shell script to your hg19 location"
	exit 1
fi
if [[ ! -f "$REFERENCE.bwt" ]] ; then
	echo "Missing bwa index for $REFERENCE. Could not find $REFERENCE.bwt. Create an index or symlink the index files to the expected file names."
	exit 1
fi
if [[ ! -f $GRIDSS_JAR ]] ; then
	echo "Missing $GRIDSS_JAR. Update the GRIDSS_JAR variable in the shell script to your location"
	exit 1
fi
if ! which java >/dev/null 2>&1 ; then
	echo "Missing java. Please add java 1.8 or later to PATH"
	exit 1
fi
JAVA_VERSION="$(java -version 2>&1 | head -1)"
if [[ ! "$JAVA_VERSION" =~ "\"1.8" ]] ; then
	echo "Detected $JAVA_VERSION. GRIDSS requires Java 1.8 or later."
	exit 1
fi

java -ea -Xmx31g \
	-Dgridss.defensiveGC=true \			
	-Dsamjdk.create_index=true \
	-Dsamjdk.use_async_io_read_samtools=true \
	-Dsamjdk.use_async_io_write_samtools=true \
	-Dsamjdk.use_async_io_write_tribble=true \
  	-Dsamjdk.compression_level=1 \	
	-cp $GRIDSS_JAR gridss.CallVariants \		
	TMP_DIR=/PATH/TO/TEMPORARY/DIRECTORY \
	WORKER_THREADS=12 \			
	WORKING_DIR="$work_dir" \
	REFERENCE_SEQUENCE="$REFERENCE" \
  	INPUT="$NORMAL" \
  	INPUT="$TUMOUR" \
  	OUTPUT="$OUTPUT" \
	ASSEMBLY="$ASSEMBLY" \
	2>&1 | tee -a gridss.$HOSTNAME.$$.log

Note: Sometimes GRIDSS crashes (attach error message)