-
Notifications
You must be signed in to change notification settings - Fork 35
Initial tests of SV callers
Note: The following descriptions are for calling structural variants on a Tumor-Normal sample with a single Tumor sample. VCF files output by the different callers all vary somewhat and require filtering on the calls that passed and that had acceptable quality. MaxVmem, Wallclock, CPU time, and total data output are based on looking at one or two runs of each caller. These are approximate values and should be exceeded to err on the safe side.
Input files (BAM): /hpc/cog_bioinf/cuppen/processed_data/external/HMFreg0002/
Output files (VCF):
base dir: /hpc/cog_bioinf/ridder/users/cshneider/Breast_Cancer_Pilot_outside_IAP/
Manta: Z424_Z424/results/variants/somaticSV.vcf.gz
Manta: Z424_manta/results/variants/diploidSV.vcf.gz
(germline)
Delly: Delly_17patient_somatic/Z424_Delly_filtered2_SVs_copy/Z424_ALL_SVs.vcf
(merged)
Delly: Delly_17patient_germline/Z424_germline_Delly_filtered2_SVs_copy/Z424_ALL_SVs.vcf
Lumpy: Z424/Lumpy/somatic_SVs/Z424/lumpy_Z424.vcf
GRIDSS: GRIDSS_output/Z424_GRIDSS_1pt4_SVs/Z424_GRIDSS_somatic.sv.vcf
Supports CRAM.
MaxVmem: 2 GB
Wallclock: 1.5 hrs
CPU time: 10 hrs
Total data output: 16 MB in .vcf.gz
format.
To unzip: for m in 'ls *.vcf.gz'; do gunzip $m; done
First creates config file and then execute runWorkflow.py
Python script slightly modified from Mark.
Used 8 threads.
Accepts sorted and indexed bam file for every sample and the reference genome to identify split-reads. Probably does not yet support CRAM. Delly performance improves (as found by Mark) when an exclusion list .excl
is used which makes use of the -x
flag to exclude all chromosomes except the ones that want to look at. Thus, exclude all but a single chromosome per each of the INS,DUP,INV,DEL SVs and two chromosomes unique pairwise for BND (translocations). Hence, for the whole human genome, including sex chromosomes X,Y and the Mitochondrial DNA, have 25 (22 + X + Y + MT) x 4 (INS,DUP,INV,DEL) + (25*24/2) (BND) = 100 + 300 = 400 scripts generated. Also need a .tsv
file (tab separated values) with two rows and two columns to designate which samples are the tumour and which are the normal for each of these runs. Outputs are 400 .bcf
files which first need to be converted to .vcf
format and then concatenated together. Possible to run tumour-normal analysis with a call-filter step or with a double call-filter repeat step as described under Somatic SV calling.
Used 8 threads.
Python script slightly modified from Mark.
chrs =['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','X','Y','MT']
svtypes = [ "DEL", "INS", "INV", "DUP", "BND" ]
Example from Delly github:
delly call -t DEL -x hg19.excl -o t1.bcf -g hg19.fa tumor1.bam control1.bam
delly filter -t DEL -f somatic -o t1.pre.bcf -s samples.tsv t1.bcf
The .tsv file consists of two tab delimited columns:
Sample-ID control
Sample-ID tumor
where Sample-ID is the same as in the VCF file for the tumor and normal (control) sample.
Call step:
MaxVmem: 375 MB
Wallclock: 1 hr
CPU time: 8 hrs
Filter step:
MaxVmem: 4 MB
Wallclock: 1 s
CPU time: 4 s
Total data output: 24 MB
Converting and concatenating:
module load vcfbcf/bcftools
module load vcfbcf/vcftools
for m in `ls *.pre.bcf`; do n=${m/\.pre\.bcf/_pre_bcf\.vcf}; bcftools view $m > $n; done
vcf-concat *.vcf > ALL_SVs.vcf
Does not suppot CRAM.
MaxVmem: 6 GB
Wallclock: 8 hrs
CPU time: 10.5 hrs
Total data output: 5 GB. Used 1 thread. Appears that Lumpy does not support multi-threading.
Courtesy of Robert.
#!/bin/bash
module load python
# run lumpy
mkdir -p Lumpy/somatic_SVs
run_dir=$PWD
tumor_bam=(*_T.bam) #(*/mapping/*T_dedup.realigned.bam)
echo "$tumor_bam"
tumor_bam=${tumor_bam[0]}
echo "$tumor_bam"
ref_bam=(*_B.bam) #(*/mapping/*R_dedup.realigned.bam)
echo "$ref_bam"
ref_bam=${ref_bam[0]}
echo "$ref_bam"
tumor_bam_name="${tumor_bam}" #"${tumor_bam#*/mapping/}"
echo "$tumor_bam_name"
ref_bam_name="${ref_bam}" #"${ref_bam#*/mapping/}"
echo "$ref_bam_name"
sample="${tumor_bam_name%%_*}"
echo "$sample"
mkdir $run_dir/Lumpy/somatic_SVs/$sample
cd $run_dir/Lumpy/somatic_SVs/$sample
# Step 1: Extract discordant pairs and split reads
echo "Processing $tumor_bam"
tumor_discordants_bam=${tumor_bam_name/\.bam/\.discordants\.bam}
/hpc/local/CentOS7/cog_bioinf/samtools-1.3/samtools view -b -F 1294 $run_dir/$tumor_bam > $tumor_discordants_bam
tumor_splitters_bam=${tumor_bam_name/\.bam/\.splitters\.bam}
/hpc/local/CentOS7/cog_bioinf/samtools-1.3/samtools view -h $run_dir/$tumor_bam \
| /hpc/local/CentOS7/cog_bioinf/lumpy-sv_v0.2.13/scripts/extractSplitReads_BwaMem -i stdin \
| /hpc/local/CentOS7/cog_bioinf/samtools-1.3/samtools view -Sb - > $tumor_splitters_bam
/hpc/local/CentOS7/cog_bioinf/sambamba_v0.6.0/sambamba index -t 2 $tumor_splitters_bam
# Step 1: Extract discordant pairs and split reads
echo "Processing $ref_bam"
ref_discordants_bam=${ref_bam_name/\.bam/\.discordants\.bam}
/hpc/local/CentOS7/cog_bioinf/samtools-1.3/samtools view -b -F 1294 $run_dir/$ref_bam > $ref_discordants_bam
ref_splitters_bam=${ref_bam_name/\.bam/\.splitters\.bam}
/hpc/local/CentOS7/cog_bioinf/samtools-1.3/samtools view -h $run_dir/$ref_bam \
| /hpc/local/CentOS7/cog_bioinf/lumpy-sv_v0.2.13/scripts/extractSplitReads_BwaMem -i stdin \
| /hpc/local/CentOS7/cog_bioinf/samtools-1.3/samtools view -Sb - > $ref_splitters_bam
/hpc/local/CentOS7/cog_bioinf/sambamba_v0.6.0/sambamba index -t 2 $ref_splitters_bam
# Step 2: Run lumpy
/hpc/local/CentOS7/cog_bioinf/lumpy-sv_v0.2.13/bin/lumpyexpress \
-K /hpc/local/CentOS7/cog_bioinf/lumpy-sv_v0.2.13/bin/lumpyexpress.config \
-B $run_dir/$tumor_bam,$run_dir/$ref_bam \
-S $tumor_splitters_bam,$ref_splitters_bam \
-D $tumor_discordants_bam,$ref_discordants_bam \
-o lumpy_$sample.vcf \
Accepts coordinate sorted SAM/BAM/CRAM files.
Uses htsjdk as a SAM/BAM/CRAM/VCF parsing library.
Version 1.3.4 and version 1.4 tested on same data set.
Following dependencies: Java, R, BWA, (Bedtools I think optional).
Used 12 threads.
1.3.4:
MaxVmem: 38 GB
Wallclock: 20 hrs
CPU time: 148 hrs
Total data output: 1 GB
1.4.0:
MaxVmem: 27 GB
Wallclock: 11 hrs
CPU time: 76 hrs
Total data output: 1 GB
Do not specify between 32-48 GB for memory. Use 50 GB (AK: _qsub -l h_vmem=50GB). Specify wall-clock time of 14 hours.
For running on HPC:
module load Java/1.8.0_60
module load aligners/bwa/0.7.13
module load R/3.3.0
module load bed/bedtools/2.25.0
This script is a slightly modified version of the somatic.sh from
https://github.com/PapenfussLab/gridss/blob/master/example/somatic.sh
work_dir=/PATH/TO/WORKING/DIRECTORY
NORMAL=/PATH/TO/NORMAL/SAMPLE
TUMOUR=/PATH/TO/TUMOUR/SAMPLE
REFERENCE=/PATH/TO/REFERENCE/GENOME
OUTPUT=PATH/TO/NAME/FOR/EXAMPLE/GRIDSS_somatic.sv.vcf
ASSEMBLY=${OUTPUT/.sv.vcf/.gridss.assembly.bam}
GRIDSS_JAR=PATH/TO/GRIDSS/JAR/FILE
if [[ ! -f "$NORMAL" ]] ; then
echo "Missing $NORMAL input file."
exit 1
fi
if [[ ! -f "$TUMOUR" ]] ; then
echo "Missing $TUMOUR input file."
exit 1
fi
if ! which bwa >/dev/null 2>&1 ; then
echo "Missing bwa. Please add to PATH"
exit 1
fi
if [[ ! -f "$REFERENCE" ]] ; then
echo "Missing reference genome $REFERENCE. Update the REFERENCE variable in the shell script to your hg19 location"
exit 1
fi
if [[ ! -f "$REFERENCE.bwt" ]] ; then
echo "Missing bwa index for $REFERENCE. Could not find $REFERENCE.bwt. Create an index or symlink the index files to the expected file names."
exit 1
fi
if [[ ! -f $GRIDSS_JAR ]] ; then
echo "Missing $GRIDSS_JAR. Update the GRIDSS_JAR variable in the shell script to your location"
exit 1
fi
if ! which java >/dev/null 2>&1 ; then
echo "Missing java. Please add java 1.8 or later to PATH"
exit 1
fi
JAVA_VERSION="$(java -version 2>&1 | head -1)"
if [[ ! "$JAVA_VERSION" =~ "\"1.8" ]] ; then
echo "Detected $JAVA_VERSION. GRIDSS requires Java 1.8 or later."
exit 1
fi
java -ea -Xmx31g \
-Dgridss.defensiveGC=true \
-Dsamjdk.create_index=true \
-Dsamjdk.use_async_io_read_samtools=true \
-Dsamjdk.use_async_io_write_samtools=true \
-Dsamjdk.use_async_io_write_tribble=true \
-Dsamjdk.compression_level=1 \
-cp $GRIDSS_JAR gridss.CallVariants \
TMP_DIR=/PATH/TO/TEMPORARY/DIRECTORY \
WORKER_THREADS=12 \
WORKING_DIR="$work_dir" \
REFERENCE_SEQUENCE="$REFERENCE" \
INPUT="$NORMAL" \
INPUT="$TUMOUR" \
OUTPUT="$OUTPUT" \
ASSEMBLY="$ASSEMBLY" \
2>&1 | tee -a gridss.$HOSTNAME.$$.log
Note: Sometimes GRIDSS crashes (attach error message)