In this case study, we describe applying DeepTrio to a real WGS trio. Then we
assess the quality of the DeepTrio variant calls with hap.py
. In addition we
evaluate a mendelian violation rate for a merged VCF.
To make it faster to run over this case study, we run only on chromosome 20.
Docker will be used to run DeepTrio and hap.py,
We will be using GRCh38 for this case study.
mkdir -p reference
FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > reference/GRCh38_no_alt_analysis_set.fasta
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > reference/GRCh38_no_alt_analysis_set.fasta.fai
We will benchmark our variant calls against v4.2.1 of the Genome in a Bottle small variant benchmarks for HG002, HG003, and HG004 trio.
mkdir -p benchmark
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio
curl ${FTPDIR}/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
curl ${FTPDIR}/HG003_NA24149_father/NISTv4.2.1/GRCh38/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_NA24149_father/NISTv4.2.1/GRCh38/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_NA24149_father/NISTv4.2.1/GRCh38/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
curl ${FTPDIR}/HG004_NA24143_mother/NISTv4.2.1/GRCh38/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG004_NA24143_mother/NISTv4.2.1/GRCh38/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG004_NA24143_mother/NISTv4.2.1/GRCh38/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
We'll use HG002, HG003, HG004 Illumina WGS reads publicly available from the PrecisionFDA Truth v2 Challenge.
mkdir -p input
HTTPDIR=https://storage.googleapis.com/deepvariant/case-study-testdata
curl ${HTTPDIR}/HG002.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam > input/HG002.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam
curl ${HTTPDIR}/HG002.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam.bai > input/HG002.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam.bai
curl ${HTTPDIR}/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam > input/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam
curl ${HTTPDIR}/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam.bai > input/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam.bai
curl ${HTTPDIR}/HG004.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam > input/HG004.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam
curl ${HTTPDIR}/HG004.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam.bai > input/HG004.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam.bai
DeepTrio pipeline consists of 4 steps: make_examples
, call_variants
,
postprocess_variants
and GLnexus merge
. It is possible to run DeepTrio with
one command using the run_deepvariant
script. GLnexus is run as a separate
command.
mkdir -p output
mkdir -p output/intermediate_results_dir
BIN_VERSION=1.3.0
time sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:deeptrio-"${BIN_VERSION}" \
/opt/deepvariant/bin/deeptrio/run_deeptrio \
--model_type WGS \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads_child /input/HG002.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam \
--reads_parent1 /input/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam \
--reads_parent2 /input/HG004.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam \
--output_vcf_child /output/HG002.output.vcf.gz \
--output_vcf_parent1 /output/HG003.output.vcf.gz \
--output_vcf_parent2 /output/HG004.output.vcf.gz \
--sample_name_child 'HG002' \
--sample_name_parent1 'HG003' \
--sample_name_parent2 'HG004' \
--num_shards $(nproc) \
--regions chr20 \
--intermediate_results_dir /output/intermediate_results_dir \
--output_gvcf_child /output/HG002.g.vcf.gz \
--output_gvcf_parent1 /output/HG003.g.vcf.gz \
--output_gvcf_parent2 /output/HG004.g.vcf.gz
By specifying --model_type WGS
, you'll be using a model that is best suited
for Illumina Whole Genome Sequencing data.
--intermediate_results_dir
flag is optional. By specifying it, the
intermediate outputs of make_examples
and call_variants
stages can be found
in the directory. After the command, you can find these files in the directory:
call_variants_output_child.tfrecord.gz
call_variants_output_parent1.tfrecord.gz
call_variants_output_parent2.tfrecord.gz
gvcf_child.tfrecord-?????-of-?????.gz
gvcf_parent1.tfrecord-?????-of-?????.gz
gvcf_parent2.tfrecord-?????-of-?????.gz
make_examples_child.tfrecord-?????-of-?????.gz
make_examples_parent1.tfrecord-?????-of-?????.gz
make_examples_parent2.tfrecord-?????-of-?????.gz
For running on GPU machines, or using Singularity instead of Docker, see Quick Start.
At this step we take all 3 VCFs generated in the previous step and merge them using GLnexus.
# bcftools and bgzip are now included in our docker images.
# You can also install them separately.
sudo docker run \
-v "${PWD}/output":"/output" \
quay.io/mlin/glnexus:v1.2.7 \
/usr/local/bin/glnexus_cli \
--config DeepVariant_unfiltered \
/output/HG002.g.vcf.gz \
/output/HG003.g.vcf.gz \
/output/HG004.g.vcf.gz \
| sudo docker run -i google/deepvariant:deeptrio-"${BIN_VERSION}" \
bcftools view - \
| sudo docker run -i google/deepvariant:deeptrio-"${BIN_VERSION}" \
bgzip -c > output/HG002_trio_merged.vcf.gz
After completion of GLnexus command we should have a new merged VCF file in the output directory.
HG002_trio_merged.vcf.gz
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/reference":"/reference" \
realtimegenomics/rtg-tools format \
-o /reference/GRCh38_no_alt_analysis_set.sdf "/reference/GRCh38_no_alt_analysis_set.fasta"
FILE="reference/trio.ped"
cat <<EOM >$FILE
#PED format pedigree
#
#fam-id/ind-id/pat-id/mat-id: 0=unknown
#sex: 1=male; 2=female; 0=unknown
#phenotype: -9=missing, 0=missing; 1=unaffected; 2=affected
#
#fam-id ind-id pat-id mat-id sex phen
1 HG002 HG003 HG004 1 0
1 HG003 0 0 1 0
1 HG004 0 0 2 0
EOM
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/output":"/output" \
realtimegenomics/rtg-tools mendelian \
-i "/output/HG002_trio_merged.vcf.gz" \
-o "/output/HG002_trio_annotated.output.vcf.gz" \
--pedigree=/reference/trio.ped \
-t /reference/GRCh38_no_alt_analysis_set.sdf \
| tee output/deepvariant.input_rtg_output.txt
As a result we should get the following output:
Checking: /output/HG002_trio_merged.vcf.gz
Family: [HG003 + HG004] -> [HG002]
81 non-pass records were skipped
Concordance HG002: F:137455/138929 (98.94%) M:137567/139104 (98.90%) F+M:134901/137492 (98.12%)
Sample HG002 has less than 99.0 concordance with both parents. Check for incorrect pedigree or sample mislabelling.
0/144439 (0.00%) records did not conform to expected call ploidy
141835/144439 (98.20%) records were variant in at least 1 family member and checked for Mendelian constraints
3841/141835 (2.71%) records had indeterminate consistency status due to incomplete calls
2961/141835 (2.09%) records contained a violation of Mendelian constraints
mkdir -p happy
sudo docker pull jmcdani20/hap.py:v0.3.12
sudo docker run \
-v "${PWD}/benchmark":"/benchmark" \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/happy:/happy" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
/output/HG002.output.vcf.gz \
-f /benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-r /reference/GRCh38_no_alt_analysis_set.fasta \
-o /happy/HG002.output \
--engine=vcfeval \
--pass-only \
-l chr20
sudo docker run \
-v "${PWD}/benchmark":"/benchmark" \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/happy:/happy" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
/output/HG003.output.vcf.gz \
-f /benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-r /reference/GRCh38_no_alt_analysis_set.fasta \
-o /happy/HG003.output \
--engine=vcfeval \
--pass-only \
-l chr20
sudo docker run \
-v "${PWD}/benchmark":"/benchmark" \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/happy:/happy" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
/output/HG004.output.vcf.gz \
-f /benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-r /reference/GRCh38_no_alt_analysis_set.fasta \
-o /happy/HG004.output \
--engine=vcfeval \
--pass-only \
-l chr20
Benchmarking Summary for HG002:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 11256 11209 47 21264 16 9609 10 5 0.995824 0.998627 0.451891 0.997224 NaN NaN 1.561710 2.081021
INDEL PASS 11256 11209 47 21264 16 9609 10 5 0.995824 0.998627 0.451891 0.997224 NaN NaN 1.561710 2.081021
SNP ALL 71333 71063 270 87471 30 16324 4 3 0.996215 0.999578 0.186622 0.997894 2.314904 2.048981 1.715978 1.715178
SNP PASS 71333 71063 270 87471 30 16324 4 3 0.996215 0.999578 0.186622 0.997894 2.314904 2.048981 1.715978 1.715178
Benchmarking Summary for HG003:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10628 10588 40 21060 16 9999 12 4 0.996236 0.998553 0.474786 0.997394 NaN NaN 1.748961 2.256561
INDEL PASS 10628 10588 40 21060 16 9999 12 4 0.996236 0.998553 0.474786 0.997394 NaN NaN 1.748961 2.256561
SNP ALL 70166 69928 238 85229 47 15216 18 2 0.996608 0.999329 0.178531 0.997967 2.296566 2.066067 1.883951 1.875008
SNP PASS 70166 69928 238 85229 47 15216 18 2 0.996608 0.999329 0.178531 0.997967 2.296566 2.066067 1.883951 1.875008
Benchmarking Summary for HG004:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 11000 10953 47 21465 22 10003 15 5 0.995727 0.998081 0.466014 0.996903 NaN NaN 1.792709 2.331949
INDEL PASS 11000 10953 47 21465 22 10003 15 5 0.995727 0.998081 0.466014 0.996903 NaN NaN 1.792709 2.331949
SNP ALL 71659 71428 231 86370 44 14848 10 5 0.996776 0.999385 0.171912 0.998079 2.310073 2.069269 1.878340 1.775156
SNP PASS 71659 71428 231 86370 44 14848 10 5 0.996776 0.999385 0.171912 0.998079 2.310073 2.069269 1.878340 1.775156