Skip to content
Adam English edited this page Aug 25, 2022 · 45 revisions

Truvari phab

Truvari's comparison engine can match variants using a wide range of thresholds. However, some alleles can give rise to identical haplotypes through radically different variant representations. We could dramtically lower our thresholds to identify the match, but this would cause many variants from unidentical alleles to be falsely matched.

This problem is easiest to conceptualize in the case of 'split' variants: imagine a single 100bp DEL that can be represented as two 50bp DELs. To match these variants, we would need to loosen our thresholds to --multimatch --pctsim 0.50 --pctsize 0.50. And if the variants weren't describing 100% identical alleles (say a caller erroneously deleted an extra base to make a 101bp DEL) we would have to lower our thresholds even further.

So how do we deal with inconsistent representations? Perhaps we could build dynamic-thresholds that would raise or lower depending on measurements (e.g. variant distance, sequence context), but building a reliable model would require an expansive training dataset. Instead, a simpler approach would be to get rid of inconsistent representations.

truvari phab is designed to remove variant representation inconsistencies by reconstructing haplotypes from variants, remapping the haplotype, and recalling variants. By putting the sequences through a harmonized pipeline, we expect to remove discordance between variant representations and simplify the work needed to perform variant comparison.

truvari phab works by wrapping a number of trusted bioinformatics tools. Haplotypes from a region are reconstructed from VCFs using bcftools consensus. The sequences are then aligned using MAFFT - a fast multiple-sequence alignment software. Truvari parses the resultant MSA and calls variants to reproduce the region's VCF.

Requirements

Since truvari phab uses existing tools, it expects them to be found in the environment path. The list of tools phab will call are:

  • bcftools 1.10.2
  • vcf-sort 0.1.16
  • tabix / bgzip - 1.10.2-3
  • samtools 1.10
  • mafft v7.505

You can install each of these or you can can build a Truvari Docker container

Example

As an example, we'll use Truvari's test files in repo_utils/test_files/phab* which were created from real data over GRCh38 chr1:26399065-26401053.

  • phab_base.vcf.gz - an 86 sample squared-off pVCF
  • phab_comp.vcf.gz - a single sample's VCF
  • phab_ref.fa - a subset of the GRCh38 reference

This dataset is interesting because the HG002 sample in phab_base.vcf.gz uses the same sequencing experiment (HPRC) as the sample syndip in phab_comp.vcf.gz, but with a different pipeline. And as we will see, the pipeline can make all the difference.

To start, let's use truvari bench to see how similar the variant calls are in this region.

truvari bench --base phab_base.vcf.gz \
	--comp phab_comp.vcf.gz \
	--sizemin 1 --sizefilt 1 \
	--bSample HG002 \
	--cSample syndip \
	--unroll \
	--no-ref a \
	--output initial_bench

This will compare all variants greater than 1bp ( -S 1 -s 1 which includes SNPs) from the HG002 sample to the syndip sample, with the --unroll approach to calculating sequence similarity. We're also excluding any uncalled or reference homozygous sites with --no-ref a. The report in initial_bench/summary.txt shows:

{
    "TP-base": 5,
    "TP-call": 5,
    "FP": 2,
    "FN": 22,
    "precision": 0.7142857142857143,
    "recall": 0.18518518518518517,
    "f1": 0.2941176470588235,
}

These variants are pretty poorly matched, especially considering the HG002 and syndip samples are using the same sequencing experiment. Now, let's use truvari phab to harmonize the variants.

truvari phab --base phab_base.vcf.gz \
	--comp phab_comp.vcf.gz \
	--bSample HG002 \
	--cSample syndip \
	--reference phab_ref.fa \
	--region chr1:700-900 \
	-o phab_result/

Note that we specify the --region as 700bp-900bp because our test files are subsetted to the smaller region.

The main file phab produces we're concerned with is the phab_result/output.vcf.gz. In it we can see there are now only 9 variants. Let's run truvari bench again on the output to see how well the variants match up after running phab.

truvari bench -b phab_result/output.vcf.gz \
	-c phab_result/output.vcf.gz \
	-S 1 -s 1 \
	--unroll  \
	--no-ref a \
	--bSample HG002 \
	--cSample syndip \
	-o harmonized_bench/

Looking at harmonized_bench/summary.txt shows:

{
    "TP-base": 8,
    "TP-call": 8,
    "FP": 0,
    "FN": 0,
    "precision": 1.0,
    "recall": 1.0,
    "f1": 1.0,
}

Now there is no difference between our two sets of variants. For this variant call-set, truvri phab makes truvari bench overkill since the variants create identical haplotypes. We can benchmark simply by counting the genotypes.

$ bcftools query -f "[%GT ]\n" phab_result/output.vcf.gz | sort | uniq -c
      1 0/1 1/0
      1 1/0 0/1
      6 1/1 1/1

(We can ignore the phasing differences ((0/1 vs. 1/0). These pipelines reported the parental alleles in a different order)

MSA

If you read the truvari phab --help , you may have noticed that the --comp VCF is optional. This is by design so that we can also harmonize the variants inside a single VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population. To see this in action, let's run phab on the repo_utils/test_files/phab_base.vcf.gz

truvari phab -b phab_base.vcf.gz \
	-f phab_ref.fa \
	-r chr1:700-900 \
	-o msa_example

As a simple check, we can count the number of variants before/after phab:

zgrep -vc "#" phab_base.vcf.gz
zgrep -vc "#" msa_example/output.vcf.gz

The 278 original variants inputted into phab became just 68.

Better yet, these fewer variants occur on fewer positions:

bcftools query -f "%POS\n" phab_base.vcf.gz | sort | uniq | wc -l
bcftools query -f "%POS\n" msa_example/output.vcf.gz | sort | uniq | wc -l

This returns that the variants were over 197 positions but now sit at just 17

We can also observe changes in the allele frequency after running phab:

bcftools +fill-tags  phab_base.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c | head -n5
bcftools +fill-tags  msa_example/output.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c | head -n5

The allele-count (AC) shows a 55% reduction in singletons and removal of all variants with an AF > 0.50 which would have suggested the reference holds a minor allele.

 original   phab
   80 1     36 1
   24 2      5 2
    7 3      2 3
   11 4      3 4
    5 5      2 5
         ...
    1 114    1 35
    1 132    1 40
    1 150    1 53
    1 163    1 56
    1 164    1 81

(TODO: pull the adotto TR region annotations and run this example through truvari anno trf. I bet we'll get a nice spectrum of copy-diff of the same motif in the phab calls.)

Limitations

  • MAFFT, while very fast, is impractical for very long sequences and maybe impossible for entire human chromosomes. Therefore, truvari phab is recommended to only be run on sub-regions.
  • By giving the variants new representations, it becomes harder to count how many TP/FP calls the original pipeline created.
  • Early testing on phab is on phased variants. While it can run on unphased variants, we can't yet recommend it. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build a consensus sequence. So you can try unphased variants out, but proceed with caution.
Clone this wiki locally