Skip to content
Adam English edited this page Jan 17, 2023 · 72 revisions

Quick start

Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs.

truvari bench -b base_calls.vcf -c comp_calls.vcf -o output_dir/

Outputs

Truvari bench creates an output directory with the following files.

File Description
tp-base.vcf Annotated true positive calls form the base VCF
tp-comp.vcf Annotated true positive calls from the comparison VCF
fp.vcf False positive calls from comparison
fn.vcf False negative calls from base
summary.json Json output of performance stats
params.json Json output of parameters used
log.txt Run's logging output

summary.json

The following stats are generated by benchmarking and written to summary.json as a json dump.

Metric Definition
TP-base Number of matching calls from the base vcf
TP-comp Number of matching calls from the comp vcf
FP Number of non-matching calls from the comp vcf
FN Number of non-matching calls from the base vcf
precision TP-comp / (TP-comp + FP)
recall TP-base / (TP-base + FN)
f1 2 * ((recall * precision) / (recall + precision))
base cnt Number of calls in the base vcf
comp cnt Number of calls in the comp vcf
TP-comp_TP-gt TP-comp with genotype match
TP-comp_FP-gt TP-comp without genotype match
TP-base_TP-gt TP-base with genotype match
TP-base_FP-gt TP-base without genotype match
gt_concordance TP-comp_TP-gt / (TP-comp_TP-gt + TP-comp_FP-gt)
gt_matrix Base GT x Comp GT Matrix of all Base calls' best, TP match

Matching Parameters

Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow them such that it is a fair comparison.

For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of pctseq=0, pctsize=.5, pctovl=.5, refdist=1000 may seem fair.

BioGraph and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds.

If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility (see the json at the top of your log.txt).

Here is a rundown of each matching parameter.

Parameter Default Definition
refdist 500 Maximum distance comparison calls must be within from base call's start/end
pctseq 0.7 Edit distance ratio between the REF/ALT haplotype sequences of base and
comparison call. See "Comparing Sequences of Variants" below.
pctsize 0.7 Ratio of min(base_size, comp_size)/max(base_size, comp_size)
pctovl 0.0 Ratio of two calls' (overlapping bases)/(longest span)
typeignore False Types don't need to match to compare calls.

Matching Parameter Diagrams

Below are matching parameter diagrams to illustrate (approximately) how they work.

 █ = Deletion ^ = Insertion

--refdist REFDIST (500)
  Max reference location distance

    ACTGATCATGAT
     |--████--|    
          █████      
  
  Calls are within reference distance of 2

--pctsize PCTSIZE (0.7)
  Min pct allele size similarity

    ACTGATCATGA    sizes
      █████     -> 5bp
        ████    -> 4bp

  variants have 0.8 size similarity


--pctovl PCTOVL (0.0)
  Min pct reciprocal overlap

    ACTGATCATGA  ranges
      █████      [2,7)
        ████     [4,8)

  variants have 0.6 reciprocial overlap


--pctseq PCTSEQ (0.7)
  Min percent allele sequence similarity

    A-CTG-ACTG
     ^   ^       haplotypes
     |   └ACTG -> CTGACTGA
     └CTGA     -> CTGACTGA

  haplotypes have 100% sequence similarity

Comparing VCFs without sequence resolved calls

If the base or comp vcfs do not have sequence resolved calls, simply set --pctseq=0 to turn off sequence comparison. The --reference does not need to be set when not using sequence comparison.

Difference between --sizemin and --sizefilt

--sizemin is the minimum size of a base call to be considered.

--sizefilt is the minimum size of a comparison call that will be matched to base calls. It should be less than sizemin for edge case variants.

For example: sizemin is 50 and sizefilt is 30. A 50bp base call is 98% similar to a 49bp comparison call at the same position.

These two calls should be considered matching. If we instead removed comparison calls less than sizemin, we'd incorrectly classify the 50bp base call as a false negative.

This does have the side effect of artificially inflating specificity. If that same 49bp call in the above were below the similarity threshold, it would not be classified as a FP due to the sizemin threshold. So we're giving the call a better chance to be useful and less chance to be detrimental to final statistics.

Definition of annotations added to vcfs

Anno Definition
TruScore Truvari score for similarity of match. ((pctseq + pctsize + pctovl) / 3 * 100)
PctSeqSimilarity Pct sequence similarity between this variant and its closest match
PctSizeSimilarity Pct size similarity between this variant and its closest match
PctRecOverlap Percent reciprocal overlap percent of the two calls
StartDistance Distance of the base call's start from comparison call's start
EndDistance Distance of the base call's end from comparison call's end
SizeDiff Difference in size of base and comp calls
GTMatch Base/comp calls' Genotypes match
MatchId Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} See [[MatchIds wiki

Include Bed & VCF Header Contigs

If an --includebed is provided, only base and comp calls contained within the defined regions are used for comparison. This is similar to pre-filtering your base/comp calls using:

(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz

with the exception that Truvari requires the start and the end to be contained in the same includebed region whereas bedtools intersect does not.

If an --includebed is not provided, the comparison is restricted to only the contigs present in the base VCF header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary statistics and will not be present in any output vcfs.

Extending an Include Bed

The option --extend extends the regions of interest (set in --includebed argument) by the given number of bases on each side, allowing base variants to match comparison variants that are just outside of the original region. If a comparison variant is in the extended regions it can potentially match a base variant that is in the original regions turning it to TP. Comparison variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching where only the base variants longer than sizemin (equal to 50 by default) are considered, but they are allowed to match shorter comparison variants sizefilt (30bp by default) or longer.

See this discussionfor details.

Comparing Sequences of Variants

While many SV overlapping approaches match variants with metrics such as reciprocal overlap and start/end distances, few account for the possibility to represent the same sequence change in multiple ways.

Truvari has implemented two approaches to compare variant sequences. The original method is to use the reference context of a pair of variants . The new method is called 'unroll'.

Reference context

For the reference context method, consider a hypothetical tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this variant would be:

#POS INS  Haplotype
  0  ab   abAB
  1  ba   AbaB
  2  ab   ABab

Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' min(starts):max(ends) before making the the sequence change introduced by the variants. In python, this line looks like:

hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq

Where a1_seq1 is the longer of the REF or ALT allele.

Unroll

The method of giving a pair of calls the same reference context can be achieved using an 'unroll' method. For a formal description, see this gist.

The main idea is that in order to move variants upstream/downstream, the reference sequence flanking the variant will need to be moved downstream/upstream respectively. Or, to say this another way, we can think of the alternate sequences as being circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. So in our 'ab' example above, we only need to unroll the insertion at a position by the distance between it and another variant e.g. the INS ab at POS 2 becomes identical to the INS ba at POS 1 by rolling 2-1 = 1 bases from the start to the end.

This unroll method has a number of benefits and a couple of considerations.

  • Unroll does not need a --reference for comparison, which saves I/O time.
  • In the limited testing performed so far, unroll increase the number of matching SVs
  • Unroll seems to decrease the number of 'suspect' matches in smaller size regimes
  • The pattern between PctSizeSimilarity and PctSeqSimilarity is simpler
  • The PctSeqSimilarity of variant pairs reported by the reference context method is higher than the unroll method. This is mainly due to reference context 'inflating' the similarity (i.e. adding identical sequence to both variants in the pair naturally increases their identity). However, a majority of pairs seem to have identical similarity reported by either method.

Methodology

Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs.

* zip the Base and Comp calls together in sorted order
* create chunks of all calls overlapping within ±`--chunksize` basepairs
* make a |BaseCall| x |CompCall| match matrix for each chunk
* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds 
* if the chunk has no Base or Comp calls
** return them all as FNs/FPs 
* use `--pick` method to sort and annotate variants with their best match

--dup-to-ins

Most SV benchmarks only report DEL and INS SVTYPEs. The flag --dup-to-ins will interpret SVs with SVTYPE == DUP to SVTYPE == INS. Note that DUPs generally aren't sequence resolved (i.e. the ALT isn't a sequence) like INS. Therefore, --dup-to-ins typically should be used without sequence comparison via --pctseq 0

Controlling the number of matches

How many matches a variant is allowed to participate in is controlled by the --pick parameter. The available pickers are single, gtcomp, and multi.

  • single (the default option) allows each variant to participate in up to one match.
  • ac uses the genotype allele count to control how many matches a variant can have. This means a homozygous alternate variant can participate in two matches (it's GT is 1/1 so AC==2). A heterozygous variant can only participate in one match (GT 0/1 = AC 1). And, a homozygous reference variant cannot be matched. Note that missing genotypes are considered reference alleles and do not add to the AC e.g. GT ./1 = AC 1.
  • multi variants can participate in as many matches as they want.

As an example, imagine we have three variants in a pVCF with two samples we want to compare.

CHROM POS      ID  REF ALT           base comp
chr20 17785968 ID1 A   ACGCGCGCGCG   1/1  1/0
chr20 17785968 ID2 A   ACGCGCGCGCGCG 0/0  0/1
chr20 17785969 ID3 C   CGCGCGCGCGCGC 0/0  1/1

To compare samples inside the same vcf, we'd use the command:

truvari bench -b input.vcf.gz -c input.vcf.gz -o output/ --bSample base --cSample comp --no-ref a

With single, ID1 from base would match to ID1 from comp, and ID2 and ID3 would be false positives. With ac ID1 is a TP for both base and comp while ID2 would become a true positive. With multi, all variants have a corresponding true positive match.

Clone this wiki locally