-
Notifications
You must be signed in to change notification settings - Fork 49
bench
Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs.
truvari bench -b base_calls.vcf -c comp_calls.vcf -o output_dir/
Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow them such that it is a fair comparison.
For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of pctseq=0
, pctsize=.5
, pctovl=.5
, refdist=1000
may seem fair.
BioGraph and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds.
If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility.
Here is a rundown of each matching parameter.
Parameter | Default | Definition |
---|---|---|
refdist | 500 | Maximum distance comparison calls must be within from base call's start/end |
pctseq | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and comparison call. See "Comparing Sequences of Variants" below. |
pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) |
pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) |
typeignore | False | Types don't need to match to compare calls. |
Below are matching parameter diagrams to illustrate (approximately) how they work.
█ = Deletion ^ = Insertion
--refdist REFDIST (500)
Max reference location distance
ACTGATCATGAT
|--████--|
█████
Calls are within reference distance of 2
--pctsize PCTSIZE (0.7)
Min pct allele size similarity
ACTGATCATGA sizes
█████ -> 5bp
████ -> 4bp
variants have 0.8 size similarity
--pctovl PCTOVL (0.0)
Min pct reciprocal overlap
ACTGATCATGA ranges
█████ [2,7)
████ [4,8)
variants have 0.6 reciprocial overlap
--pctseq PCTSEQ (0.7)
Min percent allele sequence similarity
A-CTG-ACTG
^ ^ haplotypes
| └ACTG -> CTGACTGA
└CTGA -> CTGACTGA
haplotypes have 100% sequence similarity
Truvari bench writes the following files to the --output
directory.
File | Description |
---|---|
tp-base.vcf.gz | True positive calls form the base VCF |
tp-comp.vcf.gz | True positive calls from the comparison VCF |
fp.vcf.gz | False positive calls from comparison |
fn.vcf.gz | False negative calls from base |
summary.json | Json output of performance stats |
params.json | Json output of parameters used |
candidate.refine.bed | Bed file of regions for refine
|
log.txt | Run's log |
Stats generated by benchmarking are written to summary.json
.
Metric | Definition |
---|---|
TP-base | Number of matching calls from the base vcf |
TP-comp | Number of matching calls from the comp vcf |
FP | Number of non-matching calls from the comp vcf |
FN | Number of non-matching calls from the base vcf |
precision | TP-comp / (TP-comp + FP) |
recall | TP-base / (TP-base + FN) |
f1 | 2 * ((recall * precision) / (recall + precision)) |
base cnt | Number of calls in the base vcf |
comp cnt | Number of calls in the comp vcf |
TP-comp_TP-gt | TP-comp with genotype match |
TP-comp_FP-gt | TP-comp without genotype match |
TP-base_TP-gt | TP-base with genotype match |
TP-base_FP-gt | TP-base without genotype match |
gt_concordance | TP-comp_TP-gt / (TP-comp_TP-gt + TP-comp_FP-gt) |
gt_matrix | Base GT x Comp GT Matrix of all Base calls' best, TP match |
weighted | Metrics weighed by variant sequence/size similarity |
The gt_matrix
is a table. For example:
"gt_matrix": {
"(0, 1)": {
"(0, 1)": 500,
"(1, 1)": 10
},
"(1, 1)": {
"(1, 1)": 800,
"(0, 1)": 20
}
}
Represents ->
comp (0,1) (1,1)
base
(0,1) 500 10
(1,1) 20 800
The output vcfs are annotated with INFO fields and then sorted, compressed, and indexed inside of the output directory.
Anno | Definition |
---|---|
TruScore | Truvari score for similarity of match. ((pctseq + pctsize + pctovl) / 3 * 100)
|
PctSeqSimilarity | Pct sequence similarity between this variant and its closest match |
PctSizeSimilarity | Pct size similarity between this variant and its closest match |
PctRecOverlap | Percent reciprocal overlap percent of the two calls |
StartDistance | Distance of the base call's start from comparison call's start |
EndDistance | Distance of the base call's end from comparison call's end |
SizeDiff | Difference in size of base and comp calls |
GTMatch | Base/comp calls' Genotypes match |
MatchId | Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} See MatchIds wiki for details. |
As described in the refine wiki, a limitation of Truvari bench is 1-to-1 variant comparison. However, truvari refine
can harmonize the variants to give them more consistent representations. A bed file named candidate.refine.bed
is created by truvari bench
and holds a set of regions which may benefit from refinement. To use it, simply run
truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/
truvari refine --regions result/candidate.refine.bed \
--reference reference.fasta \
--recount --use-region-coords \
result/
See refine wiki for details.
Two SV insertions could theoretically be at the exact same position and exact same size, but have completely different sequences. Truvari's high specificity when matching variants is due in large part to comparing sequences. Truvari uses edlib to estimate sequence similarity of two alleles. In addition to comparing the direct (a.k.a. unmanipulated) similarity of sequences, truvari also checks rotations of sequences. The first rotation is based on the lexicographically minimum rotation of sequences, and the second is 'unrolled' sequences, which is formally described in this gist. These 'rolling' operations can be turned off by using --no-roll
.
The main idea of rotating sequences is that in order to move variants upstream/downstream, the reference sequence flanking the variant will need to be moved downstream/upstream respectively. Or, to say this another way, we can think of the alternate sequences as being circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. So in the 'ab' example used to describe "Reference context" below, we only need to unroll the insertion at a position by the distance between it and another variant e.g. the INS ab
at POS 2 becomes identical to the INS ba
at POS 1 by rolling 2-1 = 1
bases from the start to the end. Finally, to increase matching sensitivity, the lexicographically minimum rotation is also compared as we've found many instances of smaller (<100bp) insertions which should match but fail direct and unrolled sequence similarity, but are captured in this third rotation.
As of truvari v5.0, the unroll method is the only supported sequence comparison technique
As of truvari v5.0, some symbolic alts can be compared with sequence similarity turned on. <DEL>
, <INV>
, sequences will be filled in when a --reference
is provided. Additionally, <DUP>
variants can have their sequences filled in if --dup-to-ins
is also provided. The duplications are assumed to be perfect copies of the reference range spanned by the SV and are placed as a point insertion at the start position. Alternatively, one can turn off sequence comparison and resolving of symbolic alts by specifying --pctseq 0
.
How many matches a variant is allowed to participate in is controlled by the --pick
parameter. The available pickers are single
, ac
, and multi
.
-
single
(the default option) allows each variant to participate in up to one match. -
ac
uses the genotype allele count to control how many matches a variant can have. This means a homozygous alternate variant can participate in two matches (its GT is 1/1 so AC=2). A heterozygous variant can only participate in one match (GT 0/1, AC=1). And, a homozygous reference variant cannot be matched. Note that missing genotypes are considered reference alleles and do not add to the AC e.g. (GT ./1, AC=1). -
multi
variants can participate in all matches available.
As an example, imagine we have three variants in a pVCF with two samples we want to compare.
CHROM POS ID REF ALT base comp
chr20 17785968 ID1 A ACGCGCGCGCG 1/1 1/0
chr20 17785968 ID2 A ACGCGCGCGCGCG 0/0 0/1
chr20 17785969 ID3 C CGCGCGCGCGCGC 0/0 1/1
To compare samples inside the same vcf, we would use the command:
truvari bench -b input.vcf.gz -c input.vcf.gz -o output/ --bSample base --cSample comp --no-ref a
This VCF makes different results depending on the --pick
parameter
Parameter | ID1 State | ID2 State | ID3 State |
---|---|---|---|
single | TP | FP | FP |
ac | TP | TP | FP |
multi | TP | TP | TP |
Most SV benchmarks only report DEL and INS SVTYPEs. The flag --dup-to-ins
will interpret SVs with SVTYPE == DUP to SVTYPE == INS. Note that DUPs generally aren't sequence resolved (i.e. the ALT isn't a sequence) like INS. Therefore, --dup-to-ins
typically should be used without sequence comparison via --pctseq 0
--sizemax
is the maximum size of a base or comparison call to be considered.
--sizemin
is the minimum size of a base call to be considered.
--sizefilt
is the minimum size of a comparison call that will be matched to base calls. It can
be less than sizemin
for edge case variants.
For example: Imagine sizemin
is set at 50 and sizefilt
at 30, and a 50bp base call is 98% similar to a 49bp comparison
call at the same position.
These two calls could be considered matching. However, if we removed comparison calls less than sizemin
,
we'd incorrectly classify the 50bp base call as a false negative. Instead, we allow comparison calls between [sizefilt,sizemin)
to find matches.
This has the side effect of artificially inflating specificity. For example, if that same 49bp call described
above were below the similarity threshold, it would not be classified as a FP since it is below the sizemin
threshold. So we're giving the call a better chance to be useful and less chance to be detrimental
to final statistics.
If an --includebed
is provided, only base and comp calls contained within the defined regions are used
for comparison. This is similar to pre-filtering your base/comp calls using:
(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz
with the exception that Truvari requires the start and the end to be contained in the same includebed region
whereas bedtools intersect
does not.
If an --includebed
is not provided, the comparison is restricted to only the contigs present in the base VCF
header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary
statistics and will not be present in any output vcfs.
The option --extend
extends the regions of interest (set in --includebed
argument) by the given number of bases on each side, allowing base variants to match comparison variants that are just outside of the original region. If a comparison variant is in the extended regions it can potentially match a base variant that is in the original regions turning it to TP. Comparison variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching where only the base variants longer than sizemin (equal to 50 by default) are considered, but they are allowed to match shorter comparison variants sizefilt (30bp by default) or longer.
See this discussionfor details.
Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs.
* zip the Base and Comp calls together in sorted order
* create chunks of all calls overlapping within ±`--chunksize` basepairs
* make a |BaseCall| x |CompCall| match matrix for each chunk
* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds
* if the chunk has no Base or Comp calls
** return them all as FNs/FPs
* use `--pick` method to sort and annotate variants with their best match