-
Notifications
You must be signed in to change notification settings - Fork 48
bench
Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs.
truvari bench -b base_calls.vcf -c comp_calls.vcf -o output_dir/
Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow them such that it is a fair comparison.
For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of pctseq=0
, pctsize=.5
, pctovl=.5
, refdist=1000
may seem fair.
BioGraph and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds.
If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility (see the json at the top of your log.txt).
Here is a rundown of each matching parameter.
Parameter | Default | Definition |
---|---|---|
refdist | 500 | Maximum distance comparison calls must be within from base call's start/end |
pctseq | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and comparison call. See "Comparing Sequences of Variants" below. |
pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) |
pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) |
typeignore | False | Types don't need to match to compare calls. |
Below are matching parameter diagrams to illustrate (approximately) how they work.
█ = Deletion ^ = Insertion
--refdist REFDIST (500)
Max reference location distance
ACTGATCATGAT
|--████--|
█████
Calls are within reference distance of 2
--pctsize PCTSIZE (0.7)
Min pct allele size similarity
ACTGATCATGA sizes
█████ -> 5bp
████ -> 4bp
variants have 0.8 size similarity
--pctovl PCTOVL (0.0)
Min pct reciprocal overlap
ACTGATCATGA ranges
█████ [2,7)
████ [4,8)
variants have 0.6 reciprocial overlap
--pctseq PCTSEQ (0.7)
Min percent allele sequence similarity
A-CTG-ACTG
^ ^ haplotypes
| └ACTG -> CTGACTGA
└CTGA -> CTGACTGA
haplotypes have 100% sequence similarity
Truvari bench creates an output directory with the following files.
File | Description |
---|---|
tp-base.vcf | Annotated true positive calls form the base VCF |
tp-comp.vcf | Annotated true positive calls from the comparison VCF |
fp.vcf | False positive calls from comparison |
fn.vcf | False negative calls from base |
summary.json | Json output of performance stats |
params.json | Json output of parameters used |
log.txt | Run's logging output |
The following stats are generated by benchmarking and written to summary.json as a json dump.
Metric | Definition |
---|---|
TP-base | Number of matching calls from the base vcf |
TP-comp | Number of matching calls from the comp vcf |
FP | Number of non-matching calls from the comp vcf |
FN | Number of non-matching calls from the base vcf |
precision | TP-comp / (TP-comp + FP) |
recall | TP-base / (TP-base + FN) |
f1 | 2 * ((recall * precision) / (recall + precision)) |
base cnt | Number of calls in the base vcf |
comp cnt | Number of calls in the comp vcf |
TP-comp_TP-gt | TP-comp with genotype match |
TP-comp_FP-gt | TP-comp without genotype match |
TP-base_TP-gt | TP-base with genotype match |
TP-base_FP-gt | TP-base without genotype match |
gt_concordance | TP-comp_TP-gt / (TP-comp_TP-gt + TP-comp_FP-gt) |
gt_matrix | Base GT x Comp GT Matrix of all Base calls' best, TP match |
The output vcfs are annotated with INFO fields described above and then sorted, compressed, and indexed inside of the output directory.
Anno | Definition |
---|---|
TruScore | Truvari score for similarity of match. ((pctseq + pctsize + pctovl) / 3 * 100)
|
PctSeqSimilarity | Pct sequence similarity between this variant and its closest match |
PctSizeSimilarity | Pct size similarity between this variant and its closest match |
PctRecOverlap | Percent reciprocal overlap percent of the two calls |
StartDistance | Distance of the base call's start from comparison call's start |
EndDistance | Distance of the base call's end from comparison call's end |
SizeDiff | Difference in size of base and comp calls |
GTMatch | Base/comp calls' Genotypes match |
MatchId | Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} See MatchIds wiki for details. |
While many SV overlapping approaches match variants with metrics such as reciprocal overlap and start/end distances, few account for the possibility to represent the same sequence change in multiple ways.
Truvari has implemented two approaches to compare variant sequences. The original method is to use the reference context of a pair of variants . The new method is called 'unroll'.
For the reference context method, consider a hypothetical tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this variant would be:
#POS INS Haplotype
0 ab abAB
1 ba AbaB
2 ab ABab
Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls'
min(starts):max(ends)
before making the the sequence change introduced by the variants. In python, this line
looks like:
hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq
Where a1_seq1
is the longer of the REF or ALT allele.
The method of giving a pair of calls the same reference context can be achieved using an 'unroll' method. For a formal description, see this gist.
The main idea is that in order to move variants upstream/downstream, the reference sequence flanking the variant will need to be moved downstream/upstream respectively. Or, to say this another way, we can think of the alternate sequences as being circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. So in our 'ab' example above, we only need to unroll the insertion at a position by the distance between it and another variant e.g. the INS ab
at POS 2 becomes identical to the INS ba
at POS 1 by rolling 2-1 = 1
bases from the start to the end.
This unroll method has a number of benefits and a couple of considerations.
- Unroll does not need a
--reference
for comparison, which saves I/O time. - In the limited testing performed so far, unroll increase the number of matching SVs
- Unroll seems to decrease the number of 'suspect' matches in smaller size regimes
- The pattern between PctSizeSimilarity and PctSeqSimilarity is simpler
- The PctSeqSimilarity of variant pairs reported by the reference context method is higher than the unroll method. This is mainly due to reference context 'inflating' the similarity (i.e. adding identical sequence to both variants in the pair naturally increases their identity). However, a majority of pairs seem to have identical similarity reported by either method.
If the base or comp vcfs do not have sequence resolved calls, simply set --pctseq=0
to turn off
sequence comparison. The --reference
does not need to be set when not using sequence comparison.
How many matches a variant is allowed to participate in is controlled by the --pick
parameter. The available pickers are single
, ac
, and multi
.
-
single
(the default option) allows each variant to participate in up to one match. -
ac
uses the genotype allele count to control how many matches a variant can have. This means a homozygous alternate variant can participate in two matches (it's GT is 1/1 so AC==2). A heterozygous variant can only participate in one match (GT 0/1 = AC 1). And, a homozygous reference variant cannot be matched. Note that missing genotypes are considered reference alleles and do not add to the AC e.g. GT ./1 = AC 1. -
multi
variants can participate in as many matches as they want.
As an example, imagine we have three variants in a pVCF with two samples we want to compare.
CHROM POS ID REF ALT base comp
chr20 17785968 ID1 A ACGCGCGCGCG 1/1 1/0
chr20 17785968 ID2 A ACGCGCGCGCGCG 0/0 0/1
chr20 17785969 ID3 C CGCGCGCGCGCGC 0/0 1/1
To compare samples inside the same vcf, we'd use the command:
truvari bench -b input.vcf.gz -c input.vcf.gz -o output/ --bSample base --cSample comp --no-ref a
This VCF makes different results depending on the --pick
parameter
Parameter | ID1 State | ID2 State | ID3 State |
---|---|---|---|
single | TP | FP | FP |
ac | TP | TP | FP |
multi | TP | TP | TP |
Most SV benchmarks only report DEL and INS SVTYPEs. The flag --dup-to-ins
will interpret SVs with SVTYPE == DUP to SVTYPE == INS. Note that DUPs generally aren't sequence resolved (i.e. the ALT isn't a sequence) like INS. Therefore, --dup-to-ins
typically should be used without sequence comparison via --pctseq 0
--sizemin
is the minimum size of a base call to be considered.
--sizefilt
is the minimum size of a comparison call that will be matched to base calls. It should
be less than sizemin
for edge case variants.
For example: sizemin
is 50 and sizefilt
is 30. A 50bp base call is 98% similar to a 49bp comparison
call at the same position.
These two calls should be considered matching. If we instead removed comparison calls less than sizemin
,
we'd incorrectly classify the 50bp base call as a false negative.
This does have the side effect of artificially inflating specificity. If that same 49bp call in the
above were below the similarity threshold, it would not be classified as a FP due to the sizemin
threshold. So we're giving the call a better chance to be useful and less chance to be detrimental
to final statistics.
If an --includebed
is provided, only base and comp calls contained within the defined regions are used
for comparison. This is similar to pre-filtering your base/comp calls using:
(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz
with the exception that Truvari requires the start and the end to be contained in the same includebed region
whereas bedtools intersect
does not.
If an --includebed
is not provided, the comparison is restricted to only the contigs present in the base VCF
header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary
statistics and will not be present in any output vcfs.
The option --extend
extends the regions of interest (set in --includebed
argument) by the given number of bases on each side, allowing base variants to match comparison variants that are just outside of the original region. If a comparison variant is in the extended regions it can potentially match a base variant that is in the original regions turning it to TP. Comparison variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching where only the base variants longer than sizemin (equal to 50 by default) are considered, but they are allowed to match shorter comparison variants sizefilt (30bp by default) or longer.
See this discussionfor details.
Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs.
* zip the Base and Comp calls together in sorted order
* create chunks of all calls overlapping within ±`--chunksize` basepairs
* make a |BaseCall| x |CompCall| match matrix for each chunk
* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds
* if the chunk has no Base or Comp calls
** return them all as FNs/FPs
* use `--pick` method to sort and annotate variants with their best match