Similarity of True Positives #78

ACEnglish · 2021-08-29T21:33:53Z

ACEnglish
Aug 29, 2021
Maintainer

One measure of SV call accuracy beyond precision, recall, and concordance is how similar the comparison calls are to the base calls they match. Truvari reports PctSeqSimilarity and PctSizeSimilarity for True Positives. The higher these values, the more accurately the comparison calls represent the base calls.
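
To make the two metrics concrete, here is a minimal sketch of what a size score and a sequence score between a base and a comparison call could look like. This is an illustration, not Truvari's implementation: the function names are hypothetical, size similarity is assumed to be the ratio of the shorter to the longer length, and `difflib.SequenceMatcher` is only a stand-in for whatever sequence comparison Truvari actually uses.

```python
from difflib import SequenceMatcher


def size_similarity(base_len: int, comp_len: int) -> float:
    """Ratio of the shorter SV length to the longer one (assumed definition)."""
    base_len, comp_len = abs(base_len), abs(comp_len)
    if max(base_len, comp_len) == 0:
        return 1.0  # two zero-length events are treated as identical here
    return min(base_len, comp_len) / max(base_len, comp_len)


def seq_similarity(base_seq: str, comp_seq: str) -> float:
    """Stand-in sequence similarity in [0, 1] using difflib (not Truvari's method)."""
    return SequenceMatcher(None, base_seq.upper(), comp_seq.upper()).ratio()


print(size_similarity(68, 68))  # equal lengths give 1.0
print(size_similarity(68, 34))  # half the length gives 0.5
print(seq_similarity("ACGTACGT", "ACGTACGA"))
```

Both scores land on the same 0-1 scale as the Pct* columns, which is why the plotting code multiplies by 100 before drawing.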

Below is the plot of an HG002 sample run through BioGraph and compared to GIAB Tier1 SVs.

import joblib
import truvari
import seaborn as sb
import matplotlib.pyplot as plt
sb.set()
# Load the data made by `truvari bench --giabreport`.
# Any `truvari bench` run put through `truvari vcf2df -d` would also work
data = joblib.load("bench_rep2/giab.jl")

# Only tpbase and tp have these fields added
tps = data["state"] == "tpbase"

# Scale from 0-1 to 0-100.
data["SizeSimilarity"] = data["PctSizeSimilarity"] * 100
data["SeqSimilarity"] = data["PctSeqSimilarity"] * 100

# The --pctseq --pctsize arguments default to .7.
# This example data plots better if we set the xlim to (90,100)
lower_limit = 90

# Make subplots
fig, axs = plt.subplots(nrows=2)
p = sb.histplot(data=data[tps], x="SeqSimilarity", binwidth=1,
                hue="svtype", hue_order=["DEL", "INS"], multiple='dodge',
                ax=axs[0])
# Both rows use the same x-axis range, so hide the tick labels on the first one
axs[0].get_xaxis().set_ticklabels([])

# Since we're trimming the xlim, include information about how many we're not plotting
lt_cnt = (data[tps]["SeqSimilarity"] < lower_limit).sum()
p.set(title=f"TP Sequence Similarity ({lt_cnt} < {lower_limit}%)", xlim=(lower_limit, 101), xlabel="")

p = sb.histplot(data=data[tps], x="SizeSimilarity", binwidth=1,
                hue="svtype", hue_order=["DEL", "INS"], multiple='dodge',
                ax=axs[1])
lt_cnt = (data[tps]["SizeSimilarity"] < lower_limit).sum()
axs[1].legend().remove() # The two rows have the same legend, so don't plot the second one
p.set(title=f"TP Size Similarity ({lt_cnt} < {lower_limit}%)", xlim=(lower_limit, 101), xlabel="Percent")
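
Because the plot trims the x-axis, it can also help to tabulate how many TPs fall below the cutoff per SV type instead of only reporting the total in the title. Here is a minimal sketch on a toy DataFrame: the helper name `count_below` and the values are made up for illustration, and only the column names come from the code above.

```python
import pandas as pd


def count_below(df: pd.DataFrame, column: str, limit: float) -> pd.Series:
    """Count rows per svtype whose `column` value is below `limit`."""
    below = df[df[column] < limit]
    return below.groupby("svtype").size()


# Toy stand-in for data[tps]; these numbers are illustrative, not real results
toy = pd.DataFrame({
    "svtype": ["DEL", "DEL", "INS", "INS", "INS"],
    "SeqSimilarity": [99.1, 85.0, 97.6, 72.3, 88.8],
})

hidden = count_below(toy, "SeqSimilarity", 90)
print(hidden)
```

With the real data, `count_below(data[tps], "SeqSimilarity", lower_limit)` would give the per-type breakdown of the calls that the trimmed axis hides.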

ACEnglish · 2021-10-18T00:54:29Z

ACEnglish
Oct 18, 2021
Maintainer Author

With the new annotation of FPs as of v3.1.0, we can do even more with this type of analysis.

To start, we need to make the data. Here, we'll compare a BioGraph result with calls from a long-read, haplotype resolved assembly.

truvari bench -c HG00514.grch38.vcf.gz -b HG00514.strict.vcf.gz -o bench -f grch38.fa --multimatch
truvari vcf2df -b -i bench/ bench/data.jl

Then, inside of a script or a notebook:

import joblib
import pandas as pd
import seaborn as sb

sb.set()
data = joblib.load("bench/data.jl")
p = sb.histplot(data=data.reset_index(), x="PctSeqSimilarity", 
                hue="state", multiple="stack", binwidth=0.05, 
                hue_order=["tp", "fp", "fn"])
p.set(title="Truvari PctSeqSimilarity Distribution")

Note we don't add "tpbase" to the hue_order as these data points can be redundant with "tp".

While a majority of True Positives have >= 95% Sequence Similarity, there are still some False Negatives that have a comparison call with high similarity. Let's investigate one of them:

view_columns = ["svtype", "svlen", "TruScore", "PctSeqSimilarity", "PctSizeSimilarity", "PctRecOverlap", 
                "StartDistance", "EndDistance", "SizeDiff", "GTMatch", "MatchId", "state"]
data[(data["state"] == "fn") & (data["PctSeqSimilarity"] >= 0.95)].head(1)[view_columns]

Returns:

key                     svtype  svlen   TruScore  PctSeqSimilarity  PctSizeSimilarity  PctRecOverlap  StartDistance  EndDistance  SizeDiff  GTMatch  MatchId  state
chr10:1460830-1460899.A DEL     68      73.0      0.976411          1.0                0.0            890.0          890.0        0.0       True     1468.1.0 fn

We can see that this FN does have a comparison call with high Sequence Similarity (97.6%). However, the Start/End Distance values show that the comparison call is 890bp upstream, which is greater than the --refdist 500 default threshold.
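
The distance reasoning can be written down as a small check. This is a simplified sketch, not Truvari's matching logic: the function name is hypothetical, and treating the threshold as inclusive on both breakpoints is an assumption.

```python
def within_refdist(start_distance: float, end_distance: float, refdist: int = 500) -> bool:
    """True when both breakpoints are within `refdist` bp (simplified check)."""
    return abs(start_distance) <= refdist and abs(end_distance) <= refdist


# Distances reported for the FN above
print(within_refdist(890.0, 890.0))
# A hypothetical nearby call for contrast
print(within_refdist(-32.0, -32.0))
```

Under this check the FN's best candidate fails on distance alone, regardless of how similar its sequence is.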

Let's look at the neighborhood of calls within --chunkdist of this FN by using the MatchId 1468.1.0.

# na=False keeps the boolean mask valid if any MatchId values are missing
data[data["MatchId"].str.startswith("1468.", na=False)][view_columns]

Returns:

key                      svtype  svlen  TruScore  PctSeqSimilarity  PctSizeSimilarity  PctRecOverlap  StartDistance  EndDistance  SizeDiff  GTMatch  MatchId   state
chr10:1459908-1459977.C  DEL     68     87.0      0.990566          1.0                0.536232       -32.0          -32.0        0.0       True     1468.0.0  tpbase
chr10:1459940-1460009.T  DEL     68     87.0      0.990566          1.0                0.536232       -32.0          -32.0        0.0       True     1468.0.0  tp
chr10:1460830-1460899.A  DEL     68     73.0      0.976411          1.0                0.0            890.0          890.0        0.0       True     1468.1.0  fn

This shows that our FN is in the neighborhood of a TP pair with the same SV type and length (a 68bp DEL). At this point we could investigate whether the FN is a real variant or some sort of assembly/alignment artifact. If it is indeed real (and is the best representation of said variant), we could look for a way to improve the caller. If the FN is suspect, we could remove it from the base VCF (or give it a non-PASS filter); with one fewer FN, the recall in our benchmarking results would improve.
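
The same neighborhood lookup can be run for every high-similarity FN at once. The sketch below assumes, based on the IDs shown above, that the part of `MatchId` before the first `.` identifies the chunk. The `chunk` column and the helper name `fns_near_tps` are hypothetical, and the toy rows copy a few columns from the table above.

```python
import pandas as pd


def fns_near_tps(df: pd.DataFrame, min_seqsim: float = 0.95) -> pd.DataFrame:
    """Return FN rows with high PctSeqSimilarity whose chunk also holds a TP."""
    df = df.copy()
    # Assumption: the MatchId prefix before the first '.' is the chunk identifier
    df["chunk"] = df["MatchId"].str.split(".").str[0]
    tp_chunks = set(df.loc[df["state"] == "tp", "chunk"])
    is_fn = (df["state"] == "fn") & (df["PctSeqSimilarity"] >= min_seqsim)
    return df[is_fn & df["chunk"].isin(tp_chunks)]


toy = pd.DataFrame(
    {
        "svtype": ["DEL", "DEL", "DEL"],
        "PctSeqSimilarity": [0.990566, 0.990566, 0.976411],
        "MatchId": ["1468.0.0", "1468.0.0", "1468.1.0"],
        "state": ["tpbase", "tp", "fn"],
    },
    index=[
        "chr10:1459908-1459977.C",
        "chr10:1459940-1460009.T",
        "chr10:1460830-1460899.A",
    ],
)

suspects = fns_near_tps(toy)
print(suspects[["MatchId", "state", "chunk"]])
```

Each row returned is a candidate for the manual review described above: either a real variant the caller represented elsewhere, or a questionable entry in the base VCF.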
