Similarity of True Positives #78
-
With the new annotation of FPs as of v3.1.0, we can do even more with this type of analysis. To start, we need to make the data. Here, we'll compare a BioGraph result with calls from a long-read, haplotype resolved assembly.

truvari bench -c HG00514.grch38.vcf.gz -b HG00514.strict.vcf.gz -o bench -f grch38.fa --multimatch
truvari vcf2df -b -i bench/ bench/data.jl

Then, inside of a script or a notebook:

import joblib
import pandas as pd
import seaborn as sb
sb.set()
data = joblib.load("bench/data.jl")
p = sb.histplot(data=data.reset_index(), x="PctSeqSimilarity",
                hue="state", multiple="stack", binwidth=0.05,
                hue_order=["tp", "fp", "fn"])
p.set(title="Truvari PctSeqSimilarity Distribution")

Note we don't add "tpbase" to the hue_order as these data points can be redundant with "tp". While a majority of True Positives have >= 95% Sequence Similarity, there are still some False Negatives that have a comparison call with high similarity. Let's investigate one of them:

view_columns = ["svtype", "svlen", "TruScore", "PctSeqSimilarity", "PctSizeSimilarity", "PctRecOverlap",
                "StartDistance", "EndDistance", "SizeDiff", "GTMatch", "MatchId", "state"]
data[(data["state"] == "fn") & (data["PctSeqSimilarity"] >= 0.95)].head(1)[view_columns]

Returns:
We can see that this FN does have a call with high Sequence Similarity (97%). However, the Start/End Distance values show that the comparison call is 890bp upstream, which is greater than the [...]

Let's look at the neighborhood of calls within [...]:

data[data["MatchId"].str.startswith("1468.")][view_columns]

Returns:
This shows that our FN is in the neighborhood of a TP pair that's identical. At this point we could investigate whether the FN is a real variant or some sort of assembly/alignment artifact. If it is indeed real (and is the best representation of said variant), we could look for a way to improve the caller. If the FN is suspect, we could remove it from the base VCF (or give it a non-PASS filter); with one fewer FN counted, this would then improve our benchmarking results' recall.
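To repeat that neighborhood lookup for every high-similarity FN rather than one at a time, something like the following could work. This is only a sketch: `match_neighborhood` and the `toy` table are hypothetical names with invented values standing in for `bench/data.jl`, and it assumes `MatchId` is a string of the form `<prefix>.<id>`, as the `startswith("1468.")` query above suggests.

```python
import pandas as pd

def match_neighborhood(df: pd.DataFrame, match_id: str) -> pd.DataFrame:
    """Return all rows whose MatchId shares the part before the first '.'.

    Hypothetical helper; assumes MatchId strings look like '<prefix>.<id>'.
    """
    prefix = match_id.split(".", 1)[0] + "."
    return df[df["MatchId"].str.startswith(prefix)]

# Toy stand-in for bench/data.jl; every value here is invented.
toy = pd.DataFrame({
    "state": ["fn", "tp", "tpbase", "fp"],
    "PctSeqSimilarity": [0.97, 1.0, 1.0, 0.40],
    "StartDistance": [-890, 0, 0, 12],
    "MatchId": ["1468.0", "1468.1", "1468.1", "77.0"],
})

# FNs that still have a highly similar comparison call.
high_sim_fn = toy[(toy["state"] == "fn") & (toy["PctSeqSimilarity"] >= 0.95)]

# Show the neighborhood of each such FN.
for match_id in high_sim_fn["MatchId"]:
    print(match_neighborhood(toy, match_id))
```

With the real data, `toy` would be replaced by the DataFrame loaded with `joblib.load("bench/data.jl")`.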
-
Beyond precision/recall/concordance, another measure of how accurate SV calls are is their similarity to the base calls. Truvari reports PctSeqSimilarity and PctSizeSimilarity for True Positives. The higher the similarity between the base and comparison calls, the more accurately the comparison calls represent the base calls.
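For intuition about what these two numbers capture, here is a rough sketch. It assumes size similarity is the ratio of the smaller to the larger SV length, and it uses Python's `difflib` as a stand-in for sequence similarity. Both functions are hypothetical illustrations, not Truvari's implementation, and the handling of zero-length inputs is an arbitrary choice.

```python
from difflib import SequenceMatcher

def size_similarity(len_a: int, len_b: int) -> float:
    """Ratio of the smaller to the larger SV length (assumed definition)."""
    a, b = abs(len_a), abs(len_b)
    if a == 0 or b == 0:
        # Arbitrary choice for this sketch: identical only if both are zero.
        return 1.0 if a == b else 0.0
    return min(a, b) / max(a, b)

def seq_similarity(seq_a: str, seq_b: str) -> float:
    """Stand-in sequence similarity via difflib; NOT Truvari's method."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()

print(size_similarity(300, 285))               # 0.95
print(seq_similarity("ACGTACGT", "ACGTACGT"))  # 1.0
print(seq_similarity("ACGTACGT", "ACGTACGA"))  # below 1.0: one base differs
```

A comparison call that matches the base call in both length and sequence scores 1.0 on both measures; the further either drifts, the lower the score.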
Below is the plot of an HG002 sample run through BioGraph and compared to GIAB Tier1 SVs.
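The same distributions can also be summarized numerically. A minimal sketch, assuming a DataFrame shaped like the `truvari vcf2df` output with `state`, `PctSeqSimilarity` and `PctSizeSimilarity` columns; the `toy` table and its values are invented for illustration and are not results from the HG002 run.

```python
import pandas as pd

# Toy stand-in for the vcf2df output; every value here is invented.
toy = pd.DataFrame({
    "state": ["tp", "tp", "tp", "tp", "fp", "fn"],
    "PctSeqSimilarity": [1.00, 0.98, 0.96, 0.80, 0.30, 0.97],
    "PctSizeSimilarity": [1.00, 0.99, 0.90, 0.85, 0.50, 0.99],
})

# Keep only true positives, as in the plot described above.
tp = toy[toy["state"] == "tp"]
metrics = ["PctSeqSimilarity", "PctSizeSimilarity"]

# Fraction of true positives at or above 95% for each metric.
frac_high = (tp[metrics] >= 0.95).mean()
print(frac_high)

# Count, mean, quartiles, etc. for each metric.
print(tp[metrics].describe())
```

The 0.95 cutoff is just a convenient threshold for this sketch; any value could be used.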