-
Notifications
You must be signed in to change notification settings - Fork 10
Alignment trimming
Homopolymer-length errors are the most common type of error in long-read-only assemblies, and so Polypolish contains a little bit of extra logic to help with these: some bases are trimmed off the ends of alignments before generating the pileup. This page uses a small example to illustrate how and why this works.
Here is the reference genome in this example, with an 8×T homopolymer in the middle:
And here is the assembly which contains a deletion, turning the homopolymer into 7×T:
Polypolish aims to use short reads (which contain the correct 8×T homopolymer) to repair that deletion.
Note that the deletion could be considered to occur anywhere in that homopolymer sequence – deleting any one of the 8 Ts would yield the same result. But since aligners tend to left-align indels (i.e. put indels in their leftmost position), it is appropriate to consider the deletion occurring on the left side of the homopolymer.
Aligning short reads to the assembly yields the following alignments (with the relevant position highlighted in yellow):
You can see that some of the alignments have done the correct thing by inserting an extra T (A → AT) at the relevant position. However, many alignments have not. Specifically, reads that end in the homopolymer do not need an insertion to align cleanly.
This ambiguity (some alignments support an insertion and some don't) is reflected in the pileup:
At the relevant position, the pileup contains 8 votes for A, 8 votes for AT and 1 vote for T. So Polypolish would not be able to make a decision here, and the deletion error would remain unfixed.
To solve this problem, Polypolish trims a few bases off the end of the alignments. Specifically, it does the following:
- Whatever base is at the end of the alignment is trimmed off, however many times it occurs. So if an alignment ends in GGG, all three Gs are trimmed.
- One more additional base is trimmed off.
Or if you prefer the logic in Python:
last_base = aligned_bases[-1]
while aligned_bases[-1] == last_base:
aligned_bases.pop()
aligned_bases.pop()
After applying this logic, the alignments look like this (trimmed bases shown faintly):
And the pileup looks like this:
At the relevant position, the pileup now contains 8 votes for AT and 1 vote for T. So Polypolish will change the sequence to AT, repairing the error.