Skip to content

Alignment trimming

Ryan Wick edited this page Jan 11, 2022 · 7 revisions

Homopolymer-length errors are the most common type of error in long-read-only assemblies, and so Polypolish contains a little bit of extra logic to help with these: some bases are trimmed off the ends of alignments before generating the pileup. This page uses a small example to illustrate how and why this works.

Here is the reference genome in this example, with an 8×T homopolymer in the middle:

reference

And here is the assembly which contains a deletion, turning the homopolymer into 7×T:

assembly

Polypolish aims to use short reads (which contain the correct 8×T homopolymer) to repair that deletion.

Note that the deletion could be considered to occur anywhere in that homopolymer sequence – deleting any one of the 8 Ts would yield the same result. But since aligners tend to left-align indels (i.e. put indels in their leftmost position), it is appropriate to consider the deletion occurring on the left side of the homopolymer.

Without alignment trimming

Aligning short reads to the assembly yields the following alignments (with the relevant position highlighted in yellow):

alignments without trimming

You can see that some of the alignments have done the correct thing by inserting an extra T (A → AT) at the relevant position. However, many alignments have not. Specifically, reads that end in the homopolymer do not need an insertion to align cleanly.

This ambiguity (some alignments support an insertion and some don't) is reflected in the pileup:

pileup without trimming

At the relevant position, the pileup contains 8 votes for A, 8 votes for AT and 1 vote for T. So Polypolish would not be able to make a decision here, and the deletion error would remain unfixed.

With alignment trimming

To solve this problem, Polypolish trims a few bases off the end of the alignments. Specifically, it does the following:

  • Whatever base is at the end of the alignment is trimmed off, however many times it occurs. So if an alignment ends in GGG, all three Gs are trimmed.
  • One more additional base is trimmed off.

Or if you prefer the logic in Python:

last_base = aligned_bases[-1]
while aligned_bases[-1] == last_base:
    aligned_bases.pop()
aligned_bases.pop()

After applying this logic, the alignments look like this (trimmed bases shown faintly):

alignments with trimming

And the pileup looks like this:

pileup with trimming

At the relevant position, the pileup now contains 8 votes for AT and 1 vote for T. So Polypolish will change the sequence to AT, repairing the error.