Skip to content

Toy example

Ryan Wick edited this page Aug 30, 2021 · 20 revisions

This page will show how Polypolish works using a small and simple example. The genome in this example is only 100-bp long and has a 20-bp two-copy repeat:

reference

Here is the assembly which contains three errors (that we aim to fix):

assembly

And here are the error-free short reads which we will use to polish the assembly:

reads

While this example is tiny, it can still be run in Polypolish, so download the files here if you want to try for yourself: assembly.fasta, reads.fastq, alignments.sam

Read alignment

One alignment per read (the wrong way)

Aligning reads in the normal way (i.e. using default aligner settings) involves putting each read in its single best position. If there are multiple equally good positions, then one is chosen at random. This would result in alignments that look like this:

read alignments (one per read)

This illustrates the problem with that alignment strategy: the error in the first copy of the repeat has caused reads from that part of the genome to align to the second copy of the repeat. This has left no reads aligning over the error.

All alignments (the right way)

If we instead align each read to all possible locations, we get alignments that look like this:

read alignments (all per read)

I've coloured the reads which align to multiple locations in red. Now we have reads aligning over the error in the repeat, and these are the kind of alignments that Polypolish was built to take.

Building a pileup

Using an all-alignments input, Polypolish tallies up the read sequences at each position of the assembly to make a pileup. You can visualise the process as taking the alignment (drawn above) and squashing down each column to remove the empty space:

pileup

Some things to note:

  • Each base of the assembly usually corresponds to a single base from the reads. However, a deletion in the reads relative to the assembly results in no base (shown as -) and an insertion in the reads relative to the assembly results in multiple bases at one position (a squished CT in this example).
  • The shaded green area represents the read depth at each position of the assembly. In non-repetitive regions (where reads align to only one position), the read depth is equivalent to the number of alignments. But for repetitive regions, the read depth is less than the number of alignments. This is because reads with multiple alignments only contribute fractional depth to each of their locations. E.g. when a read aligns to two possible places, it adds 0.5 depth to each of those places.
  • The dotted red lines represent the threshold depths (used in the next step and explained more there). For the invalid threshold depth (bottom dotted red line), Polypolish sets this to 20% of the read depth (adjustable with --fraction_invalid). For the valid threshold depth (top dotted red line), Polypolish sets this to either 5 (adjustable with --min_depth) or half the read depth (adjustable with --fraction_valid), whichever is larger. This toy example has low read depth, so the valid threshold depth is the minimum (5) for many positions of the assembly. In a real situation (with higher read depth), the valid threshold depth would be half the read depth for most positions of the assembly.
  • You might notice that the pileup seems to be missing some sequences which were in the alignment. E.g. a few reads aligned all the way to the end of the assembly, but in the pileup there are two bases at the end with no sequence. This is due to Polypolish's alignment trimming logic.

Fixing errors

Polypolish will assess these conditions for each position in the pileup:

  1. There is one and only valid sequence. A valid sequence is defined as as sequence in the pileup which occurs more times than the valid threshold depth (controlled by --min_depth and --fraction_valid).
  2. Any other sequences are invalid. An invalid sequence occurs fewer times than the invalid threshold depth (controlled by --fraction_invalid).
  3. The read depth is sufficiently high (controlled by --min_depth).
  4. The one valid sequence differs from the assembly base.

The possible outcomes are:

  • If all four conditions are true, that indicates an error in need of repair. Polypolish will change that position of the assembly to the pileup's valid sequence.
  • If conditions 1–3 are true but condition 4 is false,t hat suggests,

Conditions 1–3 ensure that there is one unambiguous sequence at the position in the pileup. If condition 4 is also true, then that position is

When all of those conditions are true, Polypolish will change that position of the assembly (the repair of an error).

In our example, changes occur at three positions:

fixes

These changes correspond to the three errors in the assembly, so the resulting genome is error free!

For positions with no valid sequences (e.g. due to low read depth), Polypolish will make no changes. In our example, this has happened at a few positions, mostly at the start/end of the assembly. For positions with multiple valid read sequences, then Polypolish will make no changes because it doesn't know which of the alternatives to use. This hasn't occurred in our toy example here – it would be most common when the sequence has an inexact repeat. Since Polypolish only makes a change in unambiguous cases, it is unlikely to introduce an error into the assembly genome. So you can be reasonably sure that the output of Polypolish is at least no worse than the input. It is a 'do no harm' polishing strategy (a term I'm borrowing from this paper).

Debug output

If you are interested in per-base polishing information, you can use Polypolish's --debug option to save a tab-delimited table to file. For each position of the assembly, you can see the assembly base, depth, threshold depth, read sequence pileup, status and new base.

The status can be:

  • low_depth: depth is below --min_depth
  • none: no valid options
  • multiple: more than one valid option
  • too_close: one or more options are neither valid nor invalid
  • kept: one valid option which matches the assembly base
  • changed: one valid option which differs from the assembly base

Here is what this file looks like for our toy example:

name   pos  base  depth  invalid  valid  pileup        status     new_base
draft  0    G     1.0    0        5      Gx1           low_depth  G
draft  1    A     2.0    0        5      Ax2           low_depth  A
draft  2    C     2.0    0        5      Cx2           low_depth  C
draft  3    T     2.0    0        5      Tx2           low_depth  T
draft  4    G     3.0    1        5      Gx3           low_depth  G
draft  5    T     4.0    1        5      Tx4           low_depth  T
draft  6    T     4.0    1        5      Tx4           low_depth  T
draft  7    C     4.0    1        5      Cx4           low_depth  C
draft  8    A     5.0    1        5      Ax5           kept       A
draft  9    A     5.0    1        5      Ax5           kept       A
draft  10   C     6.0    1        5      Cx6           kept       C
draft  11   G     9.0    2        5      Gx9           kept       G
draft  12   C     11.0   2        6      Cx11          kept       C
draft  13   G     11.0   2        6      Gx11          kept       G
draft  14   A     12.0   2        6      Ax13          kept       A
draft  15   T     11.5   2        6      Tx13          kept       T
draft  16   C     10.5   2        5      -x2,Cx10      too_close  C
draft  17   G     11.5   2        6      -x2,Gx11,Tx2  too_close  G
draft  18   C     12.5   2        6      Cx18          kept       C
draft  19   T     10.0   2        5      Tx16          kept       T
draft  20   G     7.5    2        5      Gx14          kept       G
draft  21   T     7.5    2        5      Tx15          kept       T
draft  22   A     7.0    1        5      Ax14          kept       A
draft  23   T     8.0    2        5      Tx16          kept       T
draft  24   T     8.5    2        5      Tx17          kept       T
draft  25   C     7.5    2        5      Cx15          kept       C
draft  26   A     10.5   2        5      Ax21          kept       A
draft  27   G     9.5    2        5      Cx19          changed    C
draft  28   C     9.0    2        5      Cx18          kept       C
draft  29   A     9.5    2        5      Ax19          kept       A
draft  30   A     10.0   2        5      Ax20          kept       A
draft  31   G     11.0   2        6      Gx22          kept       G
draft  32   G     9.5    2        5      Gx18          kept       G
draft  33   T     12.5   2        6      Tx21          kept       T
draft  34   T     9.0    2        5      Tx14          kept       T
draft  35   C     7.0    1        5      Cx10          kept       C
draft  36   T     9.0    2        5      Tx12          kept       T
draft  37   T     9.5    2        5      Tx11          kept       T
draft  38   C     10.0   2        5      Cx10          kept       C
draft  39   G     11.0   2        6      Gx11          kept       G
draft  40   G     10.0   2        5      Gx10          kept       G
draft  41   C     11.0   2        6      Cx11          kept       C
draft  42   C     14.0   3        7      Cx14          kept       C
draft  43   T     14.0   3        7      Tx14          kept       T
draft  44   T     12.0   2        6      Tx12          kept       T
draft  45   C     12.0   2        6      Cx12          kept       C
draft  46   A     10.0   2        5      Ax10          kept       A
draft  47   C     9.0    2        5      Cx9           kept       C
draft  48   G     6.0    1        5      Gx6           kept       G
draft  49   T     7.0    1        5      Tx7           kept       T
draft  50   A     4.0    1        5      Ax4           low_depth  A
draft  51   C     6.0    1        5      CTx6          changed    CT
draft  52   C     5.0    1        5      Cx5           kept       C
draft  53   T     5.0    1        5      Tx5           kept       T
draft  54   G     6.0    1        5      Gx6           kept       G
draft  55   C     8.0    2        5      Cx8           kept       C
draft  56   G     7.0    1        5      Gx7           kept       G
draft  57   C     7.0    1        5      Cx7           kept       C
draft  58   T     5.5    1        5      Tx6           kept       T
draft  59   A     6.5    1        5      Ax7,Cx1       too_close  A
draft  60   T     7.5    2        5      Gx3,Tx8       too_close  T
draft  61   C     9.5    2        5      Cx15          kept       C
draft  62   T     9.0    2        5      Tx15          kept       T
draft  63   G     7.5    2        5      Gx14          kept       G
draft  64   T     7.5    2        5      Tx15          kept       T
draft  65   A     8.0    2        5      Ax15          kept       A
draft  66   T     9.0    2        5      Tx17          kept       T
draft  67   T     9.5    2        5      Tx18          kept       T
draft  68   C     8.5    2        5      Cx16          kept       C
draft  69   A     9.5    2        5      Ax18          kept       A
draft  70   C     10.5   2        5      Cx20          kept       C
draft  71   C     10.0   2        5      Cx19          kept       C
draft  72   A     10.5   2        5      Ax20          kept       A
draft  73   A     10.0   2        5      Ax20          kept       A
draft  74   G     11.0   2        6      Gx22          kept       G
draft  75   G     9.5    2        5      Gx18          kept       G
draft  76   T     11.5   2        6      Tx20          kept       T
draft  77   T     11.0   2        6      Tx16          kept       T
draft  78   C     10.0   2        5      Cx13          kept       C
draft  79   T     11.0   2        6      Tx14          kept       T
draft  80   T     11.0   2        6      Tx13          kept       T
draft  81   A     9.5    2        5      Ax10          kept       A
draft  82   A     9.0    2        5      Ax9           kept       A
draft  83   C     9.0    2        5      Cx9           kept       C
draft  84   T     10.0   2        5      Tx10          kept       T
draft  85   A     8.0    2        5      -x8           changed    -
draft  86   C     5.0    1        5      Cx5           kept       C
draft  87   G     5.0    1        5      Gx5           kept       G
draft  88   T     4.0    1        5      Tx4           low_depth  T
draft  89   G     3.0    1        5      Gx3           low_depth  G
draft  90   T     7.0    1        5      Tx7           kept       T
draft  91   G     7.0    1        5      Gx7           kept       G
draft  92   G     6.0    1        5      Gx6           kept       G
draft  93   A     5.0    1        5      Ax5           kept       A
draft  94   G     5.0    1        5      Gx5           kept       G
draft  95   C     4.0    1        5      Cx4           low_depth  C
draft  96   T     4.0    1        5      Tx4           low_depth  T
draft  97   G     4.0    1        5      Gx4           low_depth  G
draft  98   C     0.0    0        5                    low_depth  C
draft  99   G     0.0    0        5                    low_depth  G