Skip to content

How to run Polypolish

Ryan Wick edited this page Jun 7, 2024 · 27 revisions

Quick start

bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads_1.fastq.gz > alignments_1.sam
bwa mem -t 16 -a draft.fasta reads_2.fastq.gz > alignments_2.sam
polypolish filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam
polypolish polish draft.fasta filtered_1.sam filtered_2.sam > polished.fasta
rm *.amb *.ann *.bwt *.pac *.sa *.sam

Using paired-end short reads

Most short-read sets are paired-end, so this will be the most common way to run Polypolish.

In these commands, I will assume your files are named like this:

  • draft.fasta: the input assembly that you want to polish
  • reads_1.fastq.gz: short reads (first in pair)
  • reads_2.fastq.gz: short reads (second in pair)

The first step is to align the reads to your draft genome with BWA:

bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads_1.fastq.gz > alignments_1.sam
bwa mem -t 16 -a draft.fasta reads_2.fastq.gz > alignments_2.sam

Important things to note here:

  • The -a option is key! This makes BWA align each read to all possible locations, not just the best location. Polypolish needs this to polish repeat sequences.
  • You have to align the two read files separately. I.e. don't align both read files with a single bwa mem command. This is because BWA's -a option doesn't work with paired-end alignment.
  • I used -t 16 in these commands to align using 16 threads. Adjust this value as is appropriate for your system. This will only affect the speed performance (the alignments themselves are unaffected by thread count).

You can then filter the alignments using Polypolish's insert size filter. This step is optional but recommended:

polypolish filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam

Then give the draft genome and the alignments to Polypolish. It will output information to stderr and the polished assembly to stdout, so redirect its output to a file:

polypolish polish draft.fasta filtered_1.sam filtered_2.sam > polished.fasta

Finally, you can clean up the index files (made by bwa index) and alignments to save disk space:

rm *.amb *.ann *.bwt *.pac *.sa *.sam

Using other types of short reads

Polypolish works with any number of SAM files, so unpaired reads are easy:

bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads.fastq.gz > alignments.sam
polypolish polish draft.fasta alignments.sam > polished.fasta
rm *.amb *.ann *.bwt *.pac *.sa *.sam

And multiple paired-end read sets work too:

bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads_a_1.fastq.gz > alignments_a_1.sam
bwa mem -t 16 -a draft.fasta reads_a_2.fastq.gz > alignments_a_2.sam
bwa mem -t 16 -a draft.fasta reads_b_1.fastq.gz > alignments_b_1.sam
bwa mem -t 16 -a draft.fasta reads_b_2.fastq.gz > alignments_b_2.sam
polypolish filter --in1 alignments_a_1.sam --in2 alignments_a_2.sam --out1 filtered_a_1.sam --out2 filtered_a_2.sam
polypolish filter --in1 alignments_b_1.sam --in2 alignments_b_2.sam --out1 filtered_b_1.sam --out2 filtered_b_2.sam
polypolish polish draft.fasta filtered_*.sam > polished.fasta

Careful mode

Polypolish is unlikely to introduce errors during polishing, but if you want to be very sure that Polypolish doesn't introduce errors, use --careful:

polypolish polish --careful draft.fasta filtered_1.sam filtered_2.sam > polished.fasta

This will make Polypolish discard any read that aligns to multiple places. This effectively blinds Polypolish to repeat regions of the genome, which are at the highest risk of introduced errors. But it also means that Polypolish cannot fix any errors in repeat regions.

I recommend the use of --careful when read depth is low: 25× or less.

More info on careful mode and relevant benchmarks are available in this paper.

Debug file

Polypolish has a --debug option which you can use to save per-base polishing information in a tab-delimited format. You can use it like this:

polypolish polish --debug polished.tsv draft.fasta filtered_1.sam filtered_2.sam > polished.fasta

The resulting file will have one line for each base of the assembly, so it will be hundreds of megabytes in size for a typical bacterial genome. See the Toy example page for an example of what this file looks like.

Help text

If you run polypolish --help, you will see this top-level help message:

  _____        _                       _  _       _     
 |  __ \      | |                     | |(_)     | |    
 | |__) |___  | | _   _  _ __    ___  | | _  ___ | |__  
 |  ___// _ \ | || | | || '_ \  / _ \ | || |/ __|| '_ \ 
 | |   | (_) || || |_| || |_) || (_) || || |\__ \| | | |
 |_|    \___/ |_| \__, || .__/  \___/ |_||_||___/|_| |_|
                   __/ || |                             
                  |___/ |_|

short-read polishing of long-read assemblies

Usage: polypolish <COMMAND>

Commands:
  filter  filter paired-end alignments based on insert size
  polish  polish a long-read assembly using short-read alignments

Options:
  -h, --help     Print help
  -V, --version  Print version

Run polypolish filter --help to see the help for the filter subcommand:

filter paired-end alignments based on insert size

Usage: polypolish filter [OPTIONS] --in1 <IN1> --in2 <IN2> --out1 <OUT1> --out2 <OUT2>

Options:
      --in1 <IN1>                  Input SAM file - first read in pairs
      --in2 <IN2>                  Input SAM file - first second in pairs
      --out1 <OUT1>                Output SAM file - first read in pairs
      --out2 <OUT2>                Output SAM file - first second in pairs
      --orientation <ORIENTATION>  Expected pair orientation [default: auto]
      --low <LOW>                  Low percentile threshold [default: 0.1]
      --high <HIGH>                High percentile threshold [default: 99.9]
  -h, --help                       Print help
  -V, --version                    Print version

And run polypolish polish --help to see the help for the polish subcommand:

polish a long-read assembly using short-read alignments

Usage: polypolish polish [OPTIONS] <ASSEMBLY> [SAM]...

Arguments:
  <ASSEMBLY>  Assembly to polish (one file in FASTA format)
  [SAM]...    Short read alignments (one or more files in SAM format)

Options:
      --debug <DEBUG>                        Optional file to store per-base information for debugging purposes
  -i, --fraction_invalid <FRACTION_INVALID>  A base must make up less than this fraction of the read depth to be
                                             considered invalid [default: 0.2]
  -v, --fraction_valid <FRACTION_VALID>      A base must make up at least this fraction of the read depth to be
                                             considered valid [default: 0.5]
  -m, --max_errors <MAX_ERRORS>              Ignore alignments with more than this many mismatches and indels
                                             [default: 10]
  -d, --min_depth <MIN_DEPTH>                A base must occur at least this many times in the pileup to be
                                             considered valid [default: 5]
      --careful                              Ignore any reads with multiple alignments
  -h, --help                                 Print help
  -V, --version                              Print version