-
Notifications
You must be signed in to change notification settings - Fork 10
How to run Polypolish
bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads_1.fastq.gz > alignments_1.sam
bwa mem -t 16 -a draft.fasta reads_2.fastq.gz > alignments_2.sam
polypolish filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam
polypolish polish draft.fasta filtered_1.sam filtered_2.sam > polished.fasta
rm *.amb *.ann *.bwt *.pac *.sa *.sam
Most short-read sets are paired-end, so this will be the most common way to run Polypolish.
In these commands, I will assume your files are named like this:
-
draft.fasta
: the input assembly that you want to polish -
reads_1.fastq.gz
: short reads (first in pair) -
reads_2.fastq.gz
: short reads (second in pair)
The first step is to align the reads to your draft genome with BWA:
bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads_1.fastq.gz > alignments_1.sam
bwa mem -t 16 -a draft.fasta reads_2.fastq.gz > alignments_2.sam
Important things to note here:
- The
-a
option is key! This makes BWA align each read to all possible locations, not just the best location. Polypolish needs this to polish repeat sequences. - You have to align the two read files separately. I.e. don't align both read files with a single
bwa mem
command. This is because BWA's-a
option doesn't work with paired-end alignment. - I used
-t 16
in these commands to align using 16 threads. Adjust this value as is appropriate for your system. This will only affect the speed performance (the alignments themselves are unaffected by thread count).
You can then filter the alignments using Polypolish's insert size filter. This step is optional but recommended:
polypolish filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam
Then give the draft genome and the alignments to Polypolish. It will output information to stderr and the polished assembly to stdout, so redirect its output to a file:
polypolish polish draft.fasta filtered_1.sam filtered_2.sam > polished.fasta
Finally, you can clean up the index files (made by bwa index
) and alignments to save disk space:
rm *.amb *.ann *.bwt *.pac *.sa *.sam
Polypolish works with any number of SAM files, so unpaired reads are easy:
bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads.fastq.gz > alignments.sam
polypolish polish draft.fasta alignments.sam > polished.fasta
rm *.amb *.ann *.bwt *.pac *.sa *.sam
And multiple paired-end read sets work too:
bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads_a_1.fastq.gz > alignments_a_1.sam
bwa mem -t 16 -a draft.fasta reads_a_2.fastq.gz > alignments_a_2.sam
bwa mem -t 16 -a draft.fasta reads_b_1.fastq.gz > alignments_b_1.sam
bwa mem -t 16 -a draft.fasta reads_b_2.fastq.gz > alignments_b_2.sam
polypolish filter --in1 alignments_a_1.sam --in2 alignments_a_2.sam --out1 filtered_a_1.sam --out2 filtered_a_2.sam
polypolish filter --in1 alignments_b_1.sam --in2 alignments_b_2.sam --out1 filtered_b_1.sam --out2 filtered_b_2.sam
polypolish polish draft.fasta filtered_*.sam > polished.fasta
Polypolish is unlikely to introduce errors during polishing, but if you want to be very sure that Polypolish doesn't introduce errors, use --careful
:
polypolish polish --careful draft.fasta filtered_1.sam filtered_2.sam > polished.fasta
This will make Polypolish discard any read that aligns to multiple places. This effectively blinds Polypolish to repeat regions of the genome, which are at the highest risk of introduced errors. But it also means that Polypolish cannot fix any errors in repeat regions.
I recommend the use of --careful
when read depth is low: 25× or less.
More info on careful mode and relevant benchmarks are available in this paper.
Polypolish has a --debug
option which you can use to save per-base polishing information in a tab-delimited format. You can use it like this:
polypolish polish --debug polished.tsv draft.fasta filtered_1.sam filtered_2.sam > polished.fasta
The resulting file will have one line for each base of the assembly, so it will be hundreds of megabytes in size for a typical bacterial genome. See the Toy example page for an example of what this file looks like.
If you run polypolish --help
, you will see this top-level help message:
_____ _ _ _ _
| __ \ | | | |(_) | |
| |__) |___ | | _ _ _ __ ___ | | _ ___ | |__
| ___// _ \ | || | | || '_ \ / _ \ | || |/ __|| '_ \
| | | (_) || || |_| || |_) || (_) || || |\__ \| | | |
|_| \___/ |_| \__, || .__/ \___/ |_||_||___/|_| |_|
__/ || |
|___/ |_|
short-read polishing of long-read assemblies
Usage: polypolish <COMMAND>
Commands:
filter filter paired-end alignments based on insert size
polish polish a long-read assembly using short-read alignments
Options:
-h, --help Print help
-V, --version Print version
Run polypolish filter --help
to see the help for the filter subcommand:
filter paired-end alignments based on insert size
Usage: polypolish filter [OPTIONS] --in1 <IN1> --in2 <IN2> --out1 <OUT1> --out2 <OUT2>
Options:
--in1 <IN1> Input SAM file - first read in pairs
--in2 <IN2> Input SAM file - first second in pairs
--out1 <OUT1> Output SAM file - first read in pairs
--out2 <OUT2> Output SAM file - first second in pairs
--orientation <ORIENTATION> Expected pair orientation [default: auto]
--low <LOW> Low percentile threshold [default: 0.1]
--high <HIGH> High percentile threshold [default: 99.9]
-h, --help Print help
-V, --version Print version
And run polypolish polish --help
to see the help for the polish subcommand:
polish a long-read assembly using short-read alignments
Usage: polypolish polish [OPTIONS] <ASSEMBLY> [SAM]...
Arguments:
<ASSEMBLY> Assembly to polish (one file in FASTA format)
[SAM]... Short read alignments (one or more files in SAM format)
Options:
--debug <DEBUG> Optional file to store per-base information for debugging purposes
-i, --fraction_invalid <FRACTION_INVALID> A base must make up less than this fraction of the read depth to be
considered invalid [default: 0.2]
-v, --fraction_valid <FRACTION_VALID> A base must make up at least this fraction of the read depth to be
considered valid [default: 0.5]
-m, --max_errors <MAX_ERRORS> Ignore alignments with more than this many mismatches and indels
[default: 10]
-d, --min_depth <MIN_DEPTH> A base must occur at least this many times in the pileup to be
considered valid [default: 5]
--careful Ignore any reads with multiple alignments
-h, --help Print help
-V, --version Print version