Manual inspection of results

Ryan Wick edited this page Apr 30, 2024 · 5 revisions

The compare_assemblies.py script can help you to manually inspect polishing results.

It takes two assemblies (e.g. pre-polishing and post-polishing) as input, aligns them, and produces human-readable output showing each region of the alignment where the two differ. The assemblies don't have to be pre- and post-Polypolish: any two will work as long as they contain the same contigs in the same order.
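To make the idea of "showing regions where there are differences" concrete, here is a minimal Python sketch of how a marker line can be derived from two already-aligned sequences. It is an illustration only, not the script's actual implementation: the function name `mark_differences` is hypothetical, and the input is assumed to be a pair of equal-length strings that use `-` for gaps. The example pair is the first one from the sample output further down this page.

```python
def mark_differences(aligned_a: str, aligned_b: str) -> str:
    """Return a line with '*' under every column where the two
    aligned sequences differ, and a space everywhere else."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return "".join("*" if a != b else " " for a, b in zip(aligned_a, aligned_b))


# Gapped sequences taken from the first block of the sample output.
before = "GGTTTGTAGCAAAAA-CTAAGCCCACCAAGA"
after = "GGTTTGTAGCAAAAAACTAAGCCCACCAAGA"

print(before)
print(after)
print(mark_differences(before, after))
```

The single asterisk lands under the gap column, which is the one position where the two sequences disagree.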

Requirements, installation instructions and usage are available here:
github.com/rrwick/Perfect-bacterial-genome-tutorial/wiki/Comparing-assemblies

Sample output

before_polishing 1193-1222: GGTTTGTAGCAAAAA-CTAAGCCCACCAAGA
 after_polishing 1193-1223: GGTTTGTAGCAAAAAACTAAGCCCACCAAGA
                                           *               

before_polishing 1247-1276: TTTTTTATTCAAAAA-GAAAGCCCTCTTCAA
 after_polishing 1248-1278: TTTTTTATTCAAAAAAGAAAGCCCTCTTCAA
                                           *               

before_polishing 1650-1679: AATAAAGTCTTTTTT-GTTCTCTCTATTAAA
 after_polishing 1652-1682: AATAAAGTCTTTTTTTGTTCTCTCTATTAAA
                                           *               

before_polishing 1733-1979: AAAGTACGAAGGATTTTATTCTGCATAAGATCATGATTGACCATGTTTAGGATGGAAGATGACAGAGTCATATGTAAACAAAGAAGAAATCATCTCTTTAGCAAAGAATGCTGCATTGGAGTTGGAAGATGCCCACGTGGAAGAGTTCGTAACATCTATGAATGACGTCATTGCTTTAATGCAGGAAGTAATCGCGATAGATATTTCGGATATCATTCTTGAAGCTACAGTGCATCATTTCGTTGGT
 after_polishing 1736-1828: AAAGTACGAAGGATT--ATT--GC-T----T--T-A---A---TG-----------------CAG-G--A-A-GTAA------------TC---------GC---GA-T-----A--G-A-T---A---T-------T-------T-CG----------GA-T-A--TCATT-CTT----G-A--A-G----C----TA-------C--A----------G----T---G--CATCATTTCGTTGGT
                                           **   **  * **** ** * *** ***  *****************   * ** * *    ************  *********  ***  * ***** ** * * *** *** ******* ******* *  **********  * * **     *   **** * ** * **** ****  ******* ** ********** **** *** **               

before_polishing 3787-3816: TGCGCTGTAGAGGGG-ATGTCGCTTTATTTA
 after_polishing 3635-3665: TGCGCTGTAGAGGGGGATGTCGCTTTATTTA
                                           *                   

before_polishing 6434-6463: AGAGGAGGAACGGGG-AGCTTGGCAGCCGCT
 after_polishing 6284-6314: AGAGGAGGAACGGGGGAGCTTGGCAGCCGCT
                                           *               

As you can see, most of the differences in this example are single-base changes in homopolymer length, exactly the sort of change one would expect to see after polishing a long-read assembly with short reads.
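If you want to check this pattern programmatically, the following Python sketch tests whether an aligned pair differs by exactly one gap column whose extra base extends a run of the same base. The function name `is_homopolymer_indel` is hypothetical and not part of compare_assemblies.py; the sketch assumes the two sequences are gapped with `-` and have equal length, as in the sample output.

```python
def is_homopolymer_indel(aligned_a: str, aligned_b: str) -> bool:
    """True if the pair differs at exactly one column, that column is a
    gap in one sequence, and the extra base in the other sequence sits
    next to an identical base (i.e. it lengthens a homopolymer run)."""
    if len(aligned_a) != len(aligned_b):
        return False
    diffs = [i for i, (a, b) in enumerate(zip(aligned_a, aligned_b)) if a != b]
    if len(diffs) != 1:
        return False
    i = diffs[0]
    a, b = aligned_a[i], aligned_b[i]
    if (a == "-") == (b == "-"):
        return False  # a substitution, not an insertion or deletion
    longer = aligned_b if a == "-" else aligned_a
    # Bases immediately to the left and right of the extra base.
    neighbours = longer[max(i - 1, 0):i] + longer[i + 1:i + 2]
    return longer[i] in neighbours


# Fifth block of the sample output: one extra G in a run of Gs.
print(is_homopolymer_indel("TGCGCTGTAGAGGGG-ATGTCGCTTTATTTA",
                           "TGCGCTGTAGAGGGGGATGTCGCTTTATTTA"))
```

Checking the neighbouring bases, rather than a fixed position in the run, keeps the test independent of where the aligner happens to place the gap within the homopolymer.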

However, one region (positions 1733-1979 in the before-polishing sequence) contains far more differences. This could indicate that the before-polishing sequence was poorly assembled in that region, or that something has gone wrong with the polishing. For example, the long reads used to generate the assembly and the short reads used to polish it may disagree at that locus. Either way, it's a region of interest, and the sort of thing this human-readable file can help you to identify.
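For a large output file, scanning by eye for such regions can be tedious. The Python sketch below shows one possible way to flag them automatically. It assumes the output follows the three-line layout of the sample above (two `name start-end: sequence` lines followed by a line of spaces and asterisks) and that contig names contain no spaces. The names `parse_regions` and `flag_dense`, and the threshold of three differences, are arbitrary choices for illustration rather than anything provided by compare_assemblies.py.

```python
import re

# Matches e.g. "before_polishing 1193-1222: GGTT..." with optional leading spaces.
SEQ_LINE = re.compile(r"^\s*(\S+) (\d+)-(\d+): (\S+)\s*$")


def parse_regions(text):
    """Return a list of (name, start, end, n_diffs) tuples, one per block,
    using the name and coordinates of the first sequence in each block."""
    regions = []
    pending = []  # sequence lines seen since the last marker line
    for line in text.splitlines():
        match = SEQ_LINE.match(line)
        if match:
            pending.append((match.group(1), int(match.group(2)), int(match.group(3))))
        elif "*" in line and set(line) <= {" ", "*"}:
            # Marker line: each asterisk is one differing alignment column.
            if len(pending) == 2:
                name, start, end = pending[0]
                regions.append((name, start, end, line.count("*")))
            pending = []
    return regions


def flag_dense(regions, max_diffs=3):
    """Keep only the regions with more than max_diffs differing columns."""
    return [region for region in regions if region[3] > max_diffs]


# First block of the sample output, used here as a tiny test input.
sample = (
    "before_polishing 1193-1222: GGTTTGTAGCAAAAA-CTAAGCCCACCAAGA\n"
    " after_polishing 1193-1223: GGTTTGTAGCAAAAAACTAAGCCCACCAAGA\n"
    "                                           *\n"
)
print(parse_regions(sample))
print(flag_dense(parse_regions(sample)))
```

Run on the full sample output, a filter like this would be expected to leave only the 1733-1979 block, since every other block contains a single differing column.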