Tips for finishing genomes

Ryan Wick edited this page Sep 1, 2017 · 5 revisions

Ideally, a Unicycler hybrid assembly will result in a completed bacterial genome all by itself. But if it doesn't, then the genome might need 'manual completion', which can involve all sorts of different bioinformatics detective work. This page contains some tips and tricks to help you along.

Requirements:

  • Bandage to view/edit assembly graphs
  • minimap2 to quickly align long reads

Check for completion

How do you tell if the assembly is complete? Start with the Unicycler log. In the 'Bridged assembly graph' section towards the end of Unicycler's pipeline, it will summarise the graph components:

Component   Segments   Links   Length      N50         Longest segment   Status  
    total          7       7   5,676,472   5,583,468         5,583,468           
        1          1       1   5,583,468   5,583,468         5,583,468   complete
        2          1       1      71,104      71,104            71,104   complete
        3          1       1       6,657       6,657             6,657   complete
        4          1       1       5,783       5,783             5,783   complete
        5          1       1       3,514       3,514             3,514   complete
        6          1       1       3,223       3,223             3,223   complete
        7          1       1       2,723       2,723             2,723   complete

Unicycler considers a component complete if it is circular: a single segment with a single link joining its end back to its start. This doesn't apply if your bacterial genome has linear chromosomes/plasmids. A finished linear replicon would be a single segment with no links, so Unicycler will not label it complete.
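The one-segment, one-link rule can also be checked directly on a GFA file such as assembly.gfa. What follows is a minimal sketch, not Unicycler's own code. The function name `component_status` is made up for illustration, and it assumes a GFA 1 file where each segment line carries its sequence and each link is written once:

```python
from collections import defaultdict


def component_status(gfa_text):
    """Group GFA segments into connected components and apply the
    'one segment and one link' rule to each component."""
    segments = {}   # segment name -> sequence length
    links = []      # (from_segment, to_segment) pairs
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            # Assumption: the sequence is present (not '*') on the S line.
            segments[fields[1]] = len(fields[2])
        elif fields[0] == "L":
            links.append((fields[1], fields[3]))

    # Union-find over segment names: linked segments share a component.
    parent = {name: name for name in segments}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in links:
        parent[find(a)] = find(b)

    seg_count = defaultdict(int)
    link_count = defaultdict(int)
    length = defaultdict(int)
    for name, seg_len in segments.items():
        root = find(name)
        seg_count[root] += 1
        length[root] += seg_len
    for a, _ in links:
        link_count[find(a)] += 1

    report = []
    for root in seg_count:
        complete = seg_count[root] == 1 and link_count[root] == 1
        report.append((length[root], seg_count[root], link_count[root],
                       "complete" if complete else "incomplete"))
    # Largest component first, like the table in the log.
    return sorted(report, reverse=True)
```

Each tuple in the result is (length, segments, links, status). A linear replicon with one segment and no links will come out as "incomplete" here, so treat that case by eye.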

You could also view the assembly graph (assembly.gfa) in Bandage and check that each contig is circular:

But what if it's not complete? The Unicycler log might have something like this:

Component   Segments   Links   Length      N50         Longest segment   Status    
    total         23      29   5,819,363   5,242,094         5,242,094             
        1          1       1   5,242,094   5,242,094         5,242,094     complete
        2          1       1     252,269     252,269           252,269     complete
        3          1       1     130,933     130,933           130,933     complete
        4          1       1     110,494     110,494           110,494     complete
        5          1       1      69,826      69,826            69,826     complete
        6          1       1       5,783       5,783             5,783     complete
        7         17      23       7,964       1,023             3,382   incomplete
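If you have many assemblies to check, the status column can be pulled out of the log with a small script. This is a rough sketch that assumes the table layout shown above (component, five comma-formatted numbers, optional status). The function names are hypothetical, and any unrelated log line that happens to have the same shape would also be picked up:

```python
import re

# One table row: 'total' or a component number, five numeric columns,
# then an optional status word (the 'total' row has none).
ROW = re.compile(
    r"^\s*(total|\d+)\s+([\d,]+)\s+([\d,]+)\s+([\d,]+)\s+([\d,]+)\s+([\d,]+)"
    r"(?:\s+(\w+))?\s*$"
)


def parse_component_table(log_text):
    """Return one dict per row of the component summary table."""
    rows = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header line or other log output
        name, *numbers, status = match.groups()
        segments, links, length, n50, longest = (
            int(n.replace(",", "")) for n in numbers
        )
        rows.append({
            "component": name,
            "segments": segments,
            "links": links,
            "length": length,
            "n50": n50,
            "longest_segment": longest,
            "status": status,  # None for the 'total' row
        })
    return rows


def incomplete_components(log_text):
    """Return only the rows whose status is 'incomplete'."""
    return [row for row in parse_component_table(log_text)
            if row["status"] == "incomplete"]
```

Run on the table above, `incomplete_components` would return a single row for component 7 (17 segments, 23 links), which is the part of the graph that needs attention.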

And the Bandage graph might look like this:

Manual completion

There are many reasons why Unicycler might fail to complete a hybrid assembly, so there is no single easy method for manual completion. You'll need to rely on detective work and bioinformatics know-how. Some general methods which may help are:

  • Using Bandage to visualise the assembly graphs from various stages of the Unicycler pipeline.
  • Gathering long reads for incomplete regions of the assembly and BLASTing them to the graphs.
  • Aligning short and/or long reads to the assembly and examining the alignments in IGV or Artemis.
  • Using other assemblers (e.g. Canu) on the reads and comparing the results to Unicycler's assembly.
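As one way to gather long reads for an incomplete region, you could align the reads to the assembly with minimap2 (which writes PAF by default, e.g. `minimap2 -x map-ont assembly.fasta long_reads.fastq.gz > alignments.paf`, where the file names are placeholders) and then keep the reads that hit the contigs of interest. The sketch below only does the second step. The function name and the 500 bp cutoff are arbitrary choices for illustration, not anything Unicycler prescribes:

```python
def reads_hitting_contigs(paf_text, contig_names, min_aligned_bases=500):
    """Return the names of reads with an alignment to any listed contig.

    PAF columns used (0-based): 0 = read name, 2 = read start,
    3 = read end, 5 = target (contig) name.
    """
    wanted = set(contig_names)
    selected = set()
    for line in paf_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a full PAF record
        read_name, target_name = fields[0], fields[5]
        aligned_bases = int(fields[3]) - int(fields[2])
        if target_name in wanted and aligned_bases >= min_aligned_bases:
            selected.add(read_name)
    return sorted(selected)
```

The returned read names can then be used to pull those reads out of the FASTQ file for BLASTing or for a closer look in Bandage. For very short segments you may need to lower `min_aligned_bases`.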

To get you going, here are some real-world examples of assemblies which failed to complete and how I tried to fix them up: