Tangled small plasmids
If your sample has multiple small plasmids that are all very different from each other, then Unicycler will probably assemble them separately with only the Illumina reads - easy! But if there are multiple small plasmids which share a lot of sequence, they may tangle together in the Illumina assembly graph. This can be a problem because small plasmids are sometimes very underrepresented in long read sequencing, so Unicycler may not have enough long reads to separate them.
This example looks at one such case. Here is the Unicycler output in the 'Bridged assembly graph' section towards the end of the pipeline:
Component   Segments   Links      Length         N50   Longest segment   Status
    total         10      12   5,701,892   5,264,363         5,264,363
        1          1       1   5,264,363   5,264,363         5,264,363   complete
        2          1       1     187,611     187,611           187,611   complete
        3          1       1     147,931     147,931           147,931   complete
        4          1       1      89,345      89,345            89,345   complete
        5          6       8      12,642       4,264             4,414   incomplete
One incomplete component! Here's what the graph looks like in Bandage, and it's obvious which component is incomplete:
It's a bit hard to see what's going on, so in Bandage I set the scope to just the contigs in the incomplete component and increased the 'Node length per Megabase' setting to a very high value (100000) to stretch them out. Redrawing the graph makes it much clearer:
The three main contigs have different depths, so it looks to be three small plasmids with a bit of common sequence tangling them together. However, it would be nice if we could find long-read support for this. While the small plasmids may be underrepresented in the long reads, there may still be some informative long reads. I therefore exported the three main contigs in this tangle and looked for long reads which align (see Read extraction). A couple dozen long reads came up, and I did an in-Bandage BLAST search for them.
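The step of finding long reads that hit the tangle can be sketched in a few lines of Python. This is only an illustration and not necessarily the method described on the Read extraction page. It assumes the long reads have already been aligned to the exported contigs by some aligner that writes PAF (tab-separated, with the read name in column 1, the read start and end in columns 3 and 4, and the contig name in column 6). The function name, the contig names and the 1000-base threshold are all hypothetical.

```python
from collections import defaultdict


def reads_hitting_contigs(paf_lines, contig_names, min_aligned_bases=1000):
    """Return {read_name: aligned_bases} for reads aligning to the given contigs.

    aligned_bases is the sum of the read-side alignment spans, so alignments
    that overlap on the read are counted twice. That is fine for a rough
    filter, but it is not an exact coverage figure.
    """
    aligned = defaultdict(int)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # PAF has 12 mandatory columns
            continue
        read_name, target_name = fields[0], fields[5]
        if target_name in contig_names:
            aligned[read_name] += int(fields[3]) - int(fields[2])
    # Keep only reads with enough aligned sequence to be worth inspecting.
    return {name: total for name, total in aligned.items()
            if total >= min_aligned_bases}
```

The resulting read names can then be used to pull those reads out of the full read set for the in-Bandage BLAST search.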
This read is clearly from one of the plasmids, but it's not very helpful because it doesn't span the repeat. A lot of the alignments may look something like this.
This read is much more informative. The start and end of the alignment are adjacent in the top-right contig and it spans the repeat. This strongly suggests that the top-right contig is indeed a circular sequence and this read spans the entire thing. The other BLAST hit to the bottom contig is nothing to worry about - just some homologous sequence. The gap in the alignment isn't a cause for concern either - long reads sometimes have low quality regions.
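The reasoning here (the alignment runs off the end of the contig, crosses the repeat, and re-enters at the start of the same contig) can be written as a small check. The sketch below is a simplification under stated assumptions: hits are given in forward-strand coordinates only, the `Hit` type and function name are hypothetical, and `end_slack` and `max_read_gap` are made-up tolerances that would need to be tuned to the repeat length in a real graph.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment of a read to a contig (0-based, end-exclusive)."""
    read_start: int
    read_end: int
    contig_start: int
    contig_end: int


def spans_circular_junction(hits, contig_length, end_slack=100, max_read_gap=2000):
    """True if two consecutive hits wrap from the contig's end to its start.

    The stretch of read between the two hits is where the repeat sequence
    would sit, so a modest gap (up to max_read_gap) is allowed.
    """
    ordered = sorted(hits, key=lambda h: h.read_start)
    for first, second in zip(ordered, ordered[1:]):
        reaches_end = contig_length - first.contig_end <= end_slack
        restarts_at_start = second.contig_start <= end_slack
        read_gap = second.read_start - first.read_end
        if reaches_end and restarts_at_start and -end_slack <= read_gap <= max_read_gap:
            return True
    return False
```

A read that passes this check is consistent with the contig plus the repeat forming a circle. A read that fails it, like the one in the previous example, is simply uninformative rather than evidence against circularity.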
This read similarly supports the bottom contig as a single plasmid. By process of elimination, we can now be pretty sure that each of the three main contigs is a separate plasmid.
This read is interesting. It spans the repeat in a way that supports the top-left contig as a plasmid. But Bandage visualises the read alignment using a rainbow colour, so where is the first half of the read?! Chimeric reads - reads which consist of two or more distinct pieces of DNA - are not uncommon in long read sequencing. This read is probably a chimera, and its first half is probably from somewhere else in the genome, e.g. the chromosome. What's important is that the read spans the repeat, so the missing front half isn't a cause for concern.
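One way to flag reads like this is to list the stretches of the read that no alignment covers. The sketch below is an assumption-laden illustration: the function name and the 500-base threshold are hypothetical, and a long unaligned stretch is only consistent with a chimera, since it could equally be a low-quality region or sequence that was not among the contigs searched.

```python
def unaligned_stretches(read_length, aligned_intervals, min_length=500):
    """Return (start, end) read intervals not covered by any alignment.

    aligned_intervals holds (read_start, read_end) pairs, 0-based and
    end-exclusive. Only gaps of at least min_length bases are reported.
    """
    gaps = []
    position = 0  # end of the alignments seen so far
    for start, end in sorted(aligned_intervals):
        if start - position >= min_length:
            gaps.append((position, start))
        position = max(position, end)
    if read_length - position >= min_length:
        gaps.append((position, read_length))
    return gaps
```

For a read whose alignments to the tangle only begin halfway along, this reports the whole front half as unaligned, which matches the missing first half of the rainbow in Bandage.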
Now that we're confident that the three plasmids should indeed be separate, we need to pull them apart, including the repeat sequence in each. For this I use Bandage's graph editing functions. By duplicating repeat contigs, deleting edges and merging contigs, we can produce three circular contigs:
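Bandage does this interactively, but it may help to see what the result amounts to: each plasmid is its unique contig joined to a copy of the repeat it shares with the others. The sketch below builds one such sequence from a path of signed segment names. It assumes the links in the graph carry no overlap, and the segment names and sequences used with it are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A/C/G/T only; other characters are kept as-is)."""
    return seq.translate(_COMPLEMENT)[::-1]


def circular_sequence(path, segments):
    """Join the segments along a path such as ["7+", "9-"] into one sequence.

    path: segment names with a trailing '+' or '-' for strand.
    segments: dict mapping segment name to its forward-strand sequence.
    Assumes zero-overlap links, so sequences are simply concatenated. The
    last segment is taken to link back to the first, closing the circle.
    """
    pieces = []
    for step in path:
        name, strand = step[:-1], step[-1]
        if strand not in "+-":
            raise ValueError(f"missing strand on path step: {step!r}")
        seq = segments[name]
        pieces.append(seq if strand == "+" else reverse_complement(seq))
    return "".join(pieces)
```

Because the repeat is copied into every plasmid that passes through it, the three resulting sequences together are longer than the 12,642 bp of the original tangled component.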
Here's the entire assembly (don't forget to bring the 'Node length per Megabase' setting back to its default):
We then save the assembly to GFA/FASTA and we're done!