Canu parameter settings to assemble ONT R10.4.1 with SQK-LSK112 data #2131
I'm commenting here because one of my reports was mentioned; Sergey will have a better idea of canu's performance and optimal parameters for this situation. Based on my own experience with plasmid assembly, my recommendation would be to first skim off the most accurate reads (based on mean Q value) down to around 40X coverage, then assemble using canu in
After assembly, consider doing additional correction using medaka. My own approach for filtering out low quality reads can be found here (but bear in mind that it assumes most of the reads are single-cut full-sequence reads from small plasmids).
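A minimal sketch of this kind of workflow, assuming a ~12 Mb yeast genome; filtlong is used here as a stand-in for the quality-based downsampling (the author's own filtering script is not reproduced), and the file names, thread count, and medaka model are placeholders:

```sh
# Keep the highest-quality reads down to ~40x of a ~12 Mb genome (~480 Mb total).
# filtlong ranks reads by quality and length and keeps the best until the
# --target_bases budget is reached (a stand-in for the author's own filter).
filtlong --target_bases 480000000 reads.fastq.gz | gzip > reads.40x.fastq.gz

# Assemble the skimmed reads with canu (prefix, directory, and genome size are placeholders).
canu -p asm -d asm_40x genomeSize=12m -nanopore reads.40x.fastq.gz

# Optional additional polishing pass with medaka; pick the model matching your
# basecaller and flow cell (the model shown is only a placeholder).
medaka_consensus -i reads.40x.fastq.gz -d asm_40x/asm.contigs.fasta \
    -o medaka_out -m r104_e81_sup_g5015 -t 8
```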
I'm not sure what you mean by: "yields a less continuous genome compared to assemblies of ONT data with a higher error rate." Is this you running Canu with the default error rate and an adjusted one? Generally there's not much harm in running a higher error rate; Canu will adjust it down as needed, it will just cost you compute/runtime.

Multiple alleles of a mitochondrion are common; they likely result from variability between mitochondria in the sample. The higher the accuracy, the more likely you are to separate those variations (e.g. in hifi assemblies you can have hundreds of mitochondrial copies).

I second what @gringer said though. With high-accuracy ONT data, I'd use the hifi settings as recommended here: https://canu.readthedocs.io/en/latest/quick-start.html#assembling-with-multiple-technologies-and-multiple-files. I'd start with the errorRate and batOptions listed there.
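For concreteness, a hedged sketch of the kind of hifi-style invocation being pointed at; -pacbio-hifi, correctedErrorRate, and batOptions are real canu options, but the numeric values below are illustrative stand-ins only, not the documented ones, and should be replaced with the values from the linked quick-start page:

```sh
# Sketch only: treat the high-accuracy ONT reads as HiFi-like input and relax
# the error thresholds. The correctedErrorRate and batOptions values below are
# illustrative placeholders, NOT the recommended ones; take the real values
# from the quick-start page linked above.
canu -p asm -d asm_hq genomeSize=12m \
     correctedErrorRate=0.05 \
     batOptions="-dg 3 -db 3 -dr 1" \
     -pacbio-hifi reads.fastq.gz
```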
Thanks, @gringer for your answer and the suggestions.
Assembling these reads with
An assembly with
It seems that the starting point for the assembly of the plasmid reads is quite different to the one for the yeast read assembly.
Thanks, @skoren for your answer.
Sorry, I meant "... less contiguous ...".
This is an assembly with the default error rate, as we are not sure how to adapt the parameters to our raw read error rate of 2.24%.
Is there a way to see what error rate Canu adjusted it to?
We can rule out variability between mitochondria in the sample, as we mapped Illumina data obtained from the same DNA extraction to the known reference. There, no SNPs, InDels or SVs are present.
We will give it a try.
The default is less contiguous than what though? What two assemblies were you comparing? The report will tell you the selected error rate in the unitigging step.
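One way to dig that value out of the report, assuming the default prefix/directory naming from a run like canu -p asm -d asm (the exact wording of the report lines varies between canu versions, so this is just a broad search):

```sh
# The <prefix>.report file sits in the assembly directory; the unitigging
# section records the error rate that was actually selected. Path is a placeholder.
grep -i -n "error" asm/asm.report
```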
@skoren quick question, with the modified errorRate and batOptions for HQ Nanopore data listed here: https://canu.readthedocs.io/en/latest/quick-start.html#assembling-with-multiple-technologies-and-multiple-files. Does it affect the Correction and Trimming steps in any way, or does it just affect the final Assembler step? I.e., are there better ways to configure the Correction and Trim steps for R10.4.1 flow cell + SQK-LSK112 data? Many thanks in advance
This command?
Most of those options are to undo some of what
I don't think we've tried any more experiments on Q20 ONT data and any further tuning will probably need to be algorithmic.
I'm curious as to how you extracted reads to 40X coverage. Covering only 4% of the genome suggests that it might only be 40 reads that average about 12kb in length (which is far too few). Have you done a sanity check to make sure that the total length of the input reads is about 40 times the genome length (i.e. around 480 Mb for a 12 Mb genome)?

If the total read length is as expected, I'd then shift to wondering about what precisely it is that you're trying to assemble. Are you sure that there's no bacterial contamination in the sample? You mentioned "mitochondrial genome", so you may need to split your reads into at least mitochondrial and non-mitochondrial sequences; what proportion of reads map to the mitochondrial genome? What about a closely related assembled yeast genome?
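A small sketch of these sanity checks, assuming placeholder file names for the filtered read set and the mitochondrial (or related yeast) reference:

```sh
# 1) Total bases in the filtered read set; for 40x of a ~12 Mb genome this
#    should come out at roughly 480 Mb.
seqkit stats -a reads.40x.fastq.gz

# 2) Rough proportion of reads mapping to the mitochondrial genome (repeat with
#    a closely related assembled yeast genome as the reference).
minimap2 -ax map-ont mito_ref.fasta reads.40x.fastq.gz | samtools flagstat -
```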
There is no way either
The parameters I suggested adjust this rate up to undo some of the HiFi defaults, as @brianwalenz explained. I would also use unfiltered data; let the trimming and overlaps filter bad reads, not a pre-processor. We always assemble all data. There is no correction step with these parameters, just trimming. It's possible you could adjust the trimming error rate and thresholds to get better assemblies, but I wouldn't worry about that for now.

I also would not worry about multiple mitochondria being assembled; if it's not true variation, it's likely recurrent ONT errors that are well-supported because the mitochondria have much higher coverage than the nuclear genome. You can always filter them after assembly.

Have you actually run through an assembly with the unfiltered data and the parameters I suggested? How does it compare to the assemblies you already had?
@gringer, thanks for your further comments,
I used seqkit
The reads had 40x coverage (see above); the contigs of the assembly covered only 4% of the expected genome length.
Yes, we are sure. There is no contamination whatsoever; 99% of the filtered reads map to the assembled genome.
We are comparing the assembly of the ONT-LSK112 data with several other assemblies of the same strain based on LSK109 data. These datasets have a mapping-based error rate between 3.7% and 4.2%.
We performed assemblies with the default settings (
We would expect 4 chromosomal contigs, 1 mitochondrial and 3 rDNA contigs. The contiguity seems to be reduced with the suggested settings, whereas the runtime is considerably increased. The adapted error rate seems quite conservative compared to the measured one of 2.24% for the 85x dataset.
It would be nice if we could tweak a parameter to avoid them altogether; if you look at the read support, you see any number between 1 and 701; the correct one has a support of 145 (expected would be ~2300).
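Short of a parameter tweak, one post-assembly option is to rank the mitochondrial contigs by their read support and discard the weakly supported copies. The sketch below assumes the contig FASTA headers carry a reads= attribute (as recent canu versions write, e.g. ">tig00000001 len=... reads=... class=contig ..."); the path is a placeholder:

```sh
# List each contig with its read support, sorted by support, so weakly
# supported mitochondrial copies can be identified and filtered out afterwards.
grep "^>" asm/asm.contigs.fasta \
    | sed 's/^>//' \
    | awk '{ for (i = 2; i <= NF; i++) if ($i ~ /^reads=/) print $1, $i }' \
    | sort -t= -k2,2n
```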
Thanks, @brianwalenz for your comments.
If I understand you correctly, any improvements in assembling Q20 ONT data would have to come from changes to, or extensions of, the canu code?
Sorry, I'm not sure what you're asking exactly. The
Given the results, I'd say your data is not as high quality as you think and/or the error falls outside of simple-sequence/homopolymer repeats. The median error rate may be 2.24%, but I think there's a long tail of higher error rate sequences that end up breaking up the assembly with the suggested parameters. The error rate reported is the overlap error rate, so it's about double the mapped error rate; it looks consistent with your median mapping estimate.

So, I'd stick with either the 40x or 85x default results. As for new developments, we're not adding features/developing Canu further.
We are assembling ONT sequencing data from a yeast produced on a R10.4.1 flow cell using SQK-LSK112 chemistry, which should yield Q20 reads based on ONT's description.
Base calling was performed with guppy 5.1.13. The average error rate based on the quality values of the (porechop) trimmed reads is 1.55%. The error rate based on a mapping of the trimmed reads to a known reference with minimap2 and samtools stats is 2.24%.
Assembly of the data (~85-fold coverage) with canu 2.2 -nanopore yields a less continuous genome compared to assemblies of ONT data with a higher error rate. In addition, ~10 alleles of the mitochondrial genome are created, which are not present in comparable assemblies.
We have consulted #1985 and #1715, but are unsure how to translate our observed error rate to the
appropriate canu parameters.
What assembly settings would you recommend?
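For reference, a minimal sketch of the mapping-based error-rate estimate described above (file names are placeholders; samtools stats reports an "error rate" line in its summary-numbers section):

```sh
# Map the trimmed reads to the known reference, then pull the error rate from
# the SN (summary numbers) section of samtools stats.
minimap2 -ax map-ont reference.fasta trimmed_reads.fastq.gz > mapped.sam
samtools stats mapped.sam | grep '^SN' | grep 'error rate'
```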