Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Canu parameter settings to assemble ONT R10.4.1 with SQK-LSK112 data #2131

Closed
thallinger opened this issue Jun 5, 2022 · 14 comments
Closed

Comments

@thallinger
Copy link

We are assembling ONT sequencing data from a yeast produced on a R10.4.1 flow cell using SQK-LSK112
chemistry, which should yield Q20 reads based on ONT's description.
Base calling was performed with guppy 5.1.13.

The average error rate based on the quality values of the (porechop) trimmed reads is 1.55%.
The error rate based on a mapping of the trimmed reads to a known reference with minimap2 and
samtools stats is 2.24%.

Assembly of the data (~85-fold coverage) with canu 2.2 -nanopore yields a less continuous genome compared to
assemblies of ONT data with a higher error rate. In addition, ~10 alleles of the mitochondrial genome are
created, which are not present in comparable assemblies.

We have consulted #1985 and #1715, but are unsure how to translate our observed error rate to the
appropriate canu parameters.

What assembly settings would you recommend?

@gringer
Copy link

gringer commented Jun 6, 2022

I'm commenting here because one of my reports was mentioned; Sergey will have a better idea of canu's performance and optimal parameters for this situation.

Based on my own experience with plasmid assembly, my recommendation would be to first skim off the most accurate reads (based on mean Q value) down to around 40X coverage, then assemble using canu in pacbio-hifi mode, as indicated in #2121. I have a vague memory that the hifi mode includes homopolymer compression when mapping, which can help reduce mapping errors for nanopore reads. Start with default options first, and add more if there are still issues:

canu -pacbio-hifi ont_filtered.fastq genomeSize=14M

After assembly, consider doing additional correction using medaka.

My own approach for filtering out low quality reads can be found here (but bear in mind that it assumes most of the reads are single-cut full-sequence reads from small plasmids).

@skoren
Copy link
Member

skoren commented Jun 8, 2022

I'm not sure what you mean by: "yields a less continuous genome compared to assemblies of ONT data with a higher error rate." Is this you running Canu with default error rate and an adjusted one? Generally there's not much harm in running a higher error rate. Canu will adjust it down as needed, it just will cost you compute/runtime. Multiple allels of a mitochondria are common, it likely results from variability between mitochondria in the sample. The higher the accuracy the more likely you are to separate those variations (e.g. in hifi assemblies you can have hundreds of mitochondria copies).

I second what @gringer said though. With high accuracy ONT data, I'd use the hifi settings as recommended here: https://canu.readthedocs.io/en/latest/quick-start.html#assembling-with-multiple-technologies-and-multiple-files. I'd start with the errorRate and batOptions listed there.

@thallinger
Copy link
Author

Thanks, @gringer for your answer and the suggestions.
We extracted reads with the highest average quality to yield 40x-coverage; the least accurate read had an error-rate of 1.17%, the average was 0.68%.

Assembling these reads with canu -pacbio-hifi ont_filtered.fastq genomeSize=xxM resulted in contigs covering 2.8% of the genome and left 98% of the reads unassembled.
Adding trimming with canu -untrimmed -pacbio-hifi ont_filtered.fastq genomeSize=xxM did improve the assembly marginally only: contigs cover 3.9% of the genome and 97% of the reads are unassembled.

An assembly with canu -nanopore ont_filtered.fastq genomeSize=xxM resulted in contigs covering 100% of the genome and left 0.2% of the reads unassembled. However, it is even less contiguous than the one with the unfiltered data. The multiple alleles of the mitochondria were reduced to 2, though.

It seems that the starting point for the assembly of the plasmid reads is quite different to the one for the yeast read assembly.

@thallinger
Copy link
Author

Thanks, @skoren for your answer.

I'm not sure what you mean by: "yields a less continuous genome compared to assemblies of ONT data with a higher error rate."

Sorry, I meant "... less contiguous ...".

Is this you running Canu with default error rate and an adjusted one?

This is an assembly with the default error rate, as we are not sure how to adapt the parameters to our raw read error rate of 2.24%.

Generally there's not much harm in running a higher error rate. Canu will adjust it down as needed, it just will cost you compute/runtime.

Is there a way to see to what error rate Canu adjusted it to?

Multiple allels of a mitochondria are common, it likely results from variability between mitochondria in the sample. The higher the accuracy the more likely you are to separate those variations (e.g. in hifi assemblies you can have hundreds of mitochondria copies).

We can rule out variability between mitochondria in the sample, as we mapped Illumina data obtained from the same DNA extraction to the known reference. There, no SNPs, InDels or SVs are present.

I'd start with the errorRate and batOptions listed there.

We will give it a try.

@skoren
Copy link
Member

skoren commented Jun 9, 2022

The default is less contiguous than what though? What two assemblies were you comparing? The report will tell you the selected error rate in the unitigging step.

@zacksaud
Copy link

@skoren quick question, with the modified errorRate and batOptions for HQ Nanopore data listed here: https://canu.readthedocs.io/en/latest/quick-start.html#assembling-with-multiple-technologies-and-multiple-files.

Does it effect the Correction and Trimming steps in anyways, or does it just effect the end Assembler step? IE, are there better ways to configure the Correction and Trim steps for R10.4.1 flow cell + SQK-LSK112 data.

Many thanks in advance

@brianwalenz
Copy link
Member

This command?

canu \
 -p ecoli -d ecoli-oxford-uncorrected \
 genomeSize=4.8m \
 -untrimmed \
 correctedErrorRate=0.12 \
 maxInputCoverage=100 \
 'batOptions=-eg 0.10 -sb 0.01 -dg 2 -db 1 -dr 3' \
 -pacbio-hifi ecolk12mg1655_R10_3_guppy_345_HAC.fastq

Most of those options are to undo some of what -pacbio-hifi does:

  • -untrimmed will re-enable trimming. (correction is still disabled)
  • correctedErrorRate will increase the trimming/assembling overlap error rate from HiFi rates of 2.5%/1.0% respectively to 12%. I can't imagine changes to this will improve trimming; decreasing it generally improves runtime, until it's too low, then trimming just fails.
  • maxInputCoverage increases allowed input coverage from the Canu default 40x (HiFi default 50x) to 100x.
  • batOptions affects only 4-unitigger.

I don' think we've tried any more experiments on Q20 ONT data and any further tuning will probably need to be algorithmic.

@gringer
Copy link

gringer commented Jun 14, 2022

I'm curious as to how you extracted reads to 40X coverage. Covering only 4% of the genome suggests that it might only be 40 reads that average about 12kb in length (which is far too few). Have you done a sanity check to make sure that the total length of the input reads is about 40 times the genome length (i.e. around 480 Mb for a 12 Mb genome)?

If the total read length is as expected, I'd then shift to wondering about what precisely it is that you're trying to assemble. Are you sure that there's no bacterial contamination in the sample? You mentioned "mitochondrial genome", so you may need to split your reads into at least mitochondrial and non-mitochondrial sequences; what proportion of reads map to the mitochondrial genome? what about a closely related assembled yeast genome?

@skoren
Copy link
Member

skoren commented Jun 14, 2022

There is no way either canu -pacbio-hifi ont_filtered.fastq genomeSize=xxM or canu -untrimmed -pacbio-hifi ont_filtered.fastq genomeSize=xxM would work on ONT data so it's not surprising you got nothing assembled. That assumes an overlap error rate of 1% which is less than 0.5% per read, clearly way too low given your data. So I'd just ignore any assembly generated like that.

The parameters I suggested adjust this rate up to undo some of the HiFi defaults as @brianwalenz explained. I would also use unfiltered data, let the trimming and overlaps filter bad reads not a pre-processor. We always assemble all data. There is no correction step with these parameters just trimming. It's possible you could adjust the trimming error rate and thresholds to get better assemblies but I wouldn't worry about that for now. I also would not worry about multiple mitochondria being assembled, if it's not true variation it's likely recurrent ONT errors that are well-supported because the mitochondria have much higher coverage than the nuclear genome. You can always filter them after assembly.

Have you actually run through an assembly with the unfiltered data and the parameters I suggested? How does it compare to the assemblies you already had.

@thallinger
Copy link
Author

@gringer, thanks for your further comments,

I'm curious as to how you extracted reads to 40X coverage

I used seqkit seqkit -Q 19.3 which yielded the following 40x dataset

   num_seqs      sum_len  min_len  avg_len  max_len
     46,319  380,712,591      886  8,219.4   53,644

Covering only 4% of the genome suggests that it might only be 40 reads ...

The reads had 40x coverage (see above), the contigs of the assembly covered only 4% of the expected genome length.

Are you sure that there's no bacterial contamination in the sample?

Yes, we are sure. There are no contaminations whatsoever; 99% of the filtered reads map to the assembled genome.

@thallinger
Copy link
Author

The default is less contiguous than what though? What two assemblies were you comparing?

We are comparing the assembly of the ONT-LSK112 data with several other assemblies of the same strain based on LSK109 data. This datasets have a mapping based error rate between 3.7% and 4.2%.

Have you actually run through an assembly with the unfiltered data and the parameters I suggested? How does it compare to the assemblies you already had.

We performed assemblies with the default settings ( genomeSize=xxM -nanopore) and the suggested settings
(genomeSize=xxM -untrimmed correctedErrorRate=0.12 'batOptions=-eg 0.10 -sb 0.01 -dg 2 -db 1 -dr 3' -pacbio-hifi)
for both the unfiltered (~85-fold coverage) and the quality filtered (~40-fold coverage) datasets (runtime is based on assemblies on the same dedicated node of a HPC cluster):

 Settings     Coverage  TotalContigs   Chrom   Mito  rDNA       HH:MM:SS   Unitigging error rate
 Default          85x          26         6     14      6        3:47:38        4.3453% 
 Suggested        85x          17        13      2      2       72:29:15        3.5832% 
 Default          40x          12         8      1      3          54:53        1.1152% 
 Suggested        40x          24        21      1      2       12:15:56        1.2754% 

We would expect 4 chromosomal contigs, 1 mitochondrial and 3 rDNA contigs. The contiguity seems to be reduced with the suggested settings, whereas the runtime is considerably increased. The adapted error rate seems quite conservative compared to the measured one of 2.24% for the 85x dataset.

I also would not worry about multiple mitochondria being assembled, if it's not true variation it's likely recurrent ONT errors that are well-supported because the mitochondria have much higher coverage than the nuclear genome.

It would be nice, if we could tweak a parameter to avoid them alltogether; if you look at the read support you see any number between 1 and 701; the correct one has a support of 145 (expected would be ~2300).

@thallinger
Copy link
Author

Thanks, @brianwalenz for your comments.

I don' think we've tried any more experiments on Q20 ONT data and any further tuning will probably need to be algorithmic.

If I understand you correctly, any improvements in assembling Q20 ONT data would have to come from changes to /extensions of the canu code?

@skoren
Copy link
Member

skoren commented Jun 30, 2022

Sorry, I'm not sure what you're asking exactly. The -pacbio-hifi I suggested were because you were not getting a complete genome assembled. I assume since you were able to compare metrics, both the default and the suggested are assembling the complete genome? Thus, the question was addressed.

Given the results, I'd say your data is not as high quality as you think and/or the error falls outside of simple-sequence/homopolymers repeats. The median error rate may be 2.24% but I think there's a long tail of higher error rate sequences that end up breaking up the assembly with the suggested parameters. The error rate reported is the overlap. so it's about double the mapped error rate, it looks consistent with your median mapping estimate. So, I'd stick with either the 40x or 85x default results.

As for new developments, we're not adding features/developing Canu further.

@skoren
Copy link
Member

skoren commented Jul 8, 2022

Idle, assembly with -nanopore and -hifi worked, one was just less continuous.

@skoren skoren closed this as completed Jul 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants