Comparing against simulated nanopore data. #23

humbleflowers · 2024-11-01T10:14:49Z

Hello Team,

I simulated a shotgun metagenomic Nanopore dataset using NanoSim, containing:
65% bacterial sequences from 32 bacterial species
35% human sequences

For profiling, I created a Sylph database with ~5K microbial species from RefSeq (excluding human sequences) and ran Sylph with the -u option. However, the bacterial sequence abundance in the results is reported as 100%, with no unclassified reads.

In contrast, profiling the same dataset with Kraken2 resulted in 85% unclassified reads due to the human sequences.

What configurations or steps should I consider to ensure that Sylph correctly identifies unclassified reads, particularly for the human sequences?

sylph command line used
sylph sketch -r {input.fastq}/{wildcards.sample}.clean.fastq.gz
sylph query ./sylph-5k-PBFAV-v2.syldb {output.sketch} -t 30 > {wildcards.sample}.ani_queries.tsv
sylph profile -u ./sylph-5k-PBFAV-v2.syldb {output.sketch} -t 30 -o {wildcards.sample}.profile.tsv

head of sylph taxprofile output

clade_name      relative_abundance      sequence_abundance      ANI (if strain-level)    Coverage (if strain-level)
d__Bacteria     99.99989999999998       100.00000000000001      NA      NA
d__Bacteria|p__Bacteroidota     59.7845 60.69369999999999       NA      NA
d__Bacteria|p__Campylobacterota 11.6763 8.6889  NA      NA
d__Bacteria|p__Pseudomonadota   11.299099999999997      14.9092 NA      NA
d__Bacteria|p__Bacillota        5.7598  4.2177  NA      NA
d__Bacteria|p__Bacillota_A      5.152   5.1825  NA      NA
d__Bacteria|p__Desulfobacterota 2.6447  3.3607  NA      NA
d__Bacteria|p__Actinomycetota   2.6022  1.9832999999999998      NA      NA
d__Bacteria|p__Bacillota_C      1.0813  0.964   NA      NA


bluenote-1577 · 2024-11-01T11:40:40Z

Hi

The -u option is harder to use with nanopore data because of issues with read accuracy. It can give odd results, especially when low read accuracy is combined with host DNA.

Look at the cookbook and what it says about the -I setting for read accuracy. If you know the read accuracy, you can manually tune the -I setting.

Let me know,

Jim

humbleflowers · 2024-11-01T12:06:37Z

Thank you @bluenote-1577
Are you talking about the --read-seq-id parameter in the profile command?

bluenote-1577 · 2024-11-01T15:21:00Z

Yes, exactly. Sylph tries to estimate the read sequence identity automatically, but it may fail in your use case because of the high proportion of host DNA.
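
For illustration, a minimal sketch of passing the read identity by hand, reusing the database path and thread count from the command earlier in the thread. The identity value of 95, the sketch filename, and the output filename are placeholders and not measured values, and the flag is assumed to take the identity as a percentage; check `sylph profile --help` and the cookbook before relying on it. The script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
set -eu

# Placeholder values: replace them with ones that match your data.
DB="./sylph-5k-PBFAV-v2.syldb"   # database path from the original command
SKETCH="sample.sylsp"            # hypothetical read sketch filename
OUT="sample.profile.tsv"         # hypothetical output filename
READ_ID=95                       # ASSUMED read identity in percent, not measured

# Build the profile command with the read identity set manually
# instead of relying on automatic estimation.
CMD="sylph profile -u --read-seq-id $READ_ID $DB $SKETCH -t 30 -o $OUT"

# Print the command for review.
echo "$CMD"

# Run it only when explicitly requested and sylph is on PATH.
# Unquoted expansion is intentional; it assumes paths without spaces.
if [ "${RUN:-0}" = 1 ] && command -v sylph >/dev/null 2>&1; then
  $CMD
fi
```

Since the dataset is simulated, the error profile used for simulation gives a reasonable starting point for the identity value, which can then be tuned until the unclassified fraction looks plausible.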
