Comparing against simulated nanopore data. #23

Open
humbleflowers opened this issue Nov 1, 2024 · 3 comments
Comments

@humbleflowers

humbleflowers commented Nov 1, 2024

Hello Team,

I simulated a shotgun metagenomic Nanopore dataset using NanoSim, containing:
65% bacterial sequences from 32 bacterial species
35% human sequences

For profiling, I created a Sylph database with ~5K microbial species from RefSeq (excluding human sequences) and ran Sylph with the -u option. However, the bacterial sequence abundance in the results is reported as 100%, with no unclassified reads.

In contrast, profiling the same dataset with Kraken2 resulted in 85% unclassified reads due to the human sequences.

What configurations or steps should I consider to ensure that Sylph correctly identifies unclassified reads, particularly for the human sequences?

sylph command line used
sylph sketch -r {input.fastq}/{wildcards.sample}.clean.fastq.gz
sylph query ./sylph-5k-PBFAV-v2.syldb {output.sketch} -t 30 > {wildcards.sample}.ani_queries.tsv
sylph profile -u ./sylph-5k-PBFAV-v2.syldb {output.sketch} -t 30 -o {wildcards.sample}.profile.tsv

head of sylph taxprofile output

clade_name                       relative_abundance  sequence_abundance  ANI (if strain-level)  Coverage (if strain-level)
d__Bacteria                      99.99989999999998   100.00000000000001  NA                     NA
d__Bacteria|p__Bacteroidota      59.7845             60.69369999999999   NA                     NA
d__Bacteria|p__Campylobacterota  11.6763             8.6889              NA                     NA
d__Bacteria|p__Pseudomonadota    11.299099999999997  14.9092             NA                     NA
d__Bacteria|p__Bacillota         5.7598              4.2177              NA                     NA
d__Bacteria|p__Bacillota_A       5.152               5.1825              NA                     NA
d__Bacteria|p__Desulfobacterota  2.6447              3.3607              NA                     NA
d__Bacteria|p__Actinomycetota    2.6022              1.9832999999999998  NA                     NA
d__Bacteria|p__Bacillota_C       1.0813              0.964               NA                     NA
@bluenote-1577
Owner

Hi

-u is harder to use with Nanopore data because of read accuracy issues. It can produce odd results, especially when low read accuracy is combined with host DNA.

Look at the cookbook and what it says about the -I setting for read accuracy. If you know the read accuracy, you can tune the -I setting manually.

Let me know,

Jim

@humbleflowers
Author

Thank you @bluenote-1577
Are you talking about the --read-seq-id parameter in the profile command?

@bluenote-1577
Owner

Yes, exactly. Sylph tries to estimate the read sequence identity automatically, but that estimate may fail in your use case because of the high proportion of host DNA.
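
For illustration, the suggestion above could be tried along these lines. This is only a sketch: the database path and the -u, -t and -o options come from the original command, --read-seq-id is the parameter confirmed above, and everything else is an assumption. In particular, the sketch file name sample.sylsp, the output name, and the identity value 95 (assumed here to be a percentage) are placeholders. The real value should come from the accuracy of the simulated reads, not from this example. The function only prints the command so it can be reviewed before running.

```shell
# Sketch only: build a `sylph profile -u` command with an explicit read identity
# instead of relying on automatic estimation.
DB="./sylph-5k-PBFAV-v2.syldb"   # database from the original command
SKETCH="sample.sylsp"            # hypothetical read sketch file name
READ_ID=95                       # placeholder identity, not a recommendation

build_profile_cmd() {
  # Print the command rather than executing it.
  printf 'sylph profile -u --read-seq-id %s %s %s -t 30 -o sample.profile.tsv\n' \
    "$READ_ID" "$DB" "$SKETCH"
}

build_profile_cmd
```

Comparing the unclassified fraction across a few READ_ID values against the known 35% human content of the simulated dataset would show whether the manual setting helps.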
