Missing species in inspect file after database build #853

Open
ksavhughes opened this issue Jul 9, 2024 · 1 comment

Comments

@ksavhughes

Hi Kraken2 developers/community,

I recently built a large Kraken2 database with genomes from the NCBI RefSeq database. I added genomes regardless of assembly level and limited the set to one assembly per species. After some testing, I found that some species appear to be missing from the database. Can someone tell me why this is?

I wanted to make sure everything was added correctly after the build, so I ran the inspect command and compared the taxids in the seqid2taxid.map file to the ones in the inspect file. There are 897 taxids in the seqid2taxid.map file that are not present in the inspect file. I did some digging, and it seems those species were not added because they don't have any unique minimizers. Can anyone confirm this?
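
The comparison described above can be sketched in Python roughly as follows. This is only a sketch under stated assumptions: seqid2taxid.map is assumed to hold one whitespace-separated `seqid taxid` pair per line, the inspect file is assumed to be tab-separated in Kraken2 report style with the taxid in the fifth column, and the function names and file paths are hypothetical.

```python
def taxids_from_seqid_map(lines):
    """Collect taxids from seqid2taxid.map lines (assumed: seqid, then taxid)."""
    taxids = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            taxids.add(fields[1])
    return taxids


def taxids_from_inspect(lines):
    """Collect taxids from inspect output (assumed: taxid in the 5th tab-separated column)."""
    taxids = set()
    for line in lines:
        # Skip blank lines and header lines that start with '#'.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5:
            taxids.add(fields[4].strip())
    return taxids


def missing_taxids(map_lines, inspect_lines):
    """Return taxids present in the map but absent from the inspect output."""
    missing = taxids_from_seqid_map(map_lines) - taxids_from_inspect(inspect_lines)
    return sorted(missing, key=int)  # assumes taxids are numeric


# Hypothetical usage with files on disk (paths are placeholders):
# with open("seqid2taxid.map") as m, open("inspect.txt") as i:
#     print(len(missing_taxids(m, i)))
```

Both inputs are reduced to sets, so several sequences mapped to the same taxid count once.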

This is the number of species missing per NCBI division:
598 Bacteria
37 Invertebrates
16 Phages
206 Plants and Fungi
1 Rodents
36 Vertebrates
3 Viruses

Notes from further investigating particular missing species

  • Invertebrates - Acropora genus
    • 6 missing species in genus (all mitos)
    • 16 species in database (2 wg 14 mito)
  • Rodents - Mus musculus domesticus
    • added to seqid2taxid.map properly
    • missing 1 mito (domesticus)
      • the mito in the inspect file with the fewest minimizers has 19 minimizers
    • In DB: Mus musculus has wg and 3 mitos from subspecies other than domesticus
  • Vertebrates:
    • Kali and Betta genera
      • kept 1 mitogenome from 1 species in genus (Betta - also 1 wg)
    • Some hybrids removed
    • Vipera berus - mito removed - no other genomes in genus, but other genomes in family
@slw287r

slw287r commented Jul 10, 2024

You can rerun Kraken2's inspect command (kraken2-inspect) with the --report-zero-counts option so that taxids without unique minimizers are written to the inspect file along with the others.
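
Once the inspect file has been regenerated with --report-zero-counts, the zero-count taxa can be pulled out with a short script. The sketch below assumes the output follows the Kraken2 report layout (percentage, clade minimizer count, taxon minimizer count, rank code, taxid, indented name, all tab-separated). The function name and file path are hypothetical.

```python
def zero_minimizer_taxa(inspect_lines):
    """Return (taxid, rank, name) for rows whose clade minimizer count is 0.

    Assumed columns (tab-separated): percent, clade count, taxon count,
    rank code, taxid, name.
    """
    rows = []
    for line in inspect_lines:
        # Skip blank lines and header lines that start with '#'.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        if int(fields[1]) == 0:
            rows.append((fields[4].strip(), fields[3].strip(), fields[5].strip()))
    return rows


# Hypothetical usage (path is a placeholder):
# with open("inspect_zero_counts.txt") as handle:
#     for taxid, rank, name in zero_minimizer_taxa(handle):
#         print(taxid, rank, name, sep="\t")
```

Filtering on the clade count rather than the taxon count avoids flagging internal nodes that have no minimizers of their own but whose descendants do.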
