Extra rows and taxaIDs? #38
Hi @susheelbhanu. What are the contents of the input list? I'm not sure what would cause the results to be larger than the input. Which versions of TaxonKit and PyTaxonKit do you have installed?
Hey @standage, thanks for the quick reply. Here are the versions:
>>> pytaxonkit.__taxonkitversion__
'taxonkit v0.17.0'
>>> pytaxonkit.__version__
'0.8'
And this is what names contains:
>>> names[:10]
['KD4-96', 'Candidatus Udaeobacter', 'Bacillales', 'Bacillaceae', 'KD4-96', 'Candidatus Udaeobacter', 'Candidatus Nitrocosmicus', 'Micrococcaceae', 'MB-A2-108', 'Gaiella']
What is the length of
17835
So it has 1 less element than
Sorry, typo.
>>> length_of_names = len(names)
>>>
>>> print("Length of names:", length_of_names)
Length of names: 17836
Sorry, typo.
>>> length_of_names = len(names)
>>>
>>> print("Length of names:", length_of_names)
Length of new_names: 17836
This is unexpected behavior indeed. I'm not sure there's much more I can do unless you can share the entire contents of
Thank you, I'm happy to share the file later tonight. And yes, I'm trying to keep the shape so as to merge it later with another file. Appreciate your help with this!
Here's the file and how I get the list of names:
import pandas as pd
import pytaxonkit

# Read the 16S taxonomy table
agrii_tax = pd.read_csv("TaxaId16s.csv", header=0)
# Drop the unnamed index column
agrii_tax = agrii_tax.drop(columns=['Unnamed: 0'])
# Rename the 'ASVrank' column to 'ASV'
agrii_tax = agrii_tax.rename(columns={'ASVrank': 'ASV'})
# Move 'ASV' to the first column
cols = ['ASV'] + [col for col in agrii_tax.columns if col != 'ASV']
agrii_tax = agrii_tax[cols]
# Create a new column 'name' holding the most specific non-NaN rank in each row
# (back-fill from 'domain' toward 'species', then take the first column)
agrii_tax['name'] = agrii_tax[['species', 'genus', 'family', 'order', 'class', 'phylum', 'domain']].bfill(axis=1).iloc[:, 0]
# Replace remaining NaN values in 'name' with 'unclassified'
# (assign the result back; in-place fillna on a selected column is unreliable in recent pandas)
agrii_tax['name'] = agrii_tax['name'].fillna('unclassified')
# Extract the 'name' column as a list
names = agrii_tax['name'].tolist()
# Run pytaxonkit.name2taxid with the names
taxid_results = pytaxonkit.name2taxid(names)
# View the results
print(taxid_results)
Thank you!
Ok, I understand the issue a bit better now. It doesn't appear to be an issue with TaxonKit or PyTaxonKit, but an artifact of the NCBI Taxonomy. To investigate, I discarded all of the unclassified values, kept the remaining unique values, and performed the lookup on those.
>>> mynames = list(set([n for n in names if n != "unclassified"]))
>>> len(mynames)
827
>>> taxid_results = pytaxonkit.name2taxid(mynames)
>>> taxid_results
                     Name    TaxID   Rank
0           Aeromicrobium     2040  genus
1            Pir2 lineage     <NA>   <NA>
2       Streptosporangium     2000  genus
3             Pedosphaera  1032526  genus
4         Polycyclovorans  1274363  genus
..                    ...      ...    ...
843             Duganella    75654  genus
844             Emticicia   312278  genus
845  Pleurocapsa PCC-7319     <NA>   <NA>
846            GWC2-45-44     <NA>   <NA>
847      Cyanobacteriales     <NA>   <NA>
[848 rows x 3 columns]
The 827 unique names produced 848 rows, so there must be some duplicated values. I found them with the following code.
>>> taxid_results[taxid_results.Name.duplicated(keep=False)].sort_values("Name")
               Name    TaxID          Rank
418  Actinobacteria   201174        phylum
417  Actinobacteria   201174        phylum
762         Archaea     2157  superkingdom
761         Archaea     2157  superkingdom
214        Bacillus     1386         genus
215        Bacillus    55087         genus
830        Bacteria        2  superkingdom
831        Bacteria        2  superkingdom
832        Bacteria   629395         genus
82            Bosea    85413         genus
83            Bosea   169215         genus
768     Chloroflexi   200795        phylum
767     Chloroflexi    32061         class
568   Cyanobacteria     1117        phylum
567   Cyanobacteria     1117        phylum
177    Diplosphaera   381755         genus
178    Diplosphaera  1148783         genus
331      Firmicutes     1239        phylum
332      Firmicutes     1239        phylum
736        Gordonia    79255         genus
735        Gordonia     2053         genus
416          Labrys  2066135         genus
415          Labrys   204476         genus
584      Leptothrix       88         genus
585      Leptothrix  1907117         genus
758      Longispora   203522         genus
759      Longispora  2759766         genus
380      Nitrospira     1234         genus
381      Nitrospira   203693         class
187      Paracoccus      265         genus
188      Paracoccus   249411         genus
802  Planctomycetes      112         order
803  Planctomycetes   203682        phylum
804  Planctomycetes   203683         class
792  Proteobacteria     1224        phylum
793  Proteobacteria     1224        phylum
227     Rhodococcus     1827         genus
228     Rhodococcus  1661425         genus
311      Syntrophus  1671858         genus
310      Syntrophus    43773         genus
It turns out that some of these names are associated with multiple entries in the NCBI taxonomy files (names.dmp). Some of these entries are redundant (same name from different sources with the same taxid), while other names actually refer to different taxa. I'm afraid that resolving these nomenclature issues to identify the "correct" taxid for each name is outside the scope of pytaxonkit.
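The distinction drawn above, between redundant entries (same name, same taxid) and names that refer to different taxa, can be checked mechanically. The following is only a rough sketch: it does not call pytaxonkit, and instead uses a small hand-built DataFrame (`results`, a stand-in assumed to have the same `Name`/`TaxID`/`Rank` columns as the output shown in this thread) populated with a few rows quoted above.

```python
import pandas as pd

# Hand-built stand-in for a name2taxid result, using rows quoted in this thread.
results = pd.DataFrame(
    {
        "Name": ["Actinobacteria", "Actinobacteria", "Bacillus", "Bacillus",
                 "Bacteria", "Bacteria", "Bacteria", "Aeromicrobium"],
        "TaxID": [201174, 201174, 1386, 55087, 2, 2, 629395, 2040],
        "Rank": ["phylum", "phylum", "genus", "genus",
                 "superkingdom", "superkingdom", "genus", "genus"],
    }
)

# Exact repeats (same name, same taxid) carry no extra information.
deduplicated = results.drop_duplicates(subset=["Name", "TaxID"])

# Names that still map to more than one taxid are genuinely ambiguous.
taxids_per_name = deduplicated.groupby("Name")["TaxID"].nunique()
ambiguous_names = sorted(taxids_per_name[taxids_per_name > 1].index)

# Names that were repeated in the raw result but collapse to a single taxid.
repeated_names = set(results.loc[results["Name"].duplicated(keep=False), "Name"])
redundant_names = sorted(repeated_names - set(ambiguous_names))

print("redundant only:", redundant_names)
print("ambiguous:", ambiguous_names)
```

Dropping the redundant rows is safe; choosing among the taxids of an ambiguous name still needs outside knowledge (for example the expected rank or lineage), which this sketch does not attempt.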
Thank you so much @standage! I understand that it's outside the scope, but at least it's good to know where the discrepancy is coming from. Appreciate your prompt help with this. Best,
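For the stated goal of keeping the result the same shape as the input so it can be merged with another file, one possible approach is to build a one-to-one lookup from the unambiguous names and map it back onto the original list, leaving ambiguous names unassigned. This is a sketch under assumptions, not something pytaxonkit does for you: `results` is a hand-built stand-in for the `name2taxid` output, and `names`, `lookup` and `per_input` are illustrative names.

```python
import pandas as pd

# Hand-built stand-in for a name2taxid result (rows quoted in this thread).
results = pd.DataFrame(
    {
        "Name": ["Aeromicrobium", "Actinobacteria", "Actinobacteria",
                 "Bacillus", "Bacillus", "Pir2 lineage"],
        "TaxID": pd.array([2040, 201174, 201174, 1386, 55087, pd.NA], dtype="Int64"),
    }
)

# Hypothetical input list with repeated names, as in a per-ASV table.
names = ["Bacillus", "Aeromicrobium", "Actinobacteria",
         "Bacillus", "Pir2 lineage", "Aeromicrobium"]

# Collapse redundant rows, then keep only names left with a single row.
unique_pairs = results.drop_duplicates(subset=["Name", "TaxID"])
is_unambiguous = ~unique_pairs["Name"].duplicated(keep=False)
lookup = unique_pairs[is_unambiguous].set_index("Name")["TaxID"]
ambiguous = sorted(set(unique_pairs.loc[~is_unambiguous, "Name"]))

# One row per input name, in the original order; ambiguous names stay missing.
per_input = pd.DataFrame({"name": names})
per_input["taxid"] = per_input["name"].map(lookup)

print(per_input)
print("left unresolved:", ambiguous)
```

Because the lookup has at most one taxid per name, the mapped table always has exactly `len(names)` rows. The names listed as unresolved would need to be settled by hand.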
Hi,
Firstly, thank you for this amazing tool! I have a question regarding possible duplicates when running name2taxid on a larger list.
My list (column: name) below contains 17836 elements.
However, when I run the name2taxid conversion on them I get the following:
21969 rows in the results compared to 17835 in the input. Is it possible that some 'names' are getting duplicate taxaIDs?
Thank you for your help with this,
Susheel
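As a rough illustration of how the row count can grow, the arithmetic below assumes that every occurrence of a name in the input yields one result row per matching taxonomy entry. That assumption is consistent with the numbers in this thread but is not a documented guarantee. The match counts are taken from the duplicate table in the thread; the input list is hypothetical.

```python
from collections import Counter

# Matches per name, as seen in the duplicate table in this thread.
matches_per_name = {"Bacillus": 2, "Planctomycetes": 3, "Aeromicrobium": 1}

# Hypothetical input in which the same names recur many times.
names = ["Bacillus"] * 3 + ["Planctomycetes"] * 2 + ["Aeromicrobium"] * 4

# Each occurrence contributes one row per match (assumed behaviour).
occurrences = Counter(names)
expected_rows = sum(
    count * matches_per_name.get(name, 1) for name, count in occurrences.items()
)
extra_rows = expected_rows - len(names)

print(len(names), "input names ->", expected_rows, "rows,", extra_rows, "extra")
```

A name with several matches that recurs often in the input multiplies its extra rows, which is how a few dozen ambiguous names could plausibly account for thousands of additional rows.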