Add name2taxid function from taxonkit #6146

Merged: 38 commits, Aug 14, 2024
Changes from 35 commits
Commits
dda83a8
Add name2taxid function from taxonkit
SantaMcCloud Jul 13, 2024
698d432
rename file
SantaMcCloud Jul 13, 2024
28d1db1
rename file
SantaMcCloud Jul 13, 2024
306860f
rename file and change a param
SantaMcCloud Jul 13, 2024
d3876fb
a little fix
SantaMcCloud Jul 13, 2024
855fae6
change the test such that files will be compared
SantaMcCloud Jul 23, 2024
0af783a
make the name options a bit clear
SantaMcCloud Jul 31, 2024
022abb6
did an example for the show rank option
SantaMcCloud Jul 31, 2024
45b2a87
Update tools/taxonkit/taxonkit_name2taxid.xml
SantaMcCloud Jul 31, 2024
f244cdb
Update tools/taxonkit/taxonkit_name2taxid.xml
SantaMcCloud Jul 31, 2024
7d875e8
Update tools/taxonkit/taxonkit_name2taxid.xml
SantaMcCloud Jul 31, 2024
8ae7dbc
Update tools/taxonkit/taxonkit_name2taxid.xml
SantaMcCloud Jul 31, 2024
ae067f4
test for fix
SantaMcCloud Jul 31, 2024
1be6b8f
test commit
SantaMcCloud Jul 31, 2024
55ef8d0
Merge branch 'main' into taxonkit
SantaMcCloud Jul 31, 2024
ea0ae5e
Merge branch 'taxonkit' of https://github.com/SantaMcCloud/tools-iuc …
SantaMcCloud Jul 31, 2024
fb592ac
fix test
SantaMcCloud Aug 1, 2024
6cfc298
fix test
SantaMcCloud Aug 1, 2024
f50f0b3
commit to fix format problems
SantaMcCloud Aug 1, 2024
380719b
Update taxonkit_name2taxid.xml
SantaMcCloud Aug 1, 2024
65b1996
format fix maybe
SantaMcCloud Aug 1, 2024
b2f8360
delet file for reseting it on github
SantaMcCloud Aug 1, 2024
f93386a
add the deleted file to see if the format is fixes now
SantaMcCloud Aug 1, 2024
03dc04b
Update taxonkit_name2taxid.xml
SantaMcCloud Aug 1, 2024
5e4af94
foramted
SantaMcCloud Aug 1, 2024
12363c9
Update tools/taxonkit/taxonkit_name2taxid.xml
bgruening Aug 8, 2024
2fd395a
change such that the newest version will be dowanloaded and unpacked …
SantaMcCloud Aug 9, 2024
8fe321e
change but fomrat problem
SantaMcCloud Aug 9, 2024
5341887
fix file format
SantaMcCloud Aug 9, 2024
d96c35d
Apply suggestions from code review
bgruening Aug 9, 2024
ccd125e
Delete tools/taxonkit/test-data/test-db/names.dmp
SantaMcCloud Aug 14, 2024
d44f3dd
Delete tools/taxonkit/test-data/test-db/delnodes.dmp
SantaMcCloud Aug 14, 2024
214cc69
Update tools/taxonkit/taxonkit_name2taxid.xml
SantaMcCloud Aug 14, 2024
de7b816
Update tools/taxonkit/taxonkit_name2taxid.xml
SantaMcCloud Aug 14, 2024
7a9c364
revert deleting
SantaMcCloud Aug 14, 2024
49663eb
change value names
SantaMcCloud Aug 14, 2024
3d91034
fix test
SantaMcCloud Aug 14, 2024
e2f9f64
now fixed
SantaMcCloud Aug 14, 2024
1 change: 1 addition & 0 deletions tools/taxonkit/.shed.yml
@@ -16,3 +16,4 @@ suite:
description: "A suite of tools that brings the TaxonKit project into Galaxy."
long_description: |
TaxonKit is a set of tools for analyzing and manipulating taxonomic data, including converting metagenomic profile tables to CAMI format.

2 changes: 1 addition & 1 deletion tools/taxonkit/macros.xml
@@ -1,4 +1,4 @@
<macros>
<macros>
<xml name="requirements">
<requirements>
<requirement type="package" version="@TOOL_VERSION@">taxonkit</requirement>
134 changes: 134 additions & 0 deletions tools/taxonkit/taxonkit_name2taxid.xml
@@ -0,0 +1,134 @@
<tool id="name2taxid" name="Name2taxid" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="@PROFILE@">
<description>Convert taxon names to NCBI Taxids</description>
<macros>
<import>macros.xml</import>
</macros>
<expand macro="biotools"/>
<expand macro="requirements"/>
<command detect_errors="exit_code">
<![CDATA[

mkdir -p ../home/.taxonkit &&

#if $data.is_select == 'his':
ln -s '$taxdump' 'taxdump.tar.gz' &&
tar -xf 'taxdump.tar.gz' -C '.' &&
#else:
ln -s '$ncbi.fields.path/names.dmp' 'names.dmp' &&
Contributor

Do we still need delnodes.dmp and names.dmp in the test-db if we only use test.tar.gz?

Contributor Author

No, I will remove them.

ln -s '$ncbi.fields.path/merged.dmp' 'merged.dmp' &&
ln -s '$ncbi.fields.path/nodes.dmp' 'nodes.dmp' &&
ln -s '$ncbi.fields.path/delnodes.dmp' 'delnodes.dmp' &&
#end if

taxonkit name2taxid
--data-dir '.'
--name-field '$name_field'
$sci_name
$show_rank
'$input'
> '$output'
]]>
</command>
<inputs>
<param name="input" type="data" format="tabular" label="Input file" help="Any TSV file that contains NCBI taxon names. A plain .txt file also works, but it must contain only one name per row."/>
<param argument="--name-field" type="data_column" data_ref="input" label="Select column with the names" help="Select the column that contains the names"/>
<param argument="--sci-name" type="boolean" falsevalue="" truevalue="--sci-name" checked="false" label="Only search scientific names" help="With this option, non-scientific names are ignored in the search and yield no taxid. NOTE: non-scientific names still appear in the output, without a taxid."/>
<param argument="--show-rank" type="boolean" falsevalue="" truevalue="--show-rank" checked="false" label="Show rank" help="Add the rank of each matched taxon to the output. See the help section for an example."/>
<conditional name="data">
<param name="is_select" type="select" label="Use either a cached NCBI database or provide a downloaded version.">
<option value="dm">Cached database</option>
Member

This is really just nitpicking, but since these params are exposed to users in workflows etc., can we rename dm to cached and his to history?

<option value="his">History</option>
</param>
<when value="dm">
<param name="ncbi" type="select" label="NCBI database" help="Choose NCBI database version">
<options from_data_table="ncbi_taxonomy">
<validator message="No NCBI database is available" type="no_options"/>
</options>
</param>
</when>
<when value="his">
<param name="taxdump" type="data" format="tgz" label="Input the taxdump.tar.gz file"
help="You can find taxdump.tar.gz at ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"/>
</when>
</conditional>
</inputs>
<outputs>
<data name="output" format="tabular" label="Names2taxID"/>
</outputs>
<tests>
<test>
Contributor

Can you add a test for the rank option?

Contributor Author

There is no way I can add this, since I need specific lines from the original database and I don't know how they work together to produce the rank output. I did add an example to the help section showing how the output should look if you use a complete database.

<param name="input" value="name2taxid_test1.tsv" ftype="tabular"/>
<param name="name_field" value="1"/>
<param name="sci_name" value="True"/>
<conditional name="data">
<param name="is_select" value="dm"/>
<param name="ncbi" value="test-db-tox"/>
</conditional>
<output name="output" file="name2taxid_result1.tsv"/>
</test>
<test>
<param name="input" value="name2taxid_test2.tsv" ftype="tabular"/>
<param name="show_rank" value="True"/>
<conditional name="data">
<param name="is_select" value="dm"/>
<param name="ncbi" value="test-db-tox"/>
</conditional>
<param name="name_field" value="2"/>
<output name="output" file="name2taxid_result2.tsv"/>
</test>
<test>
<param name="input" value="name2taxid_test3.txt" ftype="tabular"/>
<param name="name_field" value="1"/>
<conditional name="data">
<param name="is_select" value="his"/>
<param name="taxdump" ftype="tgz" value="test.tar.gz"/>
</conditional>
<output name="output" file="name2taxid_result3.tsv"/>
</test>
</tests>
<help>
<![CDATA[

This tool converts NCBI taxon names to their corresponding taxids. Provide a TSV or TXT file and select the column that contains the names.

.. class:: infomark

Example

::

Homo sapiens
Akkermansia muciniphila ATCC BAA-835
Akkermansia muciniphila
Mouse Intracisternal A-particle

**sci_name option**

.. class:: infomark

For example, "Enterococcus coli" is not a scientific name. With this option enabled it is excluded from the search, so no taxid is reported for it, although the name still appears in the output. In contrast, "Drosophila" is a scientific name, so it is always searched whether the option is on or off.

**show_rank option**

.. class:: infomark

Here is an example of the output if you use the option:

::

Homo sapiens 9606 species
Akkermansia muciniphila ATCC BAA-835 349741 strain
Akkermansia muciniphila 239935 species
Mouse Intracisternal A-particle 11932 species

Without this option, the output will be:

::

Homo sapiens 9606
Akkermansia muciniphila ATCC BAA-835 349741
Akkermansia muciniphila 239935
Mouse Intracisternal A-particle 11932

]]>
</help>
<expand macro="citations"/>
</tool>
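The review thread above asks how the rank output is produced. In the standard NCBI taxdump layout, ranks are stored in nodes.dmp (taxid, parent taxid, rank, ...) rather than in names.dmp, so a test database presumably needs nodes.dmp rows for every queried taxid before `--show-rank` can print anything. The following is a minimal Python sketch of that join, not TaxonKit's implementation; `parse_dmp`, `load_ranks`, `add_rank`, and the sample rows are hypothetical names and data chosen for illustration.

```python
# Sketch only: illustrates where a rank column could come from,
# assuming the standard NCBI taxdump layout for nodes.dmp.

# Illustrative nodes.dmp rows: taxid | parent taxid | rank | ...
SAMPLE_NODES_DMP = (
    "9606\t|\t9605\t|\tspecies\t|\n"
    "349741\t|\t239935\t|\tstrain\t|\n"
)


def parse_dmp(text):
    """Yield the stripped fields of each '|'-delimited .dmp line."""
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 3:
            yield fields


def load_ranks(nodes_text):
    """Map taxid -> rank using the first and third nodes.dmp fields."""
    return {fields[0]: fields[2] for fields in parse_dmp(nodes_text)}


def add_rank(rows, ranks):
    """Append a rank to each (name, taxid) row; empty if the taxid is unknown."""
    return [(name, taxid, ranks.get(taxid, "")) for name, taxid in rows]


ranks = load_ranks(SAMPLE_NODES_DMP)
rows = [("Homo sapiens", "9606"), ("Akkermansia muciniphila", "239935")]
for row in add_rank(rows, ranks):
    print("\t".join(row))
```

In this sketch the second row has an empty rank because taxid 239935 has no row of its own in the sample nodes data. A reduced test database could show the same symptom if it lacks the matching nodes.dmp entries.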
5 changes: 5 additions & 0 deletions tools/taxonkit/test-data/name2taxid_result1.tsv
@@ -0,0 +1,5 @@
Homo sapiens 9606
Akkermansia muciniphila ATCC BAA-835 349741
Akkermansia muciniphila 239935
Mouse Intracisternal A-particle 11932
Enterococcus coli
4 changes: 4 additions & 0 deletions tools/taxonkit/test-data/name2taxid_result2.tsv
@@ -0,0 +1,4 @@
test Homo sapiens 9606
test Akkermansia muciniphila ATCC BAA-835 349741
test Akkermansia muciniphila 239935
test Mouse Intracisternal A-particle 11932
4 changes: 4 additions & 0 deletions tools/taxonkit/test-data/name2taxid_result3.tsv
@@ -0,0 +1,4 @@
Drosophila 7215
Drosophila 32281
Drosophila 2081351
Enterococcus coli 562
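The expected outputs show two behaviours: a name that matches several taxids produces one row per taxid, and a synonym such as "Enterococcus coli" resolves to 562 unless the scientific-name filter is on, in which case its taxid column stays empty (as in name2taxid_result1.tsv). The following is a minimal Python sketch of such a lookup against names.dmp. It is an assumption-laden illustration, not TaxonKit's code: `build_index`, `name2taxid`, and the sample rows are hypothetical, and the sketch uses exact string matching, which may differ from the real tool.

```python
from collections import defaultdict

# Illustrative names.dmp rows: taxid | name | unique name | name class |
SAMPLE_NAMES_DMP = (
    "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|\n"
    "562\t|\tEnterococcus coli\t|\t\t|\tsynonym\t|\n"
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
)


def build_index(names_dmp_text, sci_only=False):
    """Map each name to the list of taxids that carry it.

    With sci_only=True, only rows whose name class is
    'scientific name' are indexed (mimicking --sci-name).
    """
    index = defaultdict(list)
    for line in names_dmp_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        taxid, name, _unique_name, name_class = fields[:4]
        if sci_only and name_class != "scientific name":
            continue
        if taxid not in index[name]:
            index[name].append(taxid)
    return index


def name2taxid(names, index):
    """Return (name, taxid) rows; unmatched names keep an empty taxid."""
    rows = []
    for name in names:
        for taxid in index.get(name) or [""]:
            rows.append((name, taxid))
    return rows


queries = ["Enterococcus coli", "Homo sapiens"]
print(name2taxid(queries, build_index(SAMPLE_NAMES_DMP)))
print(name2taxid(queries, build_index(SAMPLE_NAMES_DMP, sci_only=True)))
```

Keeping unmatched names in the result with an empty taxid mirrors the note in the tool help that non-scientific names still appear in the output.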
5 changes: 5 additions & 0 deletions tools/taxonkit/test-data/name2taxid_test1.tsv
@@ -0,0 +1,5 @@
Homo sapiens
Akkermansia muciniphila ATCC BAA-835
Akkermansia muciniphila
Mouse Intracisternal A-particle
Enterococcus coli
4 changes: 4 additions & 0 deletions tools/taxonkit/test-data/name2taxid_test2.tsv
@@ -0,0 +1,4 @@
test Homo sapiens
test Akkermansia muciniphila ATCC BAA-835
test Akkermansia muciniphila
test Mouse Intracisternal A-particle
2 changes: 2 additions & 0 deletions tools/taxonkit/test-data/name2taxid_test3.txt
@@ -0,0 +1,2 @@
Drosophila
Enterococcus coli
1 change: 1 addition & 0 deletions tools/taxonkit/test-data/ncbi_taxonomy.loc.test
@@ -1 +1,2 @@
#value name path
test-db-tox Test Database ${__HERE__}/test-db
1 change: 1 addition & 0 deletions tools/taxonkit/test-data/test-db/delnodes.dmp
@@ -1,3 +1,4 @@

2923441 |
2923440 |
2923439 |
3 changes: 1 addition & 2 deletions tools/taxonkit/test-data/test-db/names.dmp
@@ -71,5 +71,4 @@
1 | all | | synonym |
1 | root | | scientific name |
131567 | biota | | synonym |
131567 | cellular organisms | | scientific name |

131567 | cellular organisms | | scientific name |
Binary file added tools/taxonkit/test-data/test.tar.gz
Binary file not shown.
2 changes: 1 addition & 1 deletion tools/taxonkit/tool-data/ncbi_taxonomy.loc.sample
@@ -1,2 +1,2 @@
#value name path
# test-db-tox "Test Database" tool-data/test-db
test-db-tox "Test Database" ${__HERE__}/test-db
2 changes: 1 addition & 1 deletion tools/taxonkit/tool_data_table_conf.xml.sample
@@ -3,6 +3,6 @@
<!-- Locations of taxonomy data downloaded from NCBI -->
<table name="ncbi_taxonomy" comment_char="#">
<columns>value, name, path</columns>
<file path="tool-data/ncbi_taxonomy.loc" />
<file path="${__HERE__}/test-data/ncbi_taxonomy.loc.test" />
</table>
</tables>
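The data table above declares three columns (value, name, path) and `#` as the comment character. The tool's cached-database branch reads the path column via `$ncbi.fields.path`. As a rough illustration of how a .loc file lines up with those columns, here is a minimal Python sketch. `parse_loc` is a hypothetical helper, the sample assumes tab-separated fields, and the sketch does not expand `${__HERE__}` the way Galaxy does.

```python
# Sketch only: reads a tab-separated .loc file into dicts keyed by the
# column names declared in the data table. Not Galaxy's own loader.

SAMPLE_LOC = (
    "#value\tname\tpath\n"
    "test-db-tox\tTest Database\t${__HERE__}/test-db\n"
)


def parse_loc(text, columns=("value", "name", "path"), comment_char="#"):
    """Return one dict per data line, skipping blanks and comment lines."""
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(comment_char):
            continue
        fields = line.split("\t")
        if len(fields) != len(columns):
            continue  # ignore rows that do not match the declared columns
        entries.append(dict(zip(columns, fields)))
    return entries


for entry in parse_loc(SAMPLE_LOC):
    print(entry["value"], "->", entry["path"])
```

Because the name column can contain spaces ("Test Database"), splitting on tabs rather than on arbitrary whitespace keeps the three fields intact.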