profile2cami not working as expected #99

Closed
paulzierep opened this issue Jul 2, 2024 · 6 comments

Comments

@paulzierep

When the input profile contains taxa at multiple ranks, the abundances of the lower ranks are summed up into the higher ranks, e.g.:

2	1	
1224	0.7	
1236	0.3	

Leads to:

@SampleID:
@Version:0.10.0
@Ranks:superkingdom|phylum|class|order|family|genus|species|strain
@TaxonomyID:
@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE
2 superkingdom 2 Bacteria 200.000000000000000
1224 phylum 2|1224 Bacteria|Pseudomonadota 100.000000000000000
1236 class 2|1224|1236 Bacteria|Pseudomonadota|Gammaproteobacteria 30.000000000000000

That's not how the CAMI Biobox format is defined. https://github.com/CAMI-challenge/contest_information/blob/master/file_formats/CAMI_TP_specification.mkd states: "The percentages given for all taxa from the same rank should sum up to <= 100%."
So, in my understanding, the normalization should happen per rank.
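
The per-rank rule quoted above can be checked mechanically. The following is a minimal sketch, not part of taxonkit or the CAMI tooling: the function names `rank_totals` and `ranks_over_100` are hypothetical, and it assumes a tab-delimited profile whose second column is the rank and whose last column is the percentage. The sample text is the output reported above, rewritten with tabs.

```python
from collections import defaultdict


def rank_totals(profile_text):
    """Sum the PERCENTAGE column per rank in a CAMI-style profile."""
    totals = defaultdict(float)
    for line in profile_text.splitlines():
        line = line.strip()
        # Skip blank lines, "@key:value" headers and the "@@TAXID ..." column header.
        if not line or line.startswith("@"):
            continue
        fields = line.split("\t")
        rank, percentage = fields[1], float(fields[-1])
        totals[rank] += percentage
    return dict(totals)


def ranks_over_100(profile_text, tolerance=1e-6):
    """Return the ranks whose percentages sum to more than 100."""
    totals = rank_totals(profile_text)
    return sorted(rank for rank, total in totals.items() if total > 100 + tolerance)


# The output reported in this issue, with tab-separated columns.
REPORTED = "\n".join([
    "@SampleID:",
    "@Version:0.10.0",
    "@Ranks:superkingdom|phylum|class|order|family|genus|species|strain",
    "@TaxonomyID:",
    "@@TAXID\tRANK\tTAXPATH\tTAXPATHSN\tPERCENTAGE",
    "2\tsuperkingdom\t2\tBacteria\t200.000000000000000",
    "1224\tphylum\t2|1224\tBacteria|Pseudomonadota\t100.000000000000000",
    "1236\tclass\t2|1224|1236\tBacteria|Pseudomonadota|Gammaproteobacteria\t30.000000000000000",
])

print(ranks_over_100(REPORTED))
```

For the reported output, only the superkingdom rank exceeds 100%, which is the violation described in this comment.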

@shenwei356
Owner

This tool was originally designed to convert profile tables with only leaf taxa. In your example, only 1236 should be given.

Input format: 
  1. The input file should be tab-delimited
  2. At least two columns needed:
     a) TaxId of taxon at species or lower rank.
     b) Abundance (could be percentage, automatically detected or use -p/--percentage)

However, some classification tools do not always assign TaxIds as low as the species rank, and this tool can't handle their results for now.

There's another script that could handle this, but it is no longer available ...
https://raw.githubusercontent.com/hzi-bifo/cami2_pipelines/master/bin/tocami.py
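
To see where the 200 / 100 / 30 figures in the report come from, here is a small sketch of "sum up from child to parent" applied to every input row as if it were a leaf. It is an illustration that reproduces the reported numbers, not taxonkit's actual implementation; the `PARENT` map and the `sum_up` function are hypothetical and cover only the three TaxIds from the example.

```python
# Hypothetical parent map for the three taxa in the example:
# 1236 (Gammaproteobacteria) -> 1224 (Pseudomonadota) -> 2 (Bacteria).
PARENT = {1236: 1224, 1224: 2, 2: None}


def sum_up(abundances):
    """Add each row's abundance to the taxon itself and to all of its ancestors."""
    totals = {}
    for taxid, value in abundances.items():
        node = taxid
        while node is not None:
            totals[node] = totals.get(node, 0.0) + value
            node = PARENT[node]
    return totals


# Input from the report: abundances given at three ranks of the same lineage.
profile = {2: 1.0, 1224: 0.7, 1236: 0.3}

# Convert fractions to percentages; rounding hides floating-point noise.
percentages = {taxid: round(value * 100, 6) for taxid, value in sum_up(profile).items()}
print(percentages)
```

Because 1224 and 1236 are each counted again in their ancestors, TaxId 2 ends up at 200% and 1224 at 100%. If only the leaf 1236 is given, every rank receives the same single value and no double counting occurs.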

@shenwei356
Owner

OK, I'll improve it, tonight or tomorrow.

@paulzierep
Author

paulzierep commented Jul 2, 2024

Ideally I could support output from taxpasta, that would allow us to channel all profile outputs through this tool for usage with OPAL.
Thank you, by the way, for the quick reply!

@shenwei356
Owner

OK, it's fixed by adding a new flag.

  -S, --no-sum-up             do not sum up abundance from child to parent TaxIds

Please try the binary and tell me if you have any further issues (I tested it, and it's compatible with previous behaviour).

Your example:

$ taxonkit profile2cami t.tsv  -S
@SampleID:
@Version:0.10.0
@Ranks:superkingdom|phylum|class|order|family|genus|species|strain
@TaxonomyID:
@@TAXID RANK    TAXPATH TAXPATHSN       PERCENTAGE
2       superkingdom    2       Bacteria        100.000000000000000
1224    phylum  2|1224  Bacteria|Pseudomonadota 70.000000000000000
1236    class   2|1224|1236     Bacteria|Pseudomonadota|Gammaproteobacteria     30.000000000000000

Another one I made:

$ cat example/abundance2.tsv
2       0.99
1224    0.59
1236    0.2
28211   0.4
1239    0.4
91061   0.39
2759    0.01
9606    0.01

$ taxonkit profile2cami example/abundance2.tsv -S
@SampleID:
@Version:0.10.0
@Ranks:superkingdom|phylum|class|order|family|genus|species|strain
@TaxonomyID:
@@TAXID RANK    TAXPATH TAXPATHSN       PERCENTAGE
2       superkingdom    2       Bacteria        99.000000000000000
2759    superkingdom    2759    Eukaryota       1.000000000000000
1224    phylum  2|1224  Bacteria|Pseudomonadota 59.000000000000000
1239    phylum  2|1239  Bacteria|Bacillota      40.000000000000000
7711    phylum  2759|7711       Eukaryota|Chordata      1.000000000000000
28211   class   2|1224|28211    Bacteria|Pseudomonadota|Alphaproteobacteria     40.000000000000000
91061   class   2|1239|91061    Bacteria|Bacillota|Bacilli      39.000000000000000
1236    class   2|1224|1236     Bacteria|Pseudomonadota|Gammaproteobacteria     20.000000000000000
40674   class   2759|7711|40674 Eukaryota|Chordata|Mammalia     1.000000000000000
9443    order   2759|7711|40674|9443    Eukaryota|Chordata|Mammalia|Primates    1.000000000000000
9604    family  2759|7711|40674|9443|9604       Eukaryota|Chordata|Mammalia|Primates|Hominidae  1.000000000000000
9605    genus   2759|7711|40674|9443|9604|9605  Eukaryota|Chordata|Mammalia|Primates|Hominidae|Homo     1.000000000000000
9606    species 2759|7711|40674|9443|9604|9605|9606     Eukaryota|Chordata|Mammalia|Primates|Hominidae|Homo|Homo sapiens        1.000000000000000
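
One reading of the `-S` output above is that TaxIds present in the input keep their stated abundance, while ancestors missing from the input (7711 down to 9605 here) are filled in from the taxa below them. The sketch below implements that reading only. It is inferred from this single example and is an assumption, not the tool's documented algorithm; the `PARENT` map is a hypothetical, simplified lineage restricted to the ranks shown in the output, and `fill_without_summing` is a made-up name.

```python
# Hypothetical, simplified lineages limited to the ranks shown in the output above.
PARENT = {
    1236: 1224, 28211: 1224, 1224: 2,
    91061: 1239, 1239: 2, 2: None,
    9606: 9605, 9605: 9604, 9604: 9443, 9443: 40674,
    40674: 7711, 7711: 2759, 2759: None,
}


def fill_without_summing(abundances):
    """Keep given taxa as-is; fill only the ancestors that are absent from the input."""
    result = dict(abundances)
    for taxid, value in abundances.items():
        node = PARENT[taxid]
        # Walk upwards until reaching the root or an ancestor that has its own value.
        while node is not None and node not in abundances:
            result[node] = result.get(node, 0.0) + value
            node = PARENT[node]
    return result


# Contents of example/abundance2.tsv from the comment above.
profile = {
    2: 0.99, 1224: 0.59, 1236: 0.2, 28211: 0.4,
    1239: 0.4, 91061: 0.39, 2759: 0.01, 9606: 0.01,
}

# Convert fractions to percentages; rounding hides floating-point noise.
filled = {taxid: round(value * 100, 6) for taxid, value in fill_without_summing(profile).items()}
for taxid, percentage in filled.items():
    print(taxid, percentage)
```

Under this assumption the sketch yields the same 13 TaxIds and percentages as the output above, with every rank summing to at most 100%.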

@paulzierep
Author

Hi @shenwei356, thank you for the very quick fix; this works for our purpose. Could you make a new release for this and update the bioconda recipe? We would like to add the tool to Galaxy (galaxyproject/tools-iuc#6085) and therefore need the conda build. We would highly appreciate it; the addition to Galaxy will increase the visibility and usage of your tool.

@shenwei356
Owner
