Incorporating taxonomic information into sylph with sylph‐tax

Note

This manual uses sylph-tax, which replaces the old sylph-utils program for taxonomy integration. The old manual for sylph-utils is available here.

Sylph's TSV output contains no taxonomic information. The sylph-tax program converts that output into a taxonomic profile with taxonomic annotations.

How to generate taxonomic profiles using sylph-tax

See the sylph-tax repository for more information. For a quick start:

conda install -c bioconda sylph-tax

# download taxonomies
sylph-tax download --download-to /any/location

# profiling with GTDB-r220
sylph profile gtdb-r220-c200-dbv1.syldb ... -o sylph_results/out.tsv

# incorporate GTDB-r220 taxonomy into sylph's results
sylph-tax taxprof sylph_results/*.tsv -t GTDB_r220 

ls *.sylphmpa

`.sylphmpa` taxonomic profiling output format

*.sylphmpa files look like this:

#SampleID       /home/jshaw/projects/temp/amr/short_reads/SRR14739086_1.fastq.gz        Taxonomies_used:['GTDB_r220']
clade_name      relative_abundance      sequence_abundance      ANI (if strain-level)    Coverage (if strain-level)
d__Bacteria     100.00010000000003      100.00019999999996      NA      NA
d__Bacteria|p__Pseudomonadota   100.00010000000003      100.00019999999996      NA      NA
d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria    100.00010000000003      100.00019999999996      NA      NA
d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales        35.6384      36.0603      NA      NA
....

Tip

A .sylphmpa file is a valid TSV file in which rows prefixed with # are comments. You can read it with pandas in Python, for example pd.read_csv('output.sylphmpa', sep='\t', comment='#').

There are five important columns:

clade_name: A string like d__Bacteria|p__Actinomycetota|c__Acidimicrobiia|o__Acidimicrobiales|f__Ilumatobacteraceae that describes the clade. t__STRAIN represents the exact genome identifier.
relative_abundance: the taxonomic relative abundance of the clade
sequence_abundance: the sequence abundance of the clade, i.e. the % of reads assigned
ANI: sylph's ANI estimate at the strain level (t__STRAIN); NA at every other level.
Coverage: the Eff_cov or True_cov column of sylph's output at the strain level; NA at every other level.
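
The column layout above can also be parsed with only the Python standard library, which keeps the sample path from the #SampleID line that a pandas read with comment='#' discards. This is a minimal sketch and not part of sylph-tax: the function names read_sylphmpa and rows_at_rank are hypothetical, the example values are made up, and the file is assumed to be tab-separated with the layout shown above.

```python
import csv
import io


def read_sylphmpa(handle):
    """Parse a .sylphmpa stream into (sample_id, list of row dicts).

    Assumes the layout shown above: a '#SampleID' comment line, a header
    line starting with 'clade_name', then tab-separated data rows.
    """
    sample_id = None
    data_lines = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            fields = line.split("\t")
            if fields[0] == "#SampleID" and len(fields) > 1:
                sample_id = fields[1]
            continue
        data_lines.append(line)
    reader = csv.DictReader(
        io.StringIO("\n".join(data_lines)),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in names literally
    )
    return sample_id, list(reader)


def rows_at_rank(rows, prefix):
    """Keep rows whose deepest clade level starts with prefix, e.g. 'p__'."""
    return [r for r in rows if r["clade_name"].split("|")[-1].startswith(prefix)]


# Illustrative input only; the sample name and abundances are invented.
example = (
    "#SampleID\tsample_1.fastq.gz\tTaxonomies_used:['GTDB_r220']\n"
    "clade_name\trelative_abundance\tsequence_abundance\t"
    "ANI (if strain-level)\tCoverage (if strain-level)\n"
    "d__Bacteria\t100.0\t100.0\tNA\tNA\n"
    "d__Bacteria|p__Pseudomonadota\t100.0\t100.0\tNA\tNA\n"
)
sample, rows = read_sylphmpa(io.StringIO(example))
phyla = rows_at_rank(rows, "p__")
print(sample, [r["clade_name"] for r in phyla])
```

Filtering on the last |-separated level is one simple way to pull out a single rank, because each row repeats the full lineage down to its own level.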

Tip

Viral-host information is available for IMG/VR 4.1. The -a option adds a new column to the .sylphmpa files that associates viral genomes with their hosts. For example:

r__Duplodnaviria|k__Heunggongvirae|p__Uroviricota|c__Caudoviricetes|||||t__IMGVR_UViG_2503982007_000001 ... d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus epidermidis

where the host of IMGVR_UViG_2503982007_000001 is Staphylococcus epidermidis.

Creating custom taxonomies

If you're working with custom sylph databases, you can easily create your own taxonomy metadata file. You can look at our pre-built taxonomy files (https://zenodo.org/records/14320496) for examples.

A taxonomic metadata file is simply a two-column TSV file:

Column 1: the name of your genome's FASTA file:
- my_mag.fa
Column 2: a semicolon-delimited taxonomy string.
- d__Archaea;p__Methanobacteriota_B;c__Thermococci;o__Thermococcales;f__Thermococcaceae;g__Thermococcus_A;s__Thermococcus_A alcaliphilus

Note: do not add a t__STRAIN level to the taxonomy string.
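
A file in this format can be written with a few lines of Python. The sketch below is illustrative and not part of sylph-tax: write_taxonomy_tsv is a hypothetical helper, and its check that rejects t__ levels simply mirrors the note above.

```python
import io


def write_taxonomy_tsv(entries, handle):
    """Write (fasta_file_name, taxonomy_string) pairs as a two-column TSV."""
    for file_name, taxonomy in entries:
        # A tab inside either field would break the two-column layout.
        if "\t" in file_name or "\t" in taxonomy:
            raise ValueError(f"{file_name}: fields must not contain tabs")
        # Per the note above, the taxonomy string must not carry a t__ level.
        if any(level.startswith("t__") for level in taxonomy.split(";")):
            raise ValueError(f"{file_name}: remove the t__ level")
        handle.write(f"{file_name}\t{taxonomy}\n")


entries = [
    (
        "my_mag.fa",
        "d__Archaea;p__Methanobacteriota_B;c__Thermococci;o__Thermococcales;"
        "f__Thermococcaceae;g__Thermococcus_A;s__Thermococcus_A alcaliphilus",
    ),
]
buffer = io.StringIO()
write_taxonomy_tsv(entries, buffer)
taxonomy_tsv_text = buffer.getvalue()
print(taxonomy_tsv_text, end="")
```

To write to disk instead, pass a file opened with open("taxonomy.tsv", "w") in place of the in-memory buffer.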

Custom taxonomy example usage case

You obtained two new MAGs, genome1.fa and genome2.fa, and ran GTDB-Tk to get their taxonomic annotations. You want to profile against both the new MAGs and the GTDB database.

Create a tab-separated file called taxonomy.tsv as follows:

genome1.fa	d__Archaea;(...);s__My new species name
genome2.fa	d__Bacteria;(...);g__My genus name;s__My species name2

Use taxonomy.tsv as an argument to sylph-tax taxprof.

## profile against gtdb_r220 and your new MAGs
sylph profile gtdb_r220.syldb my_custom_mags.syldb ... -o gtdb+mags_output.tsv

## use your new taxonomy.tsv file and GTDB_r220
sylph-tax taxprof gtdb+mags_output.tsv -t GTDB_r220 taxonomy.tsv

Warning

For GenBank/RefSeq genomes, file names need careful handling.

If _genomic or _ASM is in your genome file name, use the part before _genomic or _ASM.

So for GCF_002863645.1_ASM286364v1_genomic.fna.gz, use GCF_002863645.1 in column 1.
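
The rule above can be sketched in Python as follows. The helper name refseq_key is hypothetical and this is not sylph-tax's own code; it only applies the stated rule of cutting the file name at the first _genomic or _ASM, and leaves other file names unchanged.

```python
import os


def refseq_key(path):
    """Return the column-1 name for a genome file, following the rule above."""
    name = os.path.basename(path)
    # Positions of each marker that actually occurs in the file name.
    cut_points = [name.find(marker) for marker in ("_genomic", "_ASM")]
    cut_points = [i for i in cut_points if i != -1]
    # Cut at the earliest marker; names without a marker are kept whole.
    return name[: min(cut_points)] if cut_points else name


print(refseq_key("GCF_002863645.1_ASM286364v1_genomic.fna.gz"))  # GCF_002863645.1
print(refseq_key("my_mag.fa"))  # my_mag.fa
```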

Creating taxonomy metadata from RefSeq?

See this discussion thread.
