Incorporating taxonomic information into sylph with sylph‐tax
Note
This manual uses sylph-tax, which replaces the old sylph-utils program for taxonomy integration. The old manual for sylph-utils is available here.
Sylph's TSV outputs contain no taxonomic information. However, the sylph-tax program can convert sylph's output into a taxonomic profile (with taxonomic annotations).
See the sylph-tax repository for more information. For a quick start:
conda install -c bioconda sylph-tax
# download taxonomies
sylph-tax download --download-to /any/location
# profiling with GTDB-r220
sylph profile gtdb-r220-c200-dbv1.syldb ... -o sylph_results/out.tsv
# incorporate GTDB-r220 taxonomy into sylph's results
sylph-tax taxprof sylph_results/*.tsv -t GTDB_r220
ls *.sylphmpa
*.sylphmpa files look like this:
#SampleID /home/jshaw/projects/temp/amr/short_reads/SRR14739086_1.fastq.gz Taxonomies_used:['GTDB_r220']
clade_name relative_abundance sequence_abundance ANI (if strain-level) Coverage (if strain-level)
d__Bacteria 100.00010000000003 100.00019999999996 NA NA
d__Bacteria|p__Pseudomonadota 100.00010000000003 100.00019999999996 NA NA
d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria 100.00010000000003 100.00019999999996 NA NA
d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales 35.6384 36.0603 NA NA
....
Tip
This is a valid TSV file, but rows prefixed with # are comments. You can read .sylphmpa files with pandas in Python like pd.read_csv('output.sylphmpa', sep='\t', comment='#').
There are five important columns:
- clade_name: A string like d__Bacteria|p__Actinomycetota|c__Acidimicrobiia|o__Acidimicrobiales|f__Ilumatobacteraceae that describes the clade. t__STRAIN represents the exact genome identifier.
- relative_abundance: The taxonomic relative abundance of the clade.
- sequence_abundance: The sequence abundance of the clade, i.e. the % of reads assigned.
- ANI: sylph's ANI estimate at the strain level (t__STRAIN); NA at every other level.
- Coverage: The Eff_cov or True_cov column of sylph's output.
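As a complement to the pandas one-liner, here is a minimal standard-library sketch for picking out rows at a single taxonomic level. It assumes the file is tab-separated with the header row shown above, and that the rank of a row is the prefix of the last non-empty level in clade_name. The function names (read_sylphmpa, deepest_rank, rows_at_rank) are hypothetical helpers and not part of sylph-tax.

```python
import csv
import io


def read_sylphmpa(handle):
    """Return the data rows of a .sylphmpa file as a list of dicts.

    Lines starting with '#' are treated as comments and skipped.
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t"))


def deepest_rank(clade_name):
    """Return the rank prefix (e.g. 'o' or 't') of the last non-empty level."""
    # Viral lineages can contain empty levels ('|||'), so drop them first.
    levels = [level for level in clade_name.split("|") if level]
    return levels[-1].split("__", 1)[0]


def rows_at_rank(rows, rank):
    """Keep only the rows whose deepest level has the given rank prefix."""
    return [row for row in rows if deepest_rank(row["clade_name"]) == rank]


# Inline sample copied from the example output above (assumed tab-separated).
SAMPLE = "\n".join([
    "#SampleID\t/home/jshaw/projects/temp/amr/short_reads/SRR14739086_1.fastq.gz\tTaxonomies_used:['GTDB_r220']",
    "clade_name\trelative_abundance\tsequence_abundance\tANI (if strain-level)\tCoverage (if strain-level)",
    "d__Bacteria\t100.00010000000003\t100.00019999999996\tNA\tNA",
    "d__Bacteria|p__Pseudomonadota\t100.00010000000003\t100.00019999999996\tNA\tNA",
    "d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria\t100.00010000000003\t100.00019999999996\tNA\tNA",
    "d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales\t35.6384\t36.0603\tNA\tNA",
]) + "\n"

rows = read_sylphmpa(io.StringIO(SAMPLE))
for row in rows_at_rank(rows, "o"):
    print(row["clade_name"], row["relative_abundance"])
```

For a real file, pass an open file handle instead of the io.StringIO wrapper. Values are left as strings, so convert with float() where needed and handle NA separately.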
Tip
Viral-host information is available for IMG/VR 4.1. The -a option adds a new column in the .sylphmpa files associating viral genomes to their hosts. For example:
r__Duplodnaviria|k__Heunggongvirae|p__Uroviricota|c__Caudoviricetes|||||t__IMGVR_UViG_2503982007_000001 ... d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus epidermidis
Here, the host of IMGVR_UViG_2503982007_000001 is Staphylococcus epidermidis.
If you're working with custom sylph databases, you can easily create your own taxonomy metadata file. You can look at our pre-built taxonomy files (https://zenodo.org/records/14320496) for examples.
A taxonomic metadata file is simply a two-column TSV file:
- Column 1: the name of your genome's FASTA file, e.g. my_mag.fa
- Column 2: a semicolon-delimited taxonomy string, e.g. d__Archaea;p__Methanobacteriota_B;c__Thermococci;o__Thermococcales;f__Thermococcaceae;g__Thermococcus_A;s__Thermococcus_A alcaliphilus
Note: do not add the t__STRAIN level to the taxonomy string.
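The two-column layout can also be generated from a script. The sketch below is one possible way to do it, not an official tool: write_taxonomy_tsv is a hypothetical helper, and the check that rejects a t__ level is an assumption based on the note above.

```python
import csv
import io


def write_taxonomy_tsv(handle, genome_to_taxonomy):
    """Write 'FASTA file name <TAB> taxonomy string' rows to an open handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for fasta_name, taxonomy in genome_to_taxonomy.items():
        levels = taxonomy.split(";")
        # Assumption: a strain level (t__...) must not appear in the string.
        if any(level.startswith("t__") for level in levels):
            raise ValueError(f"{fasta_name}: remove the t__ level from the taxonomy")
        writer.writerow([fasta_name, taxonomy])


# Example entry taken from the description above.
taxonomy = {
    "my_mag.fa": (
        "d__Archaea;p__Methanobacteriota_B;c__Thermococci;o__Thermococcales;"
        "f__Thermococcaceae;g__Thermococcus_A;s__Thermococcus_A alcaliphilus"
    ),
}

buffer = io.StringIO()
write_taxonomy_tsv(buffer, taxonomy)
print(buffer.getvalue(), end="")
```

To write to disk, open the target with open("taxonomy.tsv", "w", newline="") and pass that handle instead of the in-memory buffer.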
Suppose you obtained two new MAGs, genome1.fa and genome2.fa, and ran GTDB-Tk to get their taxonomic annotations. You want to profile against both the new MAGs and the GTDB database.
- Create a file called taxonomy.tsv as follows (the two columns are separated by a tab):
genome1.fa	d__Archaea;(...);s__My new species name
genome2.fa	d__Bacteria;(...);g__My genus name;s__My species name2
- Use taxonomy.tsv as an argument to sylph-tax taxprof.
## profile against gtdb_r220 and your new MAGs
sylph profile gtdb_r220.syldb my_custom_mags.syldb ... -o gtdb+mags_output.tsv
## use your new taxonomy.tsv file and GTDB_r220
sylph-tax taxprof gtdb+mags_output.tsv -t GTDB_r220 taxonomy.tsv
Warning
For GenBank/RefSeq genomes, filenames have to be dealt with carefully. If _genomic or _ASM is in your genome file name, use the part before _genomic or _ASM. So for GCF_002863645.1_ASM286364v1_genomic.fna.gz, use GCF_002863645.1 in column 1.
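The filename rule above can be sketched as a small function. This is an illustration only: column1_name is a hypothetical helper, and stripping any leading directory and leaving names without either marker unchanged are assumptions, not documented sylph-tax behavior.

```python
import os


def column1_name(path):
    """Return the column-1 entry for a genome file.

    If '_genomic' or '_ASM' occurs in the file name, keep only the part
    before the earliest occurrence; otherwise return the name unchanged.
    """
    filename = os.path.basename(path)  # assumption: directories are not part of the name
    positions = [filename.find(marker) for marker in ("_genomic", "_ASM")]
    positions = [pos for pos in positions if pos != -1]
    return filename[:min(positions)] if positions else filename


print(column1_name("GCF_002863645.1_ASM286364v1_genomic.fna.gz"))
print(column1_name("my_mag.fa"))
```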