metamlst merge
This script allows multiple intermediate files from MetaMLST.py to be cross-compared and merged together. The output is a list
usage: metamlst-merge.py [-h] -d DB_PATH [--filter species1,species2...]
[-z ED] [--meta METADATA_PATH] [--idField IDFIELD]
[--outseqformat {A,A+,B,B+,C,C+}]
[-j subjectID,diet,age...] [--jgroup]
folder
Detects the MLST profiles from a collection of intermediate files from MetaMLST.py
positional arguments:
folder Path to the folder containing .nfo MetaMLST.py files
optional arguments:
-h, --help show this help message and exit
-d DB_PATH MetaMLST SQLite Database File (created with metaMLST-index)
--filter species1,species2...
Filter for specific set of organisms only (METAMLST-KEYs, comma separated. Use metaMLST-index.py --listspecies to get MLST keys)
-z ED Maximum Edit Distance from the closest reference to call a new MLST allele. Default: 5
--meta METADATA_PATH Metadata file (CSV)
--idField IDFIELD Field number pointing to the 'sampleID' value in the metadata file
--outseqformat {A,A+,B,B+,C,C+}
A : Concatenated Fasta (Only Detected STs)
A+ : Concatenated Fasta (All STs)
B : Single loci (Only New Loci)
B+ : Single loci (All loci)
C : CSV STs Table [default]
-j subjectID,diet,age...
Embed a LIST of metadata in the the output sequences (A or A+ outseqformat modes). Requires a comma separated list of field names from the metadata file specified with --meta
--jgroup Group the output sequences (A or A+ outseqformat modes) by ST, rather than by sample. Requires -j
- -d: Specifies the database file (downloaded or created with MetaMLST-index).
- --filter: Restricts the analysis to a subset of species instead of all the species in the provided database. The species list must be entered as a comma-separated list of MLST keys. To get the full list of species, use the --listspecies option of MetaMLST-index on the database.
- -z: Excludes potentially new alleles that are more than Z SNPs away from their closest reference in the database. This helps lower background noise, as some samples may include organisms similar to other MLST-trackable microbes (e.g. E. coli and Shigella), and "new" alleles may in fact belong to those. The default value is 5. Discarded new alleles (and thus the profiles that contain them) are shown on screen but not recorded.
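The thresholding idea behind -z can be illustrated with a minimal sketch. This is not MetaMLST's own code: it assumes candidate and reference alleles are already aligned to the same length and counts mismatching positions as the distance. The function names `hamming` and `passes_edit_threshold` are hypothetical.

```python
from typing import Sequence


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def passes_edit_threshold(candidate: str, references: Sequence[str],
                          max_distance: int = 5) -> bool:
    """Return True if the candidate is within max_distance mismatches
    of its closest reference (assumption: simple mismatch count)."""
    closest = min(hamming(candidate, ref) for ref in references)
    return closest <= max_distance


# Toy reference alleles, invented for illustration only.
references = ["ACGTACGTAC", "ACGTTCGTAC"]

print(passes_edit_threshold("ACGTACGTAA", references))     # close to a reference
print(passes_edit_threshold("TTTTTTTTTT", references, 5))  # far from both
```

With the default of 5, the first candidate would be kept as a potential new allele and the second would be discarded.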
Metadata handling
- --meta: Specifies a metadata file. MetaMLST-merge looks up each sample name in this file and appends the associated information to the report file, together with the typing information. If this option is not provided, the report file will contain only the filename.
Note: MetaMLST-merge looks for the sample name without extension in the first column of the metadata file (i.e. SRS12345.bam should be indexed in the file as "SRS12345"). To specify a different column, use --idField (see below).
- --idField: Specifies a different "sample name" column for the metadata file given with --meta. Column numbering starts from zero.
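The lookup described above can be sketched as follows. This is an illustration rather than the tool's implementation: it assumes a comma-delimited metadata file, and the helper names `load_metadata` and `sample_key` as well as the sample data are hypothetical.

```python
import csv
import io
import os


def load_metadata(handle, id_field=0):
    """Index metadata rows by the value in column id_field (zero-based)."""
    return {row[id_field]: row for row in csv.reader(handle) if row}


def sample_key(filename):
    """Strip directory and extension, e.g. 'SRS12345.bam' -> 'SRS12345'."""
    return os.path.splitext(os.path.basename(filename))[0]


# Hypothetical metadata content; the sample ID sits in column 1 here,
# which corresponds to passing --idField 1.
metadata_text = "diet,sampleID,age\nvegan,SRS12345,34\nomnivore,SRS67890,51\n"
metadata = load_metadata(io.StringIO(metadata_text), id_field=1)

print(metadata.get(sample_key("SRS12345.bam")))
```

If the sample name is not found, `metadata.get` returns `None`, which mirrors the case where only the filename can be reported.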
MLST Sequences output
- --outseqformat: [A, A+, B, B+, C]
MetaMLST-merge can output the sequences of the reconstructed MLST loci in three formats:
- A: Outputs a FASTA file containing one sequence per ST. Each sequence is the concatenation of the aligned MLST loci, which can be used to build a phylogenetic tree.
- A+: As A, but adds ALL the STs available in the database, in addition to the ones detected in your samples.
- B: Outputs a FASTA file containing the sequences of the new loci (with an ID > 100000).
- B+: As B, but adds ALL the loci for that organism, in addition to the new ones detected in your samples.
- C: Outputs a CSV file containing the sequences of the reconstructed loci for each sample, separated by locus (rather than merged and aligned as in A).
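To make the idea behind format A concrete, here is a minimal sketch of building one concatenated FASTA record per ST. It is not the tool's code: the locus sequences are assumed to be already aligned, and the header layout (`>ST_<id>`), the function name `concatenated_fasta`, and the toy data are hypothetical.

```python
def concatenated_fasta(profiles, locus_order):
    """Build one FASTA record per ST by joining its locus sequences
    in a fixed locus order."""
    records = []
    for st, loci in profiles.items():
        sequence = "".join(loci[locus] for locus in locus_order)
        records.append(f">ST_{st}\n{sequence}")
    return "\n".join(records) + "\n"


# Toy profiles, invented for illustration: ST -> {locus name: sequence}.
profiles = {
    10: {"adk": "ACGT", "fumC": "GGCC"},
    100001: {"adk": "ACGA", "fumC": "GGCC"},
}

print(concatenated_fasta(profiles, ["adk", "fumC"]), end="")
```

Keeping the locus order fixed across all STs is what makes the concatenated sequences comparable column by column when building a tree.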
**Sequence grouping options**
- -j (works with --outseqformat A and A+): Takes a list of column names from the metadata file and embeds the values associated with each sample in the sequence IDs of the FASTA file. For example, with -j field1, each FASTA entry ID will include the value of field1 for that sample.
- --jgroup (works with --outseqformat A and A+): Groups the entries of the FASTA file by sequence type rather than by sample. If two samples share the same ST, that ST is included only once in the file instead of twice, and the metadata information (-j) is a combination of the two samples' metadata. The correct usage of this option is:
... --meta METADATA_FILE --outseqformat A -j field1,field2 --jgroup
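The grouping behaviour can be pictured with the sketch below. How MetaMLST-merge actually combines the metadata of samples sharing an ST is not specified here, so collecting the distinct values per field is an assumption; the function name `group_by_st` and the sample records are hypothetical.

```python
from collections import defaultdict


def group_by_st(samples, fields):
    """Collapse samples that share an ST and collect the distinct
    values of each requested metadata field (assumed merge strategy)."""
    grouped = defaultdict(lambda: {field: [] for field in fields})
    for sample in samples:
        bucket = grouped[sample["st"]]
        for field in fields:
            value = sample["metadata"][field]
            if value not in bucket[field]:
                bucket[field].append(value)
    return dict(grouped)


# Toy samples, invented for illustration.
samples = [
    {"st": 10, "metadata": {"diet": "vegan", "age": "34"}},
    {"st": 10, "metadata": {"diet": "omnivore", "age": "34"}},
    {"st": 73, "metadata": {"diet": "vegan", "age": "51"}},
]

for st, merged in group_by_st(samples, ["diet", "age"]).items():
    print(st, merged)
```

Here the two samples with ST 10 end up as a single entry carrying both diets, which corresponds to one FASTA record instead of two.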
▲ The output of metamlst-merge.py: for each organism the list of detected STs is reported, distinguishing by:
- Known profiles (*composed only of known alleles and included in the publicly available databases*)
- New profiles that passed the stringency thresholds
  - Made of one (or more) new alleles, either recurrent or novel
  - New (i.e. previously unobserved) combinations of known alleles
The last column accounts for the number of samples harboring a specific ST.
For each species analysed, the following output files are created in INPUT_FOLDER/merged/:
File | Type | Description |
---|---|---|
species1_report.txt | MetaMLST Report File | Contains the result of aggregate analysis for all the samples, specifically for species1. This file contains |
species1_ST.txt | MetaMLST ST File | Contains the new species1 ST table after the analysis (all the known profiles plus the new profiles detected in the samples). |
species1_sequences.(fasta/csv) | MetaMLST SEQUENCES File | Contains the sequences of the reconstructed loci for species1. Created only if --outseqformat is specified (see above). |
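When scripting around these outputs, the expected paths can be derived from the table above. The sketch below assumes that "species1" in the file names stands for the species key; the helper name `expected_outputs` and the example folder and key are hypothetical.

```python
import os


def expected_outputs(input_folder, species, seq_extension=None):
    """List the per-species files expected under INPUT_FOLDER/merged/.

    seq_extension is 'fasta' or 'csv' when --outseqformat was used,
    or None when no sequences file was requested.
    """
    merged = os.path.join(input_folder, "merged")
    names = [f"{species}_report.txt", f"{species}_ST.txt"]
    if seq_extension is not None:
        names.append(f"{species}_sequences.{seq_extension}")
    return [os.path.join(merged, name) for name in names]


# Hypothetical input folder and species key.
for path in expected_outputs("out", "ecoli", seq_extension="fasta"):
    print(path, "(present)" if os.path.exists(path) else "(missing)")
```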
MetaMLST is a project of the Computational Metagenomics Lab at CIBIO, University of Trento, Italy.
M. Zolfo, A. Tett, O. Jousson, C. Donati and N. Segata - MetaMLST: multi-locus strain-level bacterial typing from metagenomic samples - Nucleic Acids Research, 2016 DOI: 10.1093/nar/gkw837