
metamlst merge

Moreno Zolfo edited this page Feb 13, 2020 · 7 revisions

Merging multiple outputs with metamlst-merge.py


This script cross-compares and merges multiple intermediate files from MetaMLST.py. The output is a list of the STs detected for each organism.

▸ Usage

usage: metamlst-merge.py [-h] -d DB_PATH [--filter species1,species2...]
                         [-z ED] [--meta METADATA_PATH] [--idField IDFIELD]
                         [--outseqformat {A,A+,B,B+,C,C+}]
                         [-j subjectID,diet,age...] [--jgroup]
                         folder

Detects the MLST profiles from a collection of intermediate files from MetaMLST.py

positional arguments:
  folder                Path to the folder containing .nfo MetaMLST.py files

optional arguments:
  -h, --help            show this help message and exit
  -d DB_PATH            MetaMLST SQLite Database File (created with metaMLST-index)
  --filter species1,species2...
                        Filter for specific set of organisms only (METAMLST-KEYs, comma separated. Use metaMLST-index.py --listspecies to get MLST keys)
  -z ED                 Maximum Edit Distance from the closest reference to call a new MLST allele. Default: 5
  --meta METADATA_PATH  Metadata file (CSV)
  --idField IDFIELD     Field number pointing to the 'sampleID' value in the metadata file
  --outseqformat {A,A+,B,B+,C,C+}
                        A  : Concatenated Fasta (Only Detected STs)
                        A+ : Concatenated Fasta (All STs)
                        B  : Single loci (Only New Loci)
                        B+ : Single loci (All loci)
                        C  : CSV STs Table [default]
  -j subjectID,diet,age...
                        Embed a LIST of metadata in the the output sequences (A or A+ outseqformat modes). Requires a comma separated list of field names from the metadata file specified with --meta
  --jgroup              Group the output sequences (A or A+ outseqformat modes) by ST, rather than by sample. Requires -j

  • -d: Specifies the database file (downloaded or created with MetaMLST-index).
  • --filter: Restricts the analysis to a subset of species, instead of all the species in the provided database. The species list must be entered as a comma-separated list of MLST keys. To get the full list of species, use the --listkeys option on the database with MetaMLST-index.
  • -z: Excludes potentially new alleles that are more than Z SNPs away from their closest reference in the database. This helps to lower background noise: some samples may include organisms similar to other MLST-trackable microbes (e.g. E. coli and Shigella), and "new" alleles may in fact belong to those. The default value is 5. Discarded new alleles (and thus the profiles that contain them) are shown on screen but not recorded.
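The effect of the -z threshold can be illustrated with a minimal Python sketch. This is not the MetaMLST implementation: the function names, the toy sequences, and the use of a simple mismatch count between aligned, equal-length sequences are all assumptions made for illustration.

```python
from typing import Sequence


def snp_distance(a: str, b: str) -> int:
    """Count mismatching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def keep_new_allele(candidate: str, references: Sequence[str], max_ed: int = 5) -> bool:
    """Return True if the candidate is at most max_ed SNPs from its closest reference."""
    closest = min(snp_distance(candidate, ref) for ref in references)
    return closest <= max_ed


# Toy reference alleles (hypothetical data, not taken from any MLST database).
references = ["ACGTACGTAC", "ACGTTCGTAC"]

print(keep_new_allele("ACGTACGTAA", references))  # one SNP from the closest reference
print(keep_new_allele("TTTTTTTTTT", references))  # far from every reference
```

With the default threshold of 5, the first candidate would be kept as a new allele and the second would be discarded.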

Metadata handling

  • --meta: Specifies a metadata file. MetaMLST-merge searches this file for the sample name and appends all the matching metadata to the report file, together with the typing information. If this option is not provided, the report file will contain only the filename.

    Note: MetaMLST-merge looks for the sample name without extension in the first column of the metadata file (e.g. SRS12345.bam should be indexed in the file as "SRS12345"). To specify a different column, use --idField (see below).

  • --idField: Specifies a different "sample name" column for the metadata file given with --meta. Column numbering starts from zero.
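The lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used by MetaMLST-merge: the helper names load_metadata and sample_name are hypothetical, the metadata file is assumed to be comma-separated with a header row, and the sample values are invented.

```python
import csv
import io
import os


def sample_name(path):
    """Strip the directory and the extension: 'SRS12345.bam' -> 'SRS12345'."""
    return os.path.splitext(os.path.basename(path))[0]


def load_metadata(handle, id_field=0):
    """Index metadata rows by the column at position id_field (zero-based)."""
    reader = csv.reader(handle)
    header = next(reader)  # assumption: the first row holds the field names
    return {row[id_field]: dict(zip(header, row)) for row in reader if row}


# Toy metadata (hypothetical values); a real file would be opened with open().
metadata_csv = io.StringIO(
    "sampleID,subjectID,diet\n"
    "SRS12345,S01,vegan\n"
    "SRS67890,S02,omnivore\n"
)
metadata = load_metadata(metadata_csv, id_field=0)

# The sample is found through its file name without extension.
print(metadata.get(sample_name("SRS12345.bam")))
```

Passing a different id_field would index the same rows by another column, which mirrors what --idField is described as doing.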

MLST Sequences output

  • --outseqformat: [A, A+, B, B+, C]

    MetaMLST-merge can output the sequences of the reconstructed MLST loci in three formats.

    • A Outputs a FASTA file containing one sequence per ST. Each sequence is the concatenation of the aligned MLST loci, which can be used to build a Phylogenetic tree.
    • A+ As A, but adds ALL the STs available in the database, in addition to the ones detected in your samples
    • B Outputs a FASTA file containing the sequences of the new loci (with an ID > 100000)
    • B+ As B, but adds ALL the loci for that organism, in addition to the new ones detected in your sample
    • C Outputs a CSV file, containing the sequences of the reconstructed loci for each sample, separated by locus (and not merged and aligned as it is for A)
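As an illustration of what format A describes (one concatenated sequence per ST), the following Python sketch joins the loci of each profile in a fixed order and prints FASTA records. The function name, the ST labels, the locus names, and the sequences are hypothetical; the real file layout produced by MetaMLST-merge may differ.

```python
def concatenated_fasta(profiles, locus_order):
    """Build one FASTA record per ST by joining its loci in a fixed order."""
    records = []
    for st, loci in profiles.items():
        sequence = "".join(loci[locus] for locus in locus_order)
        records.append(">{}\n{}".format(st, sequence))
    return "\n".join(records) + "\n"


# Toy data (hypothetical): two STs, three already-aligned loci each.
locus_order = ["locus1", "locus2", "locus3"]
profiles = {
    "ST_A": {"locus1": "ACGT", "locus2": "GGCC", "locus3": "TTAA"},
    "ST_B": {"locus1": "ACGA", "locus2": "GGCC", "locus3": "TTAG"},
}

print(concatenated_fasta(profiles, locus_order), end="")
```

Keeping the same locus order for every ST is what makes the concatenated sequences comparable column by column, for example as input to a tree-building tool.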

Sequences Grouping options

  • -j (Works with --outseqformat A and A+): Takes a comma-separated list of column names from the metadata file and embeds the corresponding values for each sample in the sequence IDs of the FASTA file. For example, with -j field1, each FASTA entry ID will include the field1 value associated with its sample.
  • --jgroup (Works with --outseqformat A and A+): Groups the entries of the FASTA file by Sequence Type rather than by sample. If two samples share the same ST, that ST is included only once in the file instead of twice, and the metadata information (-j) is a combination of the two samples' metadata. The correct usage of this command is: ... --meta METADATA_FILE --outseqformat A -j field1,field2 --jgroup
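The grouping behaviour described for -j and --jgroup can be sketched in Python. The ID layout (fields joined with "|", combined values joined with "/") and the function name are assumptions chosen for illustration; they are not the format MetaMLST-merge actually writes.

```python
def fasta_ids_by_st(samples, fields):
    """Build one FASTA ID per ST, combining the metadata of all samples sharing it.

    samples: list of (st, metadata_dict) pairs, one per sample.
    fields:  metadata field names to embed, as would be passed to -j.
    """
    combined = {}
    for st, meta in samples:
        per_field = combined.setdefault(st, {field: [] for field in fields})
        for field in fields:
            value = meta.get(field, "NA")
            if value not in per_field[field]:  # keep each distinct value once
                per_field[field].append(value)

    ids = []
    for st, per_field in combined.items():
        parts = ["ST{}".format(st)]
        parts += ["{}={}".format(field, "/".join(per_field[field])) for field in fields]
        ids.append("|".join(parts))
    return ids


# Toy data (hypothetical): two samples share ST 10, one sample has ST 73.
samples = [
    (10, {"subjectID": "S01", "diet": "vegan"}),
    (10, {"subjectID": "S02", "diet": "vegan"}),
    (73, {"subjectID": "S03", "diet": "omnivore"}),
]

for fasta_id in fasta_ids_by_st(samples, ["subjectID", "diet"]):
    print(fasta_id)
```

In this sketch the two samples with ST 10 collapse into a single entry whose subjectID field lists both subjects, which is the kind of combination the --jgroup description refers to.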

▸ Interface

MetaMLST-Output

The output of metamlst-merge.py reports, for each organism, the list of detected STs, distinguishing between:

  • Known Profiles (composed only of known alleles and included in the publicly available databases)
  • New Profiles that passed the stringency thresholds
    • Made of one (or more) new alleles, either recurrent or novel
    • New (i.e. previously unobserved) combinations of known alleles

The last column accounts for the number of samples harboring a specific ST.

▸ Output Files

The following output files are created for each species analysed. The files are written to INPUT_FOLDER/merged/

File | Type | Description
species1_report.txt | MetaMLST Report File | Contains the result of the aggregate analysis for all the samples, specifically for species1. This file contains
species1_ST.txt | MetaMLST ST File | Contains the new species1 ST table after the analysis (all the known profiles plus the new profiles detected in the samples).
species1_sequences.(fasta/csv) | MetaMLST SEQUENCES File | Contains the sequences of the reconstructed loci for species1. Created only if --outseqformat is specified (see above).