metamlst index

Moreno Zolfo edited this page Feb 13, 2020 · 7 revisions

Building a custom DB with metamlst-index.py


MetaMLST-index builds and manages the internal MetaMLST SQLite database. Specifically, it:

  • creates a new database
  • updates a database with additional MLST-loci sequences and MLST Sequence Types
  • creates Bowtie2 indexes from a database

▸ Usage

usage: metamlst-index.py [-h] [-t TYPINGS] [-s SEQUENCES] [-q DUMP_DB]
                         [-i BUILDINDEX] [-b BUILDBLAST] [-d DB PATH] [--list]
                         [--filter FILTER] [--version]
                         [--bowtie2_threads BOWTIE2_THREADS]
                         [--bowtie2_build BOWTIE2_BUILD]

Builds and manages the MetaMLST SQLite Databases

optional arguments:
  -h, --help            show this help message and exit
  -t TYPINGS, --typings TYPINGS
                        Typings in TAB separated file (Build New Database)
                        (default: None)
  -s SEQUENCES, --sequences SEQUENCES
                        Sequences in FASTA format (comma separated list of
                        files) (default: None)
  -q DUMP_DB, --dump_db DUMP_DB
                        Dump the entire database to file in fasta format)
                        (default: None)
  -i BUILDINDEX, --buildindex BUILDINDEX
                        Build a Bowtie2 Index from the DB (default: None)
  -b BUILDBLAST, --buildblast BUILDBLAST
                        Build a BLAST Index from the DB (default: None)
  -d DB PATH, --database DB PATH
                        MetaMLST Database File (if unset, use the default
                        database. If a file name is given, MetaMLST will
                        create a new DB or update an existing one) (default:
                        [METAMLST_INSTALL_FOLDER]/metamlst_databases/metamlstDB_2018.db)
  --list                Lists all the MLST keys present in the database and
                        exit (default: False)
  --filter FILTER       filters the db for a specific bacterium (default:
                        None)
  --version             Prints version informations (default: False)
  --bowtie2_threads BOWTIE2_THREADS
                        Number of Threads to use with bowtie2-build (default:
                        4)
  --bowtie2_build BOWTIE2_BUILD
                        Full path to the bowtie2-build command to use, deafult
                        assumes that 'bowtie2-build is present in the system
                        path (default: bowtie2-build)

▸ Creating a new database

MetaMLST organizes publicly available MLST data in an internal SQLite database. A premade version of the DB is available with MetaMLST, but you can generate your own starting from your MLST data.

To create a database use:

metamlst-index.py -s MLST_SEQS.fasta -t MLST_TYPES.txt -d NEW_DATABASE.db

where:

  • NEW_DATABASE.db is the path where the database will be created
  • -s specifies the path to the sequences file (see below)
  • -t specifies the path to the typing file (see below)

Note: You can run metamlst-index.py with both -s and -t in a single call (the sequences are added first, then the types), or run the -s and -t phases separately. If you run them separately, the -s step must come first.
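The split run described in the note can be sketched as a small Python wrapper. The flags -s, -t and -d are the ones listed in the usage text. The function names build_index_commands and run_index_commands and the file names are placeholders for illustration, and the sketch assumes metamlst-index.py is available on the system path.

```python
import subprocess


def build_index_commands(db_path, sequences, typings, script="metamlst-index.py"):
    """Return the two command lines for a split run: sequences first, then typings."""
    # Phase 1: add the loci sequences (-s) to the database given with -d.
    add_sequences = [script, "-s", sequences, "-d", db_path]
    # Phase 2: add the sequence types (-t) to the same database.
    add_typings = [script, "-t", typings, "-d", db_path]
    return [add_sequences, add_typings]


def run_index_commands(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_index_commands(commands) to execute them.
    commands = build_index_commands("NEW_DATABASE.db", "MLST_SEQS.fasta", "MLST_TYPES.txt")
    for command in commands:
        print(" ".join(command))
```

Keeping command construction separate from execution makes it easy to check the order of the two phases before anything touches the database.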

▸ Updating an existing database with new sequences and typings

To add types and sequences to an existing database, use:

metamlst-index.py -s MLST_SEQS.fasta -t MLST_TYPES.txt -d MY_DATABASE.db

Please consider that:

  • If you provide a sequence file for a species already in the database, only the new sequences will be added (the existing ones are left unchanged)
  • If you provide a typing file for a species already in the database, the old typing data will be deleted

▸ Loci Sequences File format

This file contains the MLST sequences in FASTA format. If you have multiple sequence files (e.g. one per species) you can either:

  • provide all the sequences in a single file; or
  • provide a comma-separated list of FASTA files; or
  • run metamlst-index.py on the same database file repeatedly, once for each FASTA file.

The file must be formatted in the following way:

  • Sequence IDs: species_locus_alleleID
  • The character "_" is allowed only to separate species, locus and alleleID
  • All sequences should be identical in length, with no gaps (best practice)

You can find an example file here
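The naming rule above can be checked before building the database. The following is a minimal sketch, not part of MetaMLST: the function names are hypothetical, identifiers such as ecoli_adk_1 are made-up examples, and it assumes each FASTA header holds only the sequence ID with no trailing description.

```python
def parse_sequence_id(header):
    """Split a FASTA header such as '>ecoli_adk_1' into (species, locus, allele_id)."""
    name = header.lstrip(">").strip()
    parts = name.split("_")
    # Exactly two underscores are expected: species_locus_alleleID.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected species_locus_alleleID, got {name!r}")
    return tuple(parts)


def check_fasta_ids(lines):
    """Return the malformed sequence IDs found among the FASTA header lines."""
    bad = []
    for line in lines:
        if line.startswith(">"):
            try:
                parse_sequence_id(line)
            except ValueError:
                bad.append(line.lstrip(">").strip())
    return bad
```

An ID such as e_coli_adk_2 would be reported, because the extra "_" makes the species and locus ambiguous.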

▸ Sequence Types (profiles) File format

This file contains the MLST profiles in tab-separated format. If you have multiple typing files (e.g. one per species) you can either:

  • provide a comma-separated list of typing files; or
  • run metamlst-index.py on the same database file repeatedly, once for each typing file.

Typing files from the publicly available repository (PubMLST) can be used, provided that you add on the first line:

#species|Species Extended Name

where species is the MLST name for the species, the same one you use in the sequence IDs of the FASTA file (see above): species_locus_alleleID
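Adding that first line to a downloaded profile table can be sketched as follows. The function name add_species_header and the example key and species name are assumptions made for illustration. The only thing taken from this page is the #species|Species Extended Name layout.

```python
def add_species_header(profile_text, species_key, extended_name):
    """Prepend the '#species|Species Extended Name' line to tab-separated profile text."""
    if profile_text.startswith("#"):
        # A header line is already present, so the text is returned untouched.
        return profile_text
    return f"#{species_key}|{extended_name}\n{profile_text}"


if __name__ == "__main__":
    # Hypothetical two-line profile table, used only to show the result.
    table = "ST\tadk\tfumC\n1\t10\t11\n"
    print(add_species_header(table, "ecoli", "Escherichia coli"))
```

The species key passed here has to match the species prefix used in the FASTA sequence IDs.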

Generally, the typing file must be formatted in the following way:

  • The first line must contain a "#", followed by the MLST key, a "|" and the extended species name (as in #species|Species Extended Name above)
  • The second line must be a table header with 'ST' and the name of each MLST locus
  • The following lines contain the numeric IDs of each profile
  • Columns should not contain any other information (e.g. clonal complexes, ST-complex, etc.). However, the script ignores the following columns: clonal_complex, species, mlst_clade.

You can find an example file here
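The layout rules above can be illustrated with a small reader for typing files. This is a sketch under stated assumptions and not the parser used by metamlst-index.py: the function name parse_typing_text is hypothetical, 'ST' is assumed to be the first column of the header, and the sample data in the usage part is invented.

```python
# Columns that this page says the script ignores.
IGNORED_COLUMNS = {"clonal_complex", "species", "mlst_clade"}


def parse_typing_text(text):
    """Return (key, extended_name, loci, profiles) from typing-file text."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2 or not lines[0].startswith("#") or "|" not in lines[0]:
        raise ValueError("first line must look like #species|Species Extended Name")
    key, extended_name = lines[0][1:].split("|", 1)
    header = lines[1].split("\t")
    if header[0] != "ST":
        raise ValueError("second line must be a header starting with 'ST'")
    # Every header column other than ST and the ignored ones is treated as a locus.
    loci = [column for column in header[1:] if column not in IGNORED_COLUMNS]
    profiles = {}
    for line in lines[2:]:
        row = dict(zip(header, line.split("\t")))
        profiles[row["ST"]] = {locus: row[locus] for locus in loci}
    return key, extended_name, loci, profiles


if __name__ == "__main__":
    sample = "#ecoli|Escherichia coli\nST\tadk\tfumC\tclonal_complex\n1\t10\t11\tCC1\n"
    print(parse_typing_text(sample))
```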