Materials herein were prepared for the 2020 Tiny Earth Symposium by Marc Chevrette (University of Wisconsin-Madison, Wisconsin Institute for Discovery, Tiny Earth Chemistry Hub) and Kristin Labby (Beloit College, Tiny Earth Partner Instructor).
Watch the Tiny Earth Symposium tutorial and workshop here!
If you find antiSMASH useful in your research, please cite the appropriate version in any work or publications. This tutorial currently applies to antiSMASH v5.
From the antiSMASH 5.0 User Manual (accessed 2020-01-08):
Many microbial genomes contain several (up to 30-40) gene clusters encoding the biosynthesis of secondary metabolites. Subsequently mining genetic data has become a very important method in modern screening approaches for bioactive compounds like antibiotics. The antibiotics and secondary metabolites analysis shell antiSMASH is a comprehensive pipeline for the automated mining of finished or draft genome data for the presence of secondary metabolite biosynthetic gene clusters. antiSMASH is an Open Source software written in Python.
The word 'genome' gets thrown around a lot nowadays with ever-decreasing costs of DNA sequencing and ever-increasing piles of sequencing data. Here, we’ll use the word genome to refer to a DNA sequencing assembly.
To generate an assembly of a bacterium’s genome, researchers (including those at the Tiny Earth Chemistry Hub) have to lyse the cells and physically or chemically isolate the DNA. This DNA is then prepped for sequencing (different technologies have different preparation protocols) before being read through a sequencer, one molecule at a time. It’s important to remember that each of these molecules is not the entire genome, but a small piece of it. These pieces are called reads. In Illumina (Solexa) sequencing, reads are between 150 and a few hundred base pairs long. In “long read” technologies like PacBio or Oxford Nanopore, the reads can be ~100,000 base pairs long. Regardless of the read length, we currently don’t have the technology to sequence an entire genome from beginning to end, so we end up with these reads/pieces that we need to put back together. Since a given sequencing run will sequence millions of reads from DNA randomly broken at various loci in the genome, many reads will overlap, and assembly algorithms can use this information to put your genome back together into contigs, or large contiguous stretches of DNA sequence. Keep in mind that this is not a perfect process, so sometimes there is more than one contig after assembly. This could be due to biology (e.g. plasmids will be assembled into their own contigs) or technical reasons (e.g. poor quality DNA prep, insufficient sequencing depth, etc.).
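To make the idea of overlapping reads concrete, here is a toy sketch in Python. It is not how real assemblers work (they handle sequencing errors, reverse complements, and millions of reads); the function names, the made-up reads, and the `min_overlap` parameter are all assumptions for illustration only.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the two sequences that share the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: remaining pieces stay separate contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Made-up reads: the first three overlap end to end, the fourth does not.
reads = ["ATGGCGTAC", "GTACCTGAA", "TGAATTCGG", "CCCCAAAAC"]
contigs = greedy_assemble(reads, min_overlap=4)
for contig in contigs:
    print(len(contig), contig)
```

The three overlapping reads collapse into one contig, while the read that shares no overlap is left as its own contig, loosely analogous to a plasmid or a gap in sequencing coverage.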
Many different formatting conventions exist for files containing sequencing data. For this tutorial, we will focus on one: fasta format. It is quite simple and is the standard for storing assemblies. Nucleotide assemblies are typically stored as fasta files with the file extension *.fasta or *.fna. They have two components:
- The header
- A fasta header is what you (or NCBI or the assembly program) name your sequence. These lines start with a > that is immediately followed by the sequence name (no spaces). Optionally, a description can be added after the name, separated from it by a space. These descriptions are just for humans and are typically ignored by analysis software.
- The sequence itself
- The sequence itself begins on the line just after a header. This can be multiple lines, all on one line, whatever you like. Anything after a given header will be treated as one contiguous sequence, until the next header is seen.
>example_sequence_name Example description. This is my favorite sequence
AAAGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTT
AAAGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTT
AAAGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTT
>another_sequence_name This is not my favorite sequence
GCGCATATGCGCATATGCGCATATGCGCATATGCGCATATGCGCATATGCGCATAT
GCGCATATGCGCATATGCGCATATGCGCATATGCGCATATGCGCATATGCGCATAT
GCGCATATGCGCATATGCGCATATGCGCATATGCGCATATGCGCATATGCGCATAT
In the example above, the T at the end of line 2 is directly connected to the first A of line 3. The newlines (enters) are ignored. Notice that where genes begin and end is not captured here. There are no annotations, just a long string of sequence. This is where antiSMASH comes in: it will predict genes from your fasta, annotate their proposed functions, and identify which genes are arranged in biosynthetic gene clusters.
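The fasta rules described above (header lines start with >, the name ends at the first space, and sequence lines are joined until the next header) can be sketched as a minimal Python reader. The function name `parse_fasta` is an assumption for illustration; for real work, an established library parser is usually a better choice.

```python
def parse_fasta(text):
    """Return {sequence_name: sequence} from fasta-formatted text."""
    chunks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The name is everything after ">" up to the first space;
            # the rest of the header is a human-readable description.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)  # newlines inside a sequence are ignored
    return {n: "".join(pieces) for n, pieces in chunks.items()}


# Rebuild the example from this tutorial.
line_a = "AAAGGGCCCTTT" * 4
line_b = "GCGCATAT" * 7
example = "\n".join([
    ">example_sequence_name Example description. This is my favorite sequence",
    line_a, line_a, line_a,
    ">another_sequence_name This is not my favorite sequence",
    line_b, line_b, line_b,
])

records = parse_fasta(example)
for seq_name, seq in records.items():
    print(seq_name, len(seq))
```

Each header yields one entry, and the three sequence lines under it are joined into a single contiguous string.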
- antiSMASH works with genomic DNA sequences in plain FASTA (nucleotide), EMBL, or GenBank format. Two options exist for using the antiSMASH web interface.
- Use a genomic nucleotide accession number from the NCBI database. (GenBank accession numbers work better than RefSeq ones; RefSeq accessions usually start with “NZ_” or “NC_”. Note that some bacterial genomes have more than one chromosome.) This can be confusing for new users, so when in doubt, use the nucleotide accession number and not another kind of accession number (e.g. assembly, protein, etc.).
- Upload your own genomic sequence as a FASTA file. If you obtain it from the NCBI genome page, download “Genbank (full)” or “FASTA (text)”. Note that "Genbank (full)" is not the same as "Genbank".
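If you prefer to fetch a FASTA file by script rather than through the NCBI web pages, a sketch along these lines may help. It builds a request URL for NCBI's E-utilities `efetch` service; the helper names `fasta_url` and `download_fasta` are assumptions for illustration, and the download function is shown but not called here because it needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession):
    """Build an efetch URL that requests a nucleotide record as FASTA text."""
    query = urlencode({
        "db": "nuccore",      # the NCBI nucleotide database
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH_URL}?{query}"


def download_fasta(accession, path):
    """Save the FASTA record for `accession` to `path` (requires internet)."""
    with urlopen(fasta_url(accession)) as response:
        data = response.read()
    with open(path, "wb") as handle:
        handle.write(data)


print(fasta_url("NC_009142.1"))
```

The saved file can then be uploaded to antiSMASH in the same way as any other FASTA file.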
- In a web browser, open the antiSMASH homepage (bacterial version).
- Enter your email address (you will get an email when results have been processed, ~20-30 mins for typical bacterial genomes).
- Load your sequence in one of two ways:
- Enter NCBI's nucleotide accession number to directly load sequence from NCBI
- Upload your sequence by using the “upload file” button.
- The default settings are recommended in most cases (more on this later).
- “Detection strictness: relaxed” is fine. Leave the following “Extra features” ON: KnownClusterBlast, ActiveSiteFinder, and SubClusterBlast.
- See section 4.1.2 of User Guide for more info on these extended parameters
antiSMASH 5+ can detect BGCs at three stringency levels: strict, relaxed (default), and loose. For a full description of the rules, see the antiSMASH GitHub repository.
- Strict
- This will follow strict rules for finding a BGC. For most biosynthetic classes, this will be the same as relaxed, but for some classes (mostly multimodular biosynthesis) strict and relaxed differ. For example, KS, AT, and ACP domains must all be present for a T1PKS to be identified. Similarly, C, A, and PCP domains must all be present for NRPSs.
- Relaxed
- Same as strict, but not all hallmark domains are necessary to call a cluster. If only one or a few NRPS domains are found and others are missing, these will be designated “NRPS-like” clusters.
- Loose
- Same as relaxed, but new classes are added. These are left out of the default as they are often primary metabolism instead of secondary metabolism (thus “false positives”). These classes include saccharides, fatty acids, and halogen-containing regions.
For almost all use cases, the default of relaxed is recommended.
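The difference between strict and relaxed can be illustrated with a toy rule for NRPS regions. This is only a sketch of the idea described above, not antiSMASH's actual detection rules (those live in the antiSMASH repository and are more involved); the function name `call_nrps` and the way domains are passed in are assumptions.

```python
NRPS_HALLMARKS = {"C", "A", "PCP"}  # hallmark domains named in this tutorial


def call_nrps(domains, strictness="relaxed"):
    """Toy classification of a region from the NRPS domains found in it."""
    found = NRPS_HALLMARKS & set(domains)
    if found == NRPS_HALLMARKS:
        return "NRPS"  # all hallmark domains present: called at every level
    if found and strictness in ("relaxed", "loose"):
        return "NRPS-like"  # some hallmark domains missing
    return None  # strict mode, or no hallmark domains at all


for level in ("strict", "relaxed"):
    print(level, call_nrps(["A", "PCP"], level))
```

A region with only A and PCP domains is not called under strict, but is reported as “NRPS-like” under relaxed.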
From the antiSMASH 5.0 User Manual (accessed 2020-01-08):
- KnownClusterBlast: The identified clusters are searched against the MIBiG repository. MIBiG is a hand curated data collection of biosynthetic gene clusters, which have been experimentally characterized.
- ClusterBlast: The identified clusters are searched against a comprehensive gene cluster database and similar clusters are identified. The algorithm used here is inspired by MultiGeneBlast. It runs BlastP using each amino acid sequence from a detected gene cluster as a query on a large database of predicted protein sequences from secondary metabolite biosynthetic gene clusters, and pools the results to identify the gene clusters that are most homologous to the gene cluster that was detected in your query nucleotide sequence. Please note that selecting this option increases the runtime significantly.
- SubClusterBlast: The identified clusters are searched against a database containing operons involved in the biosynthesis of common secondary metabolite building blocks (e.g. the biosynthesis of non-proteinogenic amino acids).
- ActiveSiteFinder: Active sites of several highly conserved biosynthetic enzymes are detected and variations of the active sites are reported.
- Cluster Pfam analysis: Each gene product encoded in the detected BGCs is analyzed against the PFAM database. Hits are annotated in the final Genbank/EMBL files that can be downloaded after the analysis is finished. Please note that these results are not displayed on the antiSMASH HTML results page but they are present in the results genbank file that can be downloaded. Also, selecting this option normally increases the runtime.
- Pfam-based GO term annotation: This is annotating the Cluster Pfam analysis described above with GO term annotations. Please note that these results are not displayed on the antiSMASH HTML results page but they are present in the results genbank file that can be downloaded (see "Understanding the output" section of the documentation for instructions on how to download the results). Also, selecting this option normally increases the runtime.
For Tiny Earth and related exercises, we recommend selecting KnownClusterBlast, ClusterBlast, SubClusterBlast, and ActiveSiteFinder, and leaving Cluster Pfam analysis and Pfam-based GO term annotation unselected. Depending on the antiSMASH queue, the analysis will then usually finish within 20-30 minutes, which can fit within a single teaching session.
- Results will be emailed to you if an email address was provided. They will also appear in your browser if you leave the page open while the job is running. See below for an annotated example of the antiSMASH results overview page:
Use antiSMASH to identify biosynthetic gene clusters within genomic DNA sequences. You should perform at least two antiSMASH analyses of genomes from:
- One of our TE isolates.
- If available, use the complete genomic nucleotide sequence of one of our isolates, or use a genomic TE sequence file shared from the ChemHub.
- If you had successful PCR and sequencing of the 16S rRNA gene of one of your isolates, you can use the genomic, nucleotide sequence of the top BLAST match.
- An interesting bacterial secondary metabolite.
Choose a natural product from the list below and use antiSMASH to identify the biosynthetic gene cluster responsible for making that bacterial secondary metabolite.
- Table 1, below, is a list of interesting natural products. Your job will be to do some research to find which bacterial organism(s) make that natural product. Then search the NCBI genome database to find the genomic nucleotide accession number for that species. (You may also add your own bacterial natural product of interest to Table 1!)
- Use this genome to run antiSMASH. Please add the species name and accession number to Table 1.
Bacterial secondary metabolite | Student name | Natural product class | Organism | NCBI accession |
---|---|---|---|---|
erythromycin | | | | |
zwittermicin A | | | | |
oxytetracycline | | | | |
actinorhodin | | | | |
violacein | | | | |
salinosporamide | | | | |
cyphomycin | | | | |
keyicin | | | | |
...etc | | | | |
Students can add their own | | | | |
Table 1: List of interesting bacterial secondary metabolites for antiSMASH analysis.
Bacterial secondary metabolite | Natural product class | Biosynthetic gene cluster type | Organism | NCBI accession |
---|---|---|---|---|
erythromycin | macrolide (aka cyclic polyketide) | type 1 polyketide | Saccharopolyspora erythraea NRRL 2338 | NC_009142.1 |
zwittermicin A | aminopolyol | hybrid type 1 polyketide, trans-AT polyketide, nonribosomal peptide | Bacillus cereus UW85 | NZ_LYVD00000000 |
oxytetracycline | polyketide (tetracycline type) | type 2 polyketide | Streptomyces rimosus WT5260 | NZ_CP025551.1 |
actinorhodin | “benzoisochromanequinone dimer” polyketide | type 2 polyketide | Streptomyces coelicolor A3(2) | AL645882.2 |
violacein | alkaloid, “bis-indole pigment” | alkaloid | Chromobacterium violaceum ATCC 12472 | NC_005085.1 |
salinosporamide | | hybrid NRPS-PKS | Salinispora tropica CNB-440 | NC_009380.1 |
cyphomycin | polyketide (polyene type) | type 1 polyketide | Streptomyces sp. ISID311 | NZ_VOQD00000000.1 |
keyicin | anthracycline | type 2 polyketide | Micromonospora sp. WMMB235 | NZ_MDRX01000002.1 |
Table 2: Example instructor key. Not sure which clusters/metabolites/organisms to add? Check out MIBiG.
Note: In Tables 1 and 2, cyphomycin is an example where the producing organism's genome is a little harder to find on NCBI. Hint: you can find a link to the genome on MIBiG's cyphomycin entry.
Another note: The keyicin example will not show a match to keyicin, because keyicin is not currently in MIBiG (v2.0). This is to highlight that even known examples from the literature may not end up in databases. For those wondering, the keyicin gene cluster is the type 2 polyketide region with many glycosyltransferases.
- Analyze antiSMASH results of one of our TE isolates.
- Analyze your results on the overview page and click on each region to find more information. How many total regions were identified within your genomic sequence? Are there any regions that have identifiable biosynthetic gene clusters? (“most similar known cluster”) Be sure to pay attention to similarity – what does this mean?
- Share your results: Add this information to our class spreadsheet of TE isolates. To the column titled “possible secondary metabolites produced”, list secondary metabolite clusters with 70% or higher similarity.
- Share your results: Make a slide(s) in our Google Slides for this project describing the isolate, BLAST results, and antiSMASH results. Include a screen capture of your antiSMASH results. Add structures of the possible secondary metabolites identified. Do a little bit of literature research – how much have these metabolites been studied? Are they reported to have antibacterial activity? Are there any procedures for extraction and isolation?
- Analyze antiSMASH results of a known bacterial secondary metabolite.
- Did antiSMASH identify a region corresponding to the biosynthetic gene cluster for that particular secondary metabolite? What is the percent similarity for this region, and what does that indicate? How many total regions were identified within your genomic sequence? Are there any other regions that have identifiable biosynthetic gene clusters? (“most similar known cluster”) What molecules do these biosynthetic gene clusters produce?
- Share your results: Make a slide(s) in our Google Slides for this project describing your antiSMASH results. Include a screen capture of your antiSMASH results, structure of the bacterial secondary metabolite and answers to the questions above. Do some literature research. Has this biosynthetic gene cluster been characterized? If so, include a figure and reference.
- Did you notice any secondary metabolites that appear in several bacterial species (geosmin, hopene, etc.)? What is the function of these molecules?
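If you have copied the “most similar known cluster” column from an overview page into a list, the 70% similarity cutoff used for the class spreadsheet can be applied with a few lines of Python. The region numbers and similarity values below are made up for illustration, and the function name `likely_metabolites` is an assumption.

```python
# Hypothetical rows transcribed by hand from an antiSMASH overview page.
regions = [
    {"region": "1.1", "most_similar": "geosmin", "similarity": 100},
    {"region": "1.2", "most_similar": "actinorhodin", "similarity": 72},
    {"region": "1.3", "most_similar": "erythromycin", "similarity": 35},
    {"region": "1.4", "most_similar": None, "similarity": 0},  # no known match
]


def likely_metabolites(rows, threshold=70):
    """Names of known clusters matched at or above the similarity threshold."""
    return [
        row["most_similar"]
        for row in rows
        if row["most_similar"] and row["similarity"] >= threshold
    ]


print(likely_metabolites(regions))
```

Regions below the cutoff are not necessarily uninteresting; a low similarity can also point to a cluster that makes a related but different molecule.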
- antiSMASH: antibiotics and Secondary Metabolite Analysis Shell, a bacterial biosynthetic gene cluster prediction and characterization tool
- antiSMASH documentation: a helpful user guide to antiSMASH
- fungiSMASH: similar to antiSMASH, but specifically built for fungal genomes
- plantiSMASH: similar to antiSMASH, but specifically built for plant genomes
- MIBiG: Minimum Information about a Biosynthetic Gene cluster repository, a large, curated repository of biosynthetic gene clusters with annotations and links to relevant publications and/or genomic data
- ARTS: Antibiotic Resistant Target Seeker, a tool to help identify resistance genes within BGCs.