Tiny Earth antiSMASH Tutorial

Materials herein were prepared for the 2020 Tiny Earth Symposium by Marc Chevrette (University of Wisconsin-Madison, Wisconsin Institute for Discovery, Tiny Earth Chemistry Hub) and Kristin Labby (Beloit College, Tiny Earth Partner Instructor).

Watch the Tiny Earth Symposium tutorial and workshop here!

Attribution

If you find antiSMASH useful in your research, please cite the appropriate version in any work or publications. This tutorial currently applies to antiSMASH v5.

Introduction

From the antiSMASH 5.0 User Manual (accessed 2020-01-08):

Many microbial genomes contain several (up to 30-40) gene clusters encoding the biosynthesis of secondary metabolites. Subsequently mining genetic data has become a very important method in modern screening approaches for bioactive compounds like antibiotics. The antibiotics and secondary metabolites analysis shell antiSMASH is a comprehensive pipeline for the automated mining of finished or draft genome data for the presence of secondary metabolite biosynthetic gene clusters. antiSMASH is an Open Source software written in Python.

What is a genome, anyway?

The word 'genome' gets thrown around a lot nowadays with ever-decreasing costs of DNA sequencing and ever-increasing piles of sequencing data. Here, we’ll use the word genome to refer to a DNA sequencing assembly.

A brief overview of DNA sequencing

To generate an assembly of a bacterium’s genome, researchers (including those at the Tiny Earth Chemistry Hub) have to lyse the cells and physically or chemically isolate the DNA. This DNA is then prepped for sequencing (different technologies have different preparation protocols) before being read through a sequencer, one molecule at a time. It’s important to remember that each of these molecules is not the entire genome, but a small piece of it. These pieces are called reads. In Illumina (Solexa) sequencing, reads are between 150 and a few hundred base pairs long. In “long read” technologies like PacBio or Oxford Nanopore, the reads can be ~100,000 base pairs long. Regardless of the read length, we currently don’t have the technology to sequence an entire genome from beginning to end, so we end up with these reads/pieces that we need to put back together. Since a given sequencing run will sequence millions of these reads, randomly broken at various loci in the genome, a lot of reads will overlap, and assembly algorithms can use this information to put your genome back together into contigs, or large contiguous stretches of DNA sequence. Keep in mind that this is not a perfect process, so sometimes there is more than one contig after assembly. This could be due to biology (e.g. plasmids will be assembled into their own contigs) or technical reasons (e.g. poor quality DNA prep, insufficient sequencing depth, etc.).
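As a rough illustration of how overlapping reads can be stitched into contigs, here is a minimal Python sketch of a greedy overlap merge. It is a toy under simplifying assumptions (error-free reads, a single strand, no repeats); real assemblers use far more sophisticated graph-based methods. The function names `overlap` and `greedy_assemble` and the example reads are made up for this illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for size in range(max_len, min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the longest overlap.

    Stops when no pair overlaps by at least `min_len` bases; whatever
    is left is returned as the list of contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                size = overlap(a, b, min_len)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        # Join the pair, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Hypothetical toy reads: three overlap, one does not.
    reads = ["ATGGCGTAC", "CGTACGTTA", "CGTTAGC", "GGGCCCAAA"]
    for contig in greedy_assemble(reads):
        print(len(contig), contig)
```

Here the first three toy reads overlap and collapse into one contig, while the fourth shares no overlap with the others and stays on its own, much like a plasmid would.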

Fasta format

Many different formatting conventions exist for files containing sequencing data. For this tutorial, we will focus on one: fasta format. It is quite simple and is the standard for storing assemblies. Nucleotide assemblies are typically stored as fasta files with the file extension *.fasta or *.fna. They have two components:

The header
- A fasta header is what you (or NCBI or the assembly program) name your sequence. These lines start with a > and are immediately followed by the sequence name (no spaces). Optionally, a description can follow the name, separated by a space. These descriptions are just for humans and are typically ignored by analysis software.
The sequence itself
- The sequence itself begins on the line just after a header. This can be multiple lines, all on one line, whatever you like. Anything after a given header will be treated as one contiguous sequence, until the next header is seen.

>example_sequence_name Example description. This is my favorite sequence
AAAGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTT
AAAGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTT
AAAGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTT
>another_sequence_name This is not my favorite sequence
GCGCATATGCGCATATGCGCATATGCGCATATGCGCATATGCGCATATGCGCATAT
GCGCATATGCGCATATGCGCATATGCGCATATGCGCATATGCGCATATGCGCATAT
GCGCATATGCGCATATGCGCATATGCGCATATGCGCATATGCGCATATGCGCATAT

In the example above, the T at the end of line 2 is directly connected to the first A of line 3. The newlines (enters) are ignored. Notice that where genes begin and end is not captured here. No annotations, just a long string of sequence. This is where antiSMASH comes in: it will predict genes from your fasta, annotate their proposed functions, and identify which genes are arranged in biosynthetic gene clusters.
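To make the header/sequence rules concrete, here is a small Python sketch of a fasta reader that follows the conventions described above: the name is the first word after the >, the rest of the header is treated as a description and ignored, and newlines within a sequence are dropped. The function name `parse_fasta` is an assumption for this example; for real work, an established library is usually a better choice.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse fasta-formatted text into a {sequence_name: sequence} dictionary."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The name is everything up to the first space; the rest is description.
            fields = line[1:].split(maxsplit=1)
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            # Sequence lines are joined directly; newlines carry no meaning.
            records[name] += line
    return records


if __name__ == "__main__":
    example = """>example_sequence_name Example description. This is my favorite sequence
AAAGGGCCCTTT
AAAGGGCCCTTT
>another_sequence_name This is not my favorite sequence
GCGCATAT
GCGCATAT
"""
    for seq_name, seq in parse_fasta(example).items():
        print(seq_name, len(seq), seq)
```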

Using antiSMASH to identify biosynthetic gene clusters within a genomic DNA sequence

Directions for general use of antiSMASH (Tiny Earth specific instructions to follow)

Prepare, locate, identify your genomic DNA sequence

antiSMASH works with genomic DNA sequences in plain FASTA (nucleotide) format, EMBL, or Genbank. Two options exist for using the antiSMASH web interface.
1. Use a genomic nucleotide accession number from the NCBI database. (A GenBank accession number works better than RefSeq; RefSeq accessions usually start with “NZ_” or “NC_”. Note that some bacterial genomes may have more than one chromosome.) This can be confusing for new users, so when in doubt, use the nucleotide accession number and not other accession numbers (e.g. assembly, protein, etc.).
2. Upload your own genomic sequence as a FASTA file. If obtaining from NCBI genome page, download “Genbank (full)” or “FASTA (text)”. Note that "Genbank (full)" is not the same as "Genbank".
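Because mixing up accession types is a common stumbling block, the following Python sketch shows one way to sanity-check an accession before submitting it. It is a simplified heuristic that only covers a handful of well-known NCBI prefixes (GCA_/GCF_ for assemblies, WP_/NP_/YP_ for RefSeq proteins, NC_/NZ_ for RefSeq nucleotide records); it is not an exhaustive description of NCBI accession formats, and the function name `classify_accession` is made up for this example.

```python
import re


def classify_accession(accession: str) -> str:
    """Guess the kind of NCBI accession from its prefix (heuristic only)."""
    acc = accession.strip().upper()
    if re.match(r"^GC[AF]_\d+", acc):
        return "assembly"
    if re.match(r"^(WP|NP|YP)_\d+", acc):
        return "RefSeq protein"
    if re.match(r"^(NC|NZ)_[A-Z0-9]+", acc):
        return "RefSeq nucleotide"
    # One or two letters followed by digits, e.g. AL645882.2
    if re.match(r"^[A-Z]{1,2}\d{5,8}(\.\d+)?$", acc):
        return "GenBank nucleotide"
    return "unknown"


if __name__ == "__main__":
    # The last two accessions are made-up placeholders, not real records.
    for acc in ["NC_009142.1", "AL645882.2", "GCF_000000000.1", "WP_000000000.1"]:
        print(acc, "->", classify_accession(acc))
```

Anything reported as an assembly or protein accession is a hint that the wrong identifier was copied from NCBI.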

Job submission

In a web browser, open the antiSMASH homepage (bacterial version).
Enter your email address (you will get an email when results have been processed, ~20-30 mins for typical bacterial genomes).
Load your sequence in one of two ways:
1. Enter the NCBI nucleotide accession number to load the sequence directly from NCBI.
2. Upload your sequence by using the “upload file” button.
The default settings are recommended in most cases (more on this later).
- “Detection strictness: relaxed” is fine. Leave the following “Extra features” ON: KnownClusterBlast, ActiveSiteFinder, and SubClusterBlast.
- See section 4.1.2 of the User Guide for more information on these extended parameters.

Detection stringency

antiSMASH 5+ can detect BGCs at three stringency levels: strict, relaxed (default), and loose. For a full description of the rules, see the antiSMASH GitHub repository.

Strict
- This will follow strict rules for finding a BGC. For most biosynthetic classes, the result will be the same as relaxed, but some classes (mostly multimodular biosynthesis) will differ between strict and relaxed. For example, KS, AT, and ACP domains must all be present for a T1PKS to be identified. Similarly, C, A, and PCP domains must all be present for NRPSs.
Relaxed
- Same as strict, but not all hallmark domains are necessary to call a cluster. If only one or a few NRPS domains are found and some are not seen, these will be designated “NRPS-like” clusters.
Loose
- Same as relaxed, but new classes are added. These are left out of the default as they are often primary metabolism instead of secondary metabolism (thus “false-positives”). These classes include saccharides, fatty acids, and halogen-containing regions.

For almost all use cases, the default of relaxed is recommended.
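The difference between strict and relaxed can be pictured with a toy Python sketch based only on the NRPS example above: all three hallmark domains (C, A, PCP) give an NRPS call at any level, while a partial set is reported as “NRPS-like” only when the level is relaxed or loose. This is an illustration of the idea, not antiSMASH's actual rule implementation (the real rules live in the antiSMASH repository and are considerably richer); `call_nrps` and `NRPS_HALLMARKS` are names invented here.

```python
from typing import Optional

# Hallmark domains named in the text for nonribosomal peptide synthetases.
NRPS_HALLMARKS = {"C", "A", "PCP"}


def call_nrps(domains: set[str], strictness: str = "relaxed") -> Optional[str]:
    """Toy cluster call for a set of detected domain names."""
    if strictness not in ("strict", "relaxed", "loose"):
        raise ValueError(f"unknown strictness: {strictness}")
    found = NRPS_HALLMARKS & domains
    if found == NRPS_HALLMARKS:
        return "NRPS"  # every hallmark domain is present
    if found and strictness != "strict":
        return "NRPS-like"  # partial evidence is only accepted below strict
    return None


if __name__ == "__main__":
    for level in ("strict", "relaxed"):
        print(level, call_nrps({"A", "PCP"}, level))
```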

Other parameters

From the antiSMASH 5.0 User Manual (accessed 2020-01-08):

KnownClusterBlast

The identified clusters are searched against the MIBiG repository. MIBiG is a hand curated data collection of biosynthetic gene clusters, which have been experimentally characterized.

ClusterBlast

The identified clusters are searched against a comprehensive gene cluster database and similar clusters are identified. The algorithm used here is inspired by MultiGeneBlast. It runs BlastP using each amino acid sequence from a detected gene cluster as a query on a large database of predicted protein sequences from secondary metabolite biosynthetic gene clusters, and pools the results to identify the gene clusters that are most homologous to the gene cluster that was detected in your query nucleotide sequence. Please note that selecting this option increases the runtime significantly.

SubClusterBlast

The identified clusters are searched against a database containing operons involved in the biosynthesis of common secondary metabolite building blocks (e.g. the biosynthesis of non-proteinogenic amino acids).

ActiveSiteFinder

Active sites of several highly conserved biosynthetic enzymes are detected and variations of the active sites are reported.

Cluster Pfam analysis

Each gene product encoded in the detected BGCs is analyzed against the PFAM database. Hits are annotated in the final Genbank/EMBL files that can be downloaded after the analysis is finished. Please note that these results are not displayed on the antiSMASH HTML results page but they are present in the results genbank file that can be downloaded. Also, selecting this option normally increases the runtime.

Pfam-based GO term annotation

This is annotating the Cluster Pfam analysis described above with GO term annotations. Please note that these results are not displayed on the antiSMASH HTML results page but they are present in the results genbank file that can be downloaded (see "Understanding the output" section of the documentation for instructions on how to download the results). Also, selecting this option normally increases the runtime.

For Tiny Earth and related exercises, we recommend selecting KnownClusterBlast, ClusterBlast, SubClusterBlast, and ActiveSiteFinder, and leaving Cluster Pfam analysis and Pfam-based GO term annotation unselected. Depending on the antiSMASH queue, this will (usually) allow the analysis to finish within 20-30 minutes, so it could fit within a single teaching session.

Interpreting results

Results will be emailed to you if an email address was provided. They will also appear in your browser if you leave the page open while the job is running. See below for an annotated example of an antiSMASH results overview page:

Using antiSMASH in a Tiny Earth course

Procedure

Use antiSMASH to identify biosynthetic gene clusters within genomic DNA sequences. You should perform at least two antiSMASH analyses of genomes from:

One of our TE isolates.
1. If available, use the complete genomic nucleotide sequence of one of our isolates, or use a genomic TE sequence file shared from the ChemHub.
2. If you had successful PCR and sequencing of the 16S rRNA gene of one of your isolates, you can use the genomic nucleotide sequence of the top BLAST match.
An interesting bacterial secondary metabolite. Use antiSMASH to identify the biosynthetic gene cluster responsible for making one of the bacterial secondary metabolites listed below.
1. Table 1, below, is a list of interesting natural products. Your job will be to do some research to find which bacterial organism(s) make that natural product. Then search the NCBI genome database to find the genomic nucleotide accession number for that species. (You may also add your own bacterial natural product of interest to Table 1!)
2. Use this genome to run antiSMASH. Please add the species name and accession number to Table 1.

Bacterial secondary metabolite	Student name	Natural product class	Organism	NCBI accession
erythromycin
zwittermicin A
oxytetracycline
actinorhodin
violacein
salinosporamide
cyphomycin
keyicin
...etc
Students can add their own

Table 1: List of interesting bacterial secondary metabolites for antiSMASH analysis.

Bacterial secondary metabolite	Natural product class	Biosynthetic gene cluster type	Organism	NCBI accession
erythromycin	macrolide (aka cyclic polyketide)	type 1 polyketide	Saccharopolyspora erythraea NRRL 2338	NC_009142.1
zwittermicin A	aminopolyol	hybrid type 1 polyketide, trans-AT polyketide, nonribosomal peptide	Bacillus cereus UW85	NZ_LYVD00000000
oxytetracycline	polyketide (tetracycline type)	type 2 polyketide	Streptomyces rimosus WT5260	NZ_CP025551.1
actinorhodin	“benzoisochromanequinone dimer” polyketide	type 2 polyketide	Streptomyces coelicolor A3(2)	AL645882.2
violacein	alkaloid, “bis-indole pigment”	alkaloid	Chromobacterium violaceum ATCC 12472	NC_005085.1
salinosporamide		hybrid NRPS-PKS	Salinispora tropica CNB-440	NC_009380.1
cyphomycin	polyketide (polyene type)	type 1 polyketide	Streptomyces sp. ISID311	NZ_VOQD00000000.1
keyicin	anthracycline	type 2 polyketide	Micromonospora sp. WMMB235	NZ_MDRX01000002.1

Table 2: Example instructor key. Not sure which clusters/metabolites/organisms to add? Check out MIBiG.

Note: In Tables 1 and 2, cyphomycin is an example of a producing organism's genome that is a little harder to find on NCBI. Hint: you can find a link to the genome on MIBiG's cyphomycin entry.

Another note: The keyicin example will not show a match to keyicin, because keyicin is not currently in MIBiG (v2.0). This is to highlight that even known examples from the literature may not end up in databases. For those wondering, the keyicin gene cluster is the type 2 polyketide region with many glycosyltransferases.

Results and Discussion

Example classroom activity

Analyze antiSMASH results of one of our TE isolates.
1. Analyze your results on the overview page and click on each region to find more information. How many total regions were identified within your genomic sequence? Are there any regions that have identifiable biosynthetic gene clusters? (“most similar known cluster”) Be sure to pay attention to similarity – what does this mean?
2. Share your results: Add this information to our class spreadsheet of TE isolates. To the column titled “possible secondary metabolites produced”, list secondary metabolite clusters with 70% or higher similarity.
3. Share your results: Make a slide(s) in our Google Slides for this project describing the isolate, BLAST results, and antiSMASH results. Include a screen capture of your antiSMASH results. Add structures of the possible secondary metabolites identified. Do a little bit of literature research – how much have these metabolites been studied? Are they reported to have antibacterial activity? Are there any procedures for extraction and isolation?
Analyze antiSMASH results of a known bacterial secondary metabolite.
1. Did antiSMASH identify a region corresponding to the biosynthetic gene cluster for that particular secondary metabolite? What is the percent similarity for this region, and what does that indicate? How many total regions were identified within your genomic sequence? Are there any other regions that have identifiable biosynthetic gene clusters? (“most similar known cluster”) What molecules do these biosynthetic gene clusters produce?
2. Share your results: Make a slide(s) in our Google Slides for this project describing your antiSMASH results. Include a screen capture of your antiSMASH results, structure of the bacterial secondary metabolite and answers to the questions above. Do some literature research. Has this biosynthetic gene cluster been characterized? If so, include a figure and reference.
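For the class spreadsheet step in the first activity (listing clusters with 70% or higher similarity), a short Python sketch like the one below could be used once the overview table has been typed or copied into a list. The `RegionHit` structure, the `likely_metabolites` function, and all values in the example are hypothetical and only meant to show the filtering logic; they are not antiSMASH output.

```python
from typing import NamedTuple


class RegionHit(NamedTuple):
    region: str        # region label from the overview page
    bgc_type: str      # predicted cluster type
    most_similar: str  # "most similar known cluster" ("" if none reported)
    similarity: int    # percent similarity to that known cluster


def likely_metabolites(hits: list[RegionHit], threshold: int = 70) -> list[str]:
    """Return known-cluster names whose similarity meets the threshold."""
    return [
        hit.most_similar
        for hit in hits
        if hit.most_similar and hit.similarity >= threshold
    ]


if __name__ == "__main__":
    # Made-up example rows, not real antiSMASH results.
    hits = [
        RegionHit("1.1", "terpene", "geosmin", 100),
        RegionHit("1.2", "NRPS", "", 0),
        RegionHit("1.3", "T1PKS", "erythromycin", 45),
    ]
    print(likely_metabolites(hits))
```

The threshold is a parameter so a class can discuss how the list changes at stricter or looser cutoffs.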

Example topic for classroom discussion

Did you notice any secondary metabolites that appear in several bacterial species (geosmin, hopene, etc.)? What is the function of these molecules?

Resources and links

antiSMASH: antibiotics and Secondary Metabolite Analysis Shell, a bacterial biosynthetic gene cluster prediction and characterization tool
antiSMASH documentation: a helpful user guide to antiSMASH
fungiSMASH: similar to antiSMASH, but specifically built for fungal genomes
plantiSMASH: similar to antiSMASH, but specifically built for plant genomes
MIBiG: Minimum Information about a Biosynthetic Gene cluster repository, a large, curated repository of biosynthetic gene clusters with annotations and links to relevant publications and/or genomic data
ARTS: Antibiotic Resistant Target Seeker, a tool to help identify resistance genes within BGCs.
