METabolic And BiogeOchemistry anaLyses In miCrobes
Current Version: 4.0
Tested on: Linux Ubuntu 16.04.6 LTS (GNU/Linux 4.15.0-101-generic x86_64) (June 2020)
This software predicts metabolic and biogeochemical functional trait profiles for any given genome dataset. These genome datasets can be metagenome-assembled genomes (MAGs), single-cell amplified genomes (SAGs), or sequenced genomes from pure cultures. METABOLIC has two main implementations, METABOLIC-G and METABOLIC-C. METABOLIC-G.pl generates metabolic profiles and biogeochemical cycling diagrams for the input genomes and does not require sequencing reads. METABOLIC-C.pl generates the same output as METABOLIC-G.pl but also accepts metagenomic read data, which it uses to generate information about community metabolism and to calculate genome coverage. The information is parsed, and diagrams for elemental/biogeochemical cycling pathways (currently Nitrogen, Carbon, Sulfur and "other") are produced.
Program Name | Program Description |
---|---|
METABOLIC-G.pl | Allows for classification of the metabolic capabilities of input genomes. |
METABOLIC-C.pl | Allows for classification of the metabolic capabilities of input genomes, calculation of genome coverage, creation of biogeochemical cycling diagrams, and visualization of community metabolic interactions and energy flow. |
If you are using this program, please consider citing our preprint, available on BioRxiv:
Zhou Z, Tran P, Liu Y, Kieft K, Anantharaman K. "METABOLIC: A scalable high-throughput metabolic and biogeochemical functional trait profiler based on microbial genomes" (2019). BioRxiv doi: https://doi.org/10.1101/761643
- Version History
- System Requirements
- Dependencies Overview
- Detailed Dependencies
- Installation Instructions
  a. Quick Installation
- Running METABOLIC
  a. Required and Optional Flags
  b. How to Run
- METABOLIC Output Descriptions
  a. Outputs Overview
  b. Outputs Detailed
- Copyright
v4.0 -- Jun 22, 2020 --
- METABOLIC now uses an R script to generate METABOLIC_result.xlsx, which fixes issues with the generation of a corrupt METABOLIC_result.xlsx file
- Test input data now includes both five nucleotide fasta files and one set of paired sequencing reads, allowing all capabilities of both METABOLIC-G.pl and METABOLIC-C.pl to be tested
- The MN-score table has been provided as one of the results by METABOLIC-C
- Updated the motif checking step for pmo/amo, dsrE/tusD, dsrH/tusB, and dsrF/tusC
- Updated the "reads-type" option allowing the use of metatranscriptomic reads to conduct community analysis
v3.0 -- Feb 18, 2020 --
- Provide an option to let the user reduce the size of Kofam Hmm profiles (only use KOs that can be found in Modules) to speed up the calculation
- Change HMMER to v3.3 to speed up the calculation
v2.0 -- Nov 5, 2019 --
- Add more functions on visualization, add more annotations, make the software faster
v1.3 -- Sep 5, 2019 --
- Fix the output folder problem; the perl script can now be called from a location other than its original directory
v1.2 -- Sep 5, 2019 --
- Fix the prodigal parallel run, change "working-dir" to "METABOLIC-dir"
v1.1 -- Sep 4, 2019 --
- Fix the parallel problem, change from hmmscan to hmmsearch, and update the "METABOLIC_template_and_database"
System Memory Requirements:
- Due to the requirements of some of this program's dependencies, it is highly recommended that METABOLIC-C be run on a system with at least 100 GB of memory.
- METABOLIC-G is not as demanding as METABOLIC-C and requires significantly less memory to run.
System Storage Requirements:
If you are planning to use only METABOLIC-G, you don't need to install GTDB-Tk.
Necessary Databases | Approximate System Storage Required |
---|---|
METABOLIC program with unzipped files | 7.69 GB |
GTDB-Tk Reference Data | 28 GB |
- Perl (>= v5.010)
- HMMER (>= v3.1b2)
- Prodigal (>= v2.6.3)
- Sambamba (>= v0.7.0) (only for METABOLIC-C)
- BAMtools (>= v2.4.0) (only for METABOLIC-C)
- CoverM (only for METABOLIC-C)
- R (>= 3.6.0)
- Diamond
- Samtools (only for METABOLIC-C)
- Bowtie2 (only for METABOLIC-C)
- GTDB-Tk (only for METABOLIC-C)
Each of these programs should be in the PATH so that they can be accessed regardless of location.
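A quick pre-flight check can confirm that the programs listed above are reachable through the PATH. The sketch below is not part of METABOLIC. The executable names (for example `hmmsearch` for HMMER, `Rscript` for R, `gtdbtk` for GTDB-Tk) are assumptions about how each package is typically installed and may differ on your system; it also does not check versions.

```python
import shutil

# Executable names are assumptions about typical installs, not taken from METABOLIC.
TOOLS_G = ["perl", "hmmsearch", "prodigal", "Rscript", "diamond"]
TOOLS_C_ONLY = ["sambamba", "bamtools", "coverm", "samtools", "bowtie2", "gtdbtk"]


def missing_tools(names, which=shutil.which):
    """Return the names that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    for label, tools in (("METABOLIC-G", TOOLS_G), ("METABOLIC-C only", TOOLS_C_ONLY)):
        missing = missing_tools(tools)
        status = "all found" if not missing else "missing: " + ", ".join(missing)
        print(f"{label}: {status}")
```

If you only plan to run METABOLIC-G, the second group can be ignored.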
Perl Modules:
To install, use the cpan shell by entering "perl -MCPAN -e shell cpan" and then entering
"install [Module Name]", or install by using "cpan -i [Module Name]", or by entering
"cpanm [Module Name]".
Example 1:
perl -MCPAN -e shell cpan
install Data::Dumper
Example 2:
cpan -i Data::Dumper
Example 3:
cpanm Data::Dumper
1. Data::Dumper
2. POSIX
3. Getopt::Long
4. Statistics::Descriptive
5. Array::Split
6. Bio::SeqIO
7. Bio::Perl
8. Bio::Tools::CodonTable
9. Carp
10. File::Spec
11. File::Basename
12. Parallel::ForkManager
Note for later: The last three lines in "run_to_setup.sh" are used to download "METABOLIC_test_files.tgz" from a Google Drive. It requires gdown. gdown can be simply installed by calling "pip install gdown".
R Packages
To install, open the R command line interface by entering "R" into the command line, and then enter
"install.packages("[Package Name]")".
Example:
R
install.packages("diagram")
q()
1. diagram (v1.6.4)
2. forcats (v0.5.0)
3. digest (v0.6.25)
4. htmltools (v0.4.0)
5. rmarkdown (v2.1)
6. reprex (v0.3.0)
7. tidyverse (v1.3.0)
8. ggthemes (v4.2.0)
9. ggalluvial (v0.11.3)
10. reshape2 (v1.4.3)
11. ggraph (v2.0.2)
12. pdftools (v2.3)
13. igraph (v1.2.5)
14. tidygraph (v1.1.2)
15. stringr (v1.4.0)
16. plyr (v1.8.6)
17. dplyr (v0.8.5)
18. openxlsx (v4.1.4)
To ensure efficient and successful installation of METABOLIC, make sure that all dependencies are properly installed prior to download of the METABOLIC software.
- Go to where you want the program to be and clone the github repository by using the following command:
git clone https://github.com/AnantharamanLab/METABOLIC.git
or click the green "download ZIP" button at the top of the GitHub page and unzip the downloaded file.
The perl and R scripts and dependent databases should be kept in the same directory.
NOTE: Before following the next step, make sure your working directory is the directory that was created by the METABOLIC download, that is, the directory containing the main scripts for METABOLIC (METABOLIC-G.pl, METABOLIC-C.pl, etc.).
We provide a "run_to_setup.sh" script along with the data downloaded from the GitHub for easy setup of dependent databases. This can be run by using the following command:
sh run_to_setup.sh
Notice: The last three lines in "run_to_setup.sh" are used to download "METABOLIC_test_files.tgz" from Google Drive. This requires gdown, which can be installed by running "pip install gdown". Please also refer to the gdown documentation.
To view the options that METABOLIC-C.pl and METABOLIC-G.pl have, please type:
perl METABOLIC-G.pl -help
perl METABOLIC-C.pl -help
- -in-gn [required if you are starting from nucleotide fasta files] Defines the location of the FOLDER containing the genome nucleotide fasta files ending with ".fasta" to be run by this program
- -in [required if you are starting from faa files] Defines the location of the FOLDER containing the genome amino acid files ending with ".faa" to be run by this program
- -r [required] Defines the path to a text file containing the location of paired reads
- -rt [optional] Defines the option to use "metaG" or "metaT" to indicate whether you use the metagenomic reads or metatranscriptomic reads (default: 'metaG')
- -t [optional] Defines the number of threads to run the program with (Default: 20)
- -m-cutoff [optional] Defines the fraction of KEGG module steps present to designate a KEGG module as present (Default: 0.75)
- -kofam-db [optional] Defines the use of the full ("full") or reduced ("small") KOfam database by the program (Default: 'full')
- -p [optional] Defines the prodigal method used to annotate ORFs ("meta" or "single")
- -o [optional] Defines the output directory to be created by the program (Default: current directory)
The directory specified by the "-in-gn" flag should contain the nucleotide sequences for your genomes, with the file extension ".fasta". If you are supplying amino acid fasta files for each genome, these should be contained within a directory and have the file extension ".faa", and you will use the "-in" option instead. Ensure that the fasta headers of each .fasta or .faa file are unique, and that your file names do not contain spaces. If you want to use METABOLIC-C, only nucleotide ".fasta" files are accepted as input.
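The naming rules above can be checked before a run. The following is a minimal sketch, not part of METABOLIC: the function name `check_genome_folder` is hypothetical, it reads "unique headers" as unique across all files in the folder (the stricter interpretation), and it compares only the first whitespace-delimited token of each header line.

```python
from pathlib import Path


def check_genome_folder(folder, extension=".fasta"):
    """Return a list of problems found in an input folder (empty list = none found)."""
    problems = []
    files = sorted(Path(folder).glob("*" + extension))
    if not files:
        problems.append(f"no {extension} files found in {folder}")
    seen = {}  # sequence ID -> name of the first file in which it appeared
    for path in files:
        if " " in path.name:
            problems.append(f"file name contains a space: {path.name}")
        with open(path) as handle:
            for line in handle:
                if not line.startswith(">"):
                    continue  # sequence line, not a header
                tokens = line[1:].split()
                header = tokens[0] if tokens else ""
                if header in seen:
                    problems.append(
                        f"duplicate header '{header}' in {path.name} "
                        f"(first seen in {seen[header]})"
                    )
                else:
                    seen[header] = path.name
    return problems
```

For amino acid input, the same check could be run as `check_genome_folder(folder, extension=".faa")`.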
The "-r" flag takes a text file defining the paths of the metagenomic reads (if running METABOLIC-C). The metagenomic reads are the read datasets that you used to generate the MAGs. Before you run METABOLIC-C, confirm that you are using unzipped fastq files rather than zipped files. Each set of paired reads is entered on one line, with the two files separated by a ",". A sample of this text file is as follows:
#Read pairs:
SRR3577362_sub_1.fastq,SRR3577362_sub_2.fastq
SRR3577362_sub2_1.fastq,SRR3577362_sub2_2.fastq
Note that the two different sets of paired reads are separated by a line return (new line). Please avoid empty lines in this text file; otherwise the software will take blank read files as inputs.
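The rules for the reads list (one comma-separated pair per line, no empty lines, unzipped fastq files) can be checked with a short script before launching METABOLIC-C. This is a sketch under stated assumptions, not METABOLIC's own parser: the function name `parse_reads_list` is hypothetical, lines starting with "#" are treated as a header as in the sample above, and a ".gz" suffix is used as the only sign of a compressed file.

```python
import os


def parse_reads_list(path):
    """Parse a reads list with one 'forward,reverse' pair per line.

    Returns (pairs, problems); problems is empty when nothing suspicious is found.
    """
    pairs, problems = [], []
    with open(path) as handle:
        for number, raw in enumerate(handle, start=1):
            line = raw.strip()
            if line.startswith("#"):
                continue  # header line such as "#Read pairs:"
            if not line:
                problems.append(f"line {number}: empty line")
                continue
            fields = [field.strip() for field in line.split(",")]
            if len(fields) != 2 or not all(fields):
                problems.append(f"line {number}: expected two comma-separated paths")
                continue
            for field in fields:
                if field.endswith(".gz"):
                    problems.append(f"line {number}: {field} looks compressed")
                elif not os.path.isfile(field):
                    problems.append(f"line {number}: {field} not found")
            pairs.append(tuple(fields))
    return pairs, problems
```

Relative paths are resolved against the current working directory, so run the check from the directory where you intend to start METABOLIC-C.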
The main METABOLIC directory also contains a set of 5 genomes and one set of paired metagenomic reads, which can be used to test that METABOLIC-G and METABOLIC-C were installed correctly. These genomes and reads can be found within the directory METABOLIC_test_files/, which is contained within the METABOLIC program directory.
METABOLIC-C.pl and METABOLIC-G.pl can be run with the test data by using the -test true function of METABOLIC:
perl METABOLIC-G.pl -test true
perl METABOLIC-C.pl -test true
The main scripts used to run the program are METABOLIC-G.pl and METABOLIC-C.pl.
To run METABOLIC-G starting from nucleotide sequences, AT LEAST the following flags should be used:
perl METABOLIC-G.pl -in-gn [path_to_folder_with_genome_files] -o [output_directory_to_be_created]
To run METABOLIC-G starting from amino acid sequences, AT LEAST the following flags should be used:
perl METABOLIC-G.pl -in [path_to_folder_with_genome_files] -o [output_directory_to_be_created]
To run METABOLIC-C, AT LEAST the following flags should be used:
perl METABOLIC-C.pl -in-gn [path_to_folder_with_genome_files] -r [path_to_list_of_paired_reads] -o [output_directory_to_be_created]
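When METABOLIC-C is launched from a pipeline script, the command line can be assembled programmatically. The sketch below uses only flags documented in this README; the function name `build_metabolic_c_command` and the example paths are hypothetical, and the script must be started from the METABOLIC program directory (or the script path adjusted).

```python
def build_metabolic_c_command(genome_dir, reads_list, out_dir, threads=None,
                              reads_type=None, module_cutoff=None, kofam_db=None):
    """Assemble an argument list for METABOLIC-C.pl from the documented flags."""
    command = ["perl", "METABOLIC-C.pl",
               "-in-gn", str(genome_dir), "-r", str(reads_list), "-o", str(out_dir)]
    optional = (("-t", threads), ("-rt", reads_type),
                ("-m-cutoff", module_cutoff), ("-kofam-db", kofam_db))
    for flag, value in optional:
        if value is not None:  # omitted flags fall back to METABOLIC's defaults
            command += [flag, str(value)]
    return command


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = build_metabolic_c_command("genomes/", "reads.txt", "out/",
                                    threads=8, kofam_db="small")
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to start the run.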
Output File | File Description | Generated by METABOLIC-C | Generated by METABOLIC-G |
---|---|---|---|
All_gene_collections_mapped.depth.txt | The gene depth of all input genes | X | |
Each_HMM_Amino_Acid_Sequence/ | The faa collection for each hmm file | X | X |
intermediate_files/ | The hmmsearch, peptides (MEROPS), and CAZymes (dbCAN2) running intermediate files | X | X |
KEGG_identifier_result/ | The hits and results for each genome from the KOfam database | X | X |
METABOLIC_Figures/ | All figures output from the running of METABOLIC | X | X |
METABOLIC_Figures_Input/ | All input files for R-generated diagrams | X | X |
METABOLIC_result_each_spreadsheet/ | TSV files representing each sheet of the created METABOLIC_result.xlsx file | X | X |
MN-score_result/ | The resulting MN-score table | X | |
METABOLIC_result.xlsx | The resulting excel file of METABOLIC | X | X |
This spreadsheet has 6 sheets:
- "HMMHitNum" = Presence or absence of custom HMM profiles within each genome, the number of times the HMM profile was identified within a genome, and the scaffold on which the HMM profile was found. The sheet provides a presence/absence indicator, the number of times a protein was identified for a given genome, and the ORF(s) that represent the identified protein.
- "FunctionHit" = Presence or absence of sets of proteins which were identified and displayed as separate proteins in the sheet titled "HMMHitNum". For each genome, the functions are identified as "Present" or "Absence".
- "KEGGModuleHit" = Annotation of each genome with modules from the KEGG database organized by metabolic category. For each genome, the functions are identified as "Present" or "Absence".
- "KEGGModuleStepHit" = Presence or absence of modules from the KEGG database within each genome separated into the steps that make up the module. For each genome, the functions are identified as "Present" or "Absence".
- "dbCAN2Hit" = The dbCAN2 annotation results against all genomes (CAZy numbers and hits). For each genome, there are two distinct columns, which show the number of times a CAZy was identified and what ORF(s) represent the protein.
- "MEROPSHit" = The MEROPS peptidase searching result (MEROPS peptidase numbers and hits). For each genome, there are two distinct columns, which show the number of times a peptidase was identified and what ORF(s) represent the protein.
In all cases, if you scroll down you will see what the "Gn00X" column names refer to (they are based on the fasta file names of the genomes you provided).
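The relationship between the "KEGGModuleStepHit" and "KEGGModuleHit" sheets is governed by the "-m-cutoff" flag: a module is designated as present when a large enough fraction of its steps is present. The sketch below illustrates that rule only; the function name `module_present` is hypothetical, and treating the cutoff as inclusive (>=) is an assumption rather than something this README states.

```python
def module_present(step_hits, cutoff=0.75):
    """Return True if the fraction of present steps reaches the cutoff.

    step_hits is a list of booleans, one per module step.
    The inclusive comparison (>=) is an assumption.
    """
    if not step_hits:
        return False
    fraction = sum(1 for hit in step_hits if hit) / len(step_hits)
    return fraction >= cutoff


# A four-step module with three steps present has a fraction of exactly 0.75.
print(module_present([True, True, True, False]))
# A two-step module with one step present has a fraction of 0.5.
print(module_present([True, False]))
```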
A collection of all amino acid sequences extracted from the input genome .faa files that were identified as matches to the custom HMM profiles provided by METABOLIC.
The KEGG identifier searching result - KEGG identifier numbers and hits of each genome that could be used to visualize the pathways in KEGG Mapper
Both METABOLIC-G.pl and METABOLIC-C.pl will generate a folder titled Nutrient_Cycling_Diagrams/ within the METABOLIC_Figures/ directory, which will contain figures that represent nutrient cycling pathways for Sulfur, Nitrogen, Carbon, and other select pathways found within each genome. METABOLIC-C.pl also has the ability to generate overall community nutrient cycling pathways.
Although the Nutrient_Cycling_Diagrams/ directory is generated by both METABOLIC-G.pl and METABOLIC-C.pl, the files contained within the directory will be dependent on which script is used.
For both programs, METABOLIC-G.pl and METABOLIC-C.pl, the Nutrient_Cycling_Diagrams/ directory will contain the following files:
[GenomeName].draw_sulfur_cycle_single.PDF
[GenomeName].draw_nitrogen_cycle_single.PDF
[GenomeName].draw_other_cycle_single.PDF
[GenomeName].draw_carbon_cycle_single.PDF
A red arrow designates the presence of a pathway step and a black arrow its absence. Note that the width of the arrows does not have any significance.
If you run METABOLIC-C.pl, the software will also calculate relative gene abundances, which will allow for generation of summary diagrams for pathways at a community scale:
draw_sulfur_cycle_total.PDF
draw_other_cycle_total.PDF
draw_nitrogen_cycle_total.PDF
draw_carbon_cycle_total.PDF
Note that the width of the arrows does not have any significance.
> Generated only by METABOLIC-C.pl is a set of figures representing metabolic handoffs within the community:
For the Sequential transformation diagram, we summarize and visualize the number of genomes and the genome coverage (relative abundance) of the microorganisms that are putatively involved in the sequential transformation of important inorganic elements and organic compounds.
The resulting files are Sequential_transformation_01.pdf and Sequential_transformation_02.pdf.
> Generated only by METABOLIC-C.pl is a figure representing energy flow in the community:
For the Metabolic energy flow diagram, a Sankey diagram is generated, representing the fraction of each function that is contributed by the various microbial groups in a given community.
The resulting file is Metabolic_energy_flow.pdf.
> METABOLIC-C.pl generates a figure representing metabolic connections between different reactions that are found within the community:
For the Metabolic network diagrams, diagrams representing the metabolic connections of biogeochemical cycling steps are generated at both the phylum level and the whole-community level.
The resulting files are placed in the directory Metabolic_network/.
For the MN-score result, a table showing the MN-score (Metabolic Networking score) is generated ("MN-score_result.txt"). The first column gives the MN-score for each function. The remaining columns give the percentage contribution of each phylum to the corresponding function.
The resulting files are placed in the directory MN-score_result/.
Notice:
If you use metatranscriptomic reads instead of metagenomic reads in METABOLIC-C, the gene coverage result is replaced by transcript coverage [normalized as Reads Per Kilobase of transcript, per Million mapped reads (RPKM)], and all community analyses are performed on the transcript coverage instead. An additional result file, "All_gene_collections_transcript_coverage.txt", is generated in the output directory.
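For reference, the RPKM normalization named above can be written out as a short function. This follows the standard definition of RPKM and is a sketch only; it is not taken from METABOLIC's code, and the function and parameter names are hypothetical.

```python
def rpkm(mapped_reads, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript, per Million mapped reads.

    RPKM = mapped_reads / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return mapped_reads * 1e9 / (gene_length_bp * total_mapped_reads)


# 100 reads on a 1,000 bp gene in a library of 1,000,000 mapped reads.
print(rpkm(100, 1000, 1_000_000))
```

Dividing by gene length and by library size makes values comparable across genes of different lengths and across samples of different sequencing depth.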
METABOLIC: METabolic And BiogeOchemistry anaLyses In miCrobes (C) 2019
Zhichao Zhou, zczhou2017@gmail.com
Patricia Tran, ptran5@wisc.edu
Karthik Anantharaman, karthik@bact.wisc.edu
Anantharaman Microbiome Laboratory
Department of Bacteriology, University of Wisconsin, Madison
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.