METABOLIC

METabolic And BiogeOchemistry anaLyses In miCrobes
Current Version: 4.0
Tested on: Linux Ubuntu 16.04.6 LTS (GNU/Linux 4.15.0-101-generic x86_64) (June 2020)

This software predicts metabolic and biogeochemical functional trait profiles for any given genome dataset. These genome datasets can be metagenome-assembled genomes (MAGs), single-cell amplified genomes (SAGs), or genomes sequenced from pure cultures. METABOLIC has two main implementations, METABOLIC-G and METABOLIC-C. METABOLIC-G.pl generates metabolic profiles and biogeochemical cycling diagrams for the input genomes and does not require sequencing reads. METABOLIC-C.pl generates the same output as METABOLIC-G.pl but also accepts metagenomic read data, which it uses to calculate genome coverage and to generate information on community metabolism. The information is parsed, and diagrams for elemental/biogeochemical cycling pathways (currently Nitrogen, Carbon, Sulfur and "other") are produced.

Program Name	Program Description
METABOLIC-G.pl	Allows for classification of the metabolic capabilities of input genomes.
METABOLIC-C.pl	Allows for classification of the metabolic capabilities of input genomes, calculation of genome coverage, creation of biogeochemical cycling diagrams, and visualization of community metabolic interactions and energy flow.

If you are using this program, please consider citing our preprint, available on BioRxiv:

Zhou Z, Tran P, Liu Y, Kieft K, Anantharaman K. "METABOLIC: A scalable high-throughput metabolic and biogeochemical functional trait profiler based on microbial genomes" (2019). BioRxiv doi: https://doi.org/10.1101/761643

Version History:

v4.0 -- Jun 22, 2020 --

METABOLIC now uses an R script to generate METABOLIC_result.xlsx, which fixes issues with the generation of a corrupt METABOLIC_result.xlsx file
Test input data now includes five nucleotide fasta files and one set of paired sequencing reads, allowing all capabilities of both METABOLIC-G.pl and METABOLIC-C.pl to be tested
The MN-score table has been provided as one of the results by METABOLIC-C
Updated the motif checking step for pmo/amo, dsrE/tusD, dsrH/tusB, and dsrF/tusC
Updated the "reads-type" option allowing the use of metatranscriptomic reads to conduct community analysis

v3.0 -- Feb 18, 2020 --

Provide an option to let the user reduce the size of Kofam Hmm profiles (only use KOs that can be found in Modules) to speed up the calculation
Change HMMER to v3.3 to speed up the calculation

v2.0 -- Nov 5, 2019 --

Add more functions on visualization, add more annotations, make the software faster

v1.3 -- Sep 5, 2019 --

Fix the output folder problem, the perl script could be called in another place instead of the original place

v1.2 -- Sep 5, 2019 --

Fix the prodigal parallel run, change "working-dir" to "METABOLIC-dir"

v1.1 -- Sep 4, 2019 --

Fix the parallel problem, change from hmmscan to hmmsearch, and update the "METABOLIC_template_and_database"

System Requirements:

System Memory Requirements:

Due to requirements of some of this program's dependencies, it is highly recommended that METABOLIC-C is run on a system containing at least 100 Gb of memory.
METABOLIC-G is not as demanding as METABOLIC-C and requires significantly less memory to run.

System Storage Requirements:

If you are planning to use only METABOLIC-G, you don't need to install GTDB-Tk.

Necessary Databases	Approximate System Storage Required
METABOLIC program with unzipped files	7.69 Gb
GTDB-Tk Reference Data	28 Gb
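
Before downloading, it can help to confirm that the target disk has room for both entries in the table (7.69 + 28, roughly 36 Gb). The sketch below is a minimal check under stated assumptions: has_free_space is a hypothetical helper that is not shipped with METABOLIC, it relies on the POSIX "df -Pk" output layout, and it treats the table's "Gb" as GiB.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 GiB free.
# (has_free_space is an illustrative helper, not part of METABOLIC.)
has_free_space() {
    # Column 4 of the second line of "df -Pk" is the available space in KiB.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    need_kb=$(( $2 * 1024 * 1024 ))
    [ "$avail_kb" -ge "$need_kb" ]
}

# 7.69 + 28 is about 36; reading "Gb" as GiB is an assumption.
if has_free_space . 36; then
    echo "enough free space for METABOLIC and the GTDB-Tk reference data"
else
    echo "less than 36 GiB free in the current directory"
fi
```

If you only plan to use METABOLIC-G, the GTDB-Tk reference data is not needed and a threshold of 8 would be the corresponding value.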

Dependencies Overview:

Perl (>= v5.010)
HMMER (>= v3.1b2)
Prodigal (>= v2.6.3)
Sambamba (>= v0.7.0) (only for METABOLIC-C)
BAMtools (>= v2.4.0) (only for METABOLIC-C)
CoverM (only for METABOLIC-C)
R (>= 3.6.0)
Diamond
Samtools (only for METABOLIC-C)
Bowtie2 (only for METABOLIC-C)
GTDB-Tk (only for METABOLIC-C)

Each of these programs should be in the PATH so that they can be accessed regardless of location.
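
One way to confirm this before running METABOLIC is to ask the shell where each program lives. The sketch below is a minimal helper and not part of METABOLIC: the function name missing_tools is hypothetical, and the executable names passed to it (hmmsearch for HMMER, Rscript for R, gtdbtk for GTDB-Tk, and so on) are assumptions about how each dependency is installed on your system.

```shell
# Print every command name that cannot be found on PATH, one per line.
# (missing_tools is an illustrative helper, not shipped with METABOLIC.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names are assumptions; adjust them to your installation.
missing_tools perl hmmsearch prodigal Rscript diamond
# Additional tools used only by METABOLIC-C:
missing_tools sambamba bamtools coverm samtools bowtie2 gtdbtk
```

No output means every listed program was found. The check does not compare versions, so the minimum versions listed above still need to be verified separately.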

Perl and R Dependencies Detailed Instructions:

Perl Modules:
To install, use the cpan shell by entering "perl -MCPAN -e shell" and then entering
"install [Module Name]" at the cpan prompt, or install by using "cpan -i [Module Name]", or by entering
"cpanm [Module Name]".

Example 1:
perl -MCPAN -e shell
install Data::Dumper

Example 2:
cpan -i Data::Dumper

Example 3:
cpanm Data::Dumper

1. Data::Dumper
2. POSIX
3. Getopt::Long
4. Statistics::Descriptive
5. Array::Split
6. Bio::SeqIO
7. Bio::Perl
8. Bio::Tools::CodonTable
9. Carp
10. File::Spec
11. File::Basename
12. Parallel::ForkManager
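
To see which of these modules still need to be installed, each one can be loaded in a throwaway Perl process. This is a minimal sketch, not part of METABOLIC; missing_perl_modules is a hypothetical helper name. If perl itself is not on the PATH, every module is reported as missing.

```shell
# Print every Perl module that fails to load, one per line.
# (missing_perl_modules is an illustrative helper, not part of METABOLIC.)
missing_perl_modules() {
    for module in "$@"; do
        # "perl -MName -e 1" exits non-zero when the module cannot be loaded.
        perl -M"$module" -e 1 >/dev/null 2>&1 || printf '%s\n' "$module"
    done
}

missing_perl_modules Data::Dumper POSIX Getopt::Long Statistics::Descriptive \
    Array::Split Bio::SeqIO Bio::Perl Bio::Tools::CodonTable Carp \
    File::Spec File::Basename Parallel::ForkManager
```

Each name printed can then be installed with any of the three methods shown in the examples above.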

Note for later: The last three lines in "run_to_setup.sh" are used to download "METABOLIC_test_files.tgz" from a Google Drive. It requires gdown. gdown can be simply installed by calling "pip install gdown".

R Packages
To install, open the R command line interface by entering "R" into the command line, and then enter
"install.packages("[Package Name]")".

Example:
R
install.packages("diagram")
q()

1. diagram (v1.6.4)
2. forcats (v0.5.0)
3. digest (v0.6.25)
4. htmltools (v0.4.0)
5. rmarkdown (v2.1)
6. reprex (v0.3.0)
7. tidyverse (v1.3.0)
8. ggthemes (v4.2.0)
9. ggalluvial (v0.11.3)
10. reshape2 (v1.4.3)
11. ggraph (v2.0.2)
12. pdftools (v2.3)
13. igraph (v1.2.5)
14. tidygraph (v1.1.2)
15. stringr (v1.4.0)
16. plyr (v1.8.6)
17. dplyr (v0.8.5)
18. openxlsx (v4.1.4)
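
The R packages can be checked from the command line without opening an interactive R session. The sketch below assumes that Rscript is on the PATH; missing_r_packages is a hypothetical helper name, and the check only tests whether a package is installed, not whether it matches the version listed above.

```shell
# Print every R package that is not installed, one per line.
# (missing_r_packages is an illustrative helper, not part of METABOLIC.)
missing_r_packages() {
    for pkg in "$@"; do
        Rscript -e "if (requireNamespace('$pkg', quietly = TRUE)) quit(status = 0) else quit(status = 1)" \
            >/dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}

missing_r_packages diagram forcats digest htmltools rmarkdown reprex tidyverse \
    ggthemes ggalluvial reshape2 ggraph pdftools igraph tidygraph stringr \
    plyr dplyr openxlsx
```

If Rscript is not available, every package is reported as missing.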

To ensure efficient and successful installation of METABOLIC, make sure that all dependencies are properly installed prior to download of the METABOLIC software.

Installation Instructions:

Go to the directory where you want the program to be and clone the GitHub repository by using the following command:

git clone https://github.com/AnantharamanLab/METABOLIC.git

or click the green button at the top of the GitHub page, choose "Download ZIP", and unzip the downloaded file.
The Perl and R scripts and dependent databases should be kept in the same directory.

NOTE: Before following the next step, make sure your working directory is the directory that was created by the METABOLIC download, that is, the directory containing the main scripts for METABOLIC (METABOLIC-G.pl, METABOLIC-C.pl, etc.).

Quick Installation:

We provide a "run_to_setup.sh" script along with the data downloaded from GitHub for easy setup of the dependent databases. This can be run by using the following command:

sh run_to_setup.sh

Notice: The last three lines in "run_to_setup.sh" download "METABOLIC_test_files.tgz" from Google Drive. This requires gdown, which can be installed by calling "pip install gdown". Please also refer to the gdown documentation.

Running METABOLIC:

All Required and Optional Flags:

To view the options that METABOLIC-C.pl and METABOLIC-G.pl have, please type:

perl METABOLIC-G.pl -help
perl METABOLIC-C.pl -help

-in-gn [required if you are starting from nucleotide fasta files] Defines the location of the FOLDER containing the genome nucleotide fasta files ending with ".fasta" to be run by this program
-in [required if you are starting from faa files] Defines the location of the FOLDER containing the genome amino acid files ending with ".faa" to be run by this program
-r [required for METABOLIC-C] Defines the path to a text file containing the locations of the paired reads
-rt [optional] Defines the option to use "metaG" or "metaT" to indicate whether you use the metagenomic reads or metatranscriptomic reads (default: 'metaG')
-t [optional] Defines the number of threads to run the program with (Default: 20)
-m-cutoff [optional] Defines the fraction of KEGG module steps present to designate a KEGG module as present (Default: 0.75)
-kofam-db [optional] Defines the use of the full ("full") or reduced ("small") KOfam database by the program (Default: 'full')
-p [optional] Defines the prodigal method used to annotate ORFs ("meta" or "single")
-o [optional] Defines the output directory to be created by the program (Default: current directory)
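
To make the "-m-cutoff" flag concrete: with the default of 0.75, a module made of four steps would need three of them detected to count as present. The sketch below only illustrates that arithmetic. module_present is a hypothetical helper, and comparing with "greater than or equal to" is an assumption, since the flag description does not say how a fraction exactly at the cutoff is treated.

```shell
# Succeed if steps_present / steps_total reaches the cutoff.
# $1 = steps present, $2 = total steps, $3 = cutoff (e.g. 0.75)
# (module_present is an illustrative helper, not part of METABOLIC.)
module_present() {
    awk -v p="$1" -v n="$2" -v c="$3" 'BEGIN { exit !(n > 0 && p / n >= c) }'
}

if module_present 3 4 0.75; then echo "present"; else echo "absent"; fi
if module_present 2 4 0.75; then echo "present"; else echo "absent"; fi
```

Under these assumptions the first call reports "present" (3/4 = 0.75) and the second reports "absent" (2/4 = 0.5).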

The directory specified by the "-in-gn" flag should contain the nucleotide sequences for your genomes, with the file extension ".fasta". If you are supplying amino acid fasta files for each genome instead, they should be contained within a directory and have the file extension ".faa", and you will use the "-in" option. Ensure that the fasta headers in each .fasta or .faa file are unique and that your file names do not contain spaces. METABOLIC-C accepts only nucleotide ".fasta" files as genome input.
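
The file name and header requirements can be checked before a run. The following is a minimal sketch, not part of METABOLIC: check_genome_dir is a hypothetical helper, and it flags headers that repeat anywhere in the folder, which is a stricter reading of the uniqueness requirement than checking each file on its own.

```shell
# Report .fasta file names containing spaces and fasta headers that occur
# more than once among the .fasta files in directory $1.
# (check_genome_dir is an illustrative helper, not part of METABOLIC.)
check_genome_dir() {
    genome_dir=$1
    for f in "$genome_dir"/*.fasta; do
        [ -e "$f" ] || continue            # the glob matched nothing
        case ${f##*/} in
            *" "*) printf 'space in file name: %s\n' "${f##*/}" ;;
        esac
    done
    for f in "$genome_dir"/*.fasta; do
        [ -e "$f" ] || continue
        sed -n '/^>/p' "$f"                # header lines only
    done | sort | uniq -d | sed 's/^/duplicate header: /'
}

# Example call (the folder name is hypothetical):
# check_genome_dir ./genomes
```

No output means no problems were found. For amino acid input, the same check applies with ".faa" in place of ".fasta".
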
The "-r" flag takes a text file listing the paths of the metagenomic reads (if running METABOLIC-C). These are the metagenomic read datasets that you used to generate the MAGs. Make sure that the fastq files are uncompressed before you run METABOLIC-C. Each set of paired reads is entered on one line, with the two files separated by a ",". A sample of this text file is as follows:

#Read pairs: 
SRR3577362_sub_1.fastq,SRR3577362_sub_2.fastq
SRR3577362_sub2_1.fastq,SRR3577362_sub2_2.fastq

Note that the two sets of paired reads are separated by a line return (new line). Avoid empty lines in this text file; otherwise the software will take blank read files as inputs.
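
A reads list in this format can be generated rather than typed by hand. The sketch below is a hypothetical helper (make_reads_list is not part of METABOLIC) and assumes that the pairs follow the "_1.fastq" / "_2.fastq" naming used in the sample, and that a leading "#" line is accepted as in the sample. It writes one pair per line and no empty lines.

```shell
# Print a reads list for every *_1.fastq / *_2.fastq pair in directory $1.
# (make_reads_list is an illustrative helper, not part of METABOLIC.)
make_reads_list() {
    reads_dir=$1
    printf '#Read pairs:\n'
    for r1 in "$reads_dir"/*_1.fastq; do
        [ -e "$r1" ] || continue           # the glob matched nothing
        r2="${r1%_1.fastq}_2.fastq"
        if [ -e "$r2" ]; then
            printf '%s,%s\n' "$r1" "$r2"
        else
            printf 'warning: no mate found for %s\n' "$r1" >&2
        fi
    done
}

# Example call (folder and file names are hypothetical):
# make_reads_list ./reads > reads_list.txt
```

The resulting file can then be passed to the "-r" flag. Reads without a mate are skipped with a warning on standard error, so they never end up as half-empty lines in the list.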

Running Test Data:

The main METABOLIC directory also contains a set of 5 genomes and one set of paired metagenomic reads, which can be used to test that METABOLIC-G and METABOLIC-C were installed correctly. These genomes and reads can be found within the directory METABOLIC_test_files/, which is contained within the METABOLIC program directory.

METABOLIC-C.pl and METABOLIC-G.pl can be run with the test data by using the "-test true" option:

perl METABOLIC-G.pl -test true

perl METABOLIC-C.pl -test true

How To Run METABOLIC:

The main scripts that should be used to run the program are METABOLIC-G.pl or METABOLIC-C.pl.

In order to run METABOLIC-G starting from nucleotide sequences, AT LEAST the following flags should be used for METABOLIC-G:

perl METABOLIC-G.pl -in-gn [path_to_folder_with_genome_files] -o [output_directory_to_be_created]

In order to run METABOLIC-G starting from amino acid sequences, AT LEAST the following flags should be used for METABOLIC-G:

perl METABOLIC-G.pl -in [path_to_folder_with_genome_files] -o [output_directory_to_be_created]

In order to run METABOLIC-C, AT LEAST the following flags should be used for METABOLIC-C:

perl METABOLIC-C.pl -in-gn [path_to_folder_with_genome_files] -r [path_to_list_of_paired_reads] -o [output_directory_to_be_created]
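
The choice between the two scripts comes down to whether a reads list is available. As a rough sketch of that decision, the hypothetical helper below (metabolic_cmd is not part of METABOLIC) prints the minimal command for nucleotide input without running it; it uses only the flags shown above.

```shell
# Print (do not run) the minimal METABOLIC command for the given inputs.
# $1 = genome folder, $2 = output folder, $3 = optional reads list
# (metabolic_cmd is an illustrative helper, not part of METABOLIC.)
metabolic_cmd() {
    if [ -n "${3:-}" ]; then
        printf 'perl METABOLIC-C.pl -in-gn %s -r %s -o %s\n' "$1" "$3" "$2"
    else
        printf 'perl METABOLIC-G.pl -in-gn %s -o %s\n' "$1" "$2"
    fi
}

# Folder and file names below are hypothetical.
metabolic_cmd ./genomes ./METABOLIC_out
metabolic_cmd ./genomes ./METABOLIC_out ./reads_list.txt
```

The first call prints a METABOLIC-G command and the second a METABOLIC-C command; optional flags such as "-t" or "-kofam-db" can be appended to either.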

METABOLIC Output Files:

Output Files Overview:

Output File	File Description	Generated by METABOLIC-C	Generated by METABOLIC-G
All_gene_collections_mapped.depth.txt	The gene depth of all input genes	X
Each_HMM_Amino_Acid_Sequence/	The faa collection for each hmm file	X	X
intermediate_files/	The hmmsearch, peptides (MEROPS), and CAZymes (dbCAN2) running intermediate files	X	X
KEGG_identifier_result/	The hit and result of each genome by Kofam database	X	X
METABOLIC_Figures/	All figures output from the running of METABOLIC	X	X
METABOLIC_Figures_Input/	All input files for R-generated diagrams	X	X
METABOLIC_result_each_spreadsheet/	TSV files representing each sheet of the created METABOLIC_result.xlsx file	X	X
MN-score_result/	The resulted table for MN-score	X
METABOLIC_result.xlsx	The resulting excel file of METABOLIC	X	X
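
After a run, the table above can be used as a checklist. The sketch below is a hypothetical helper (missing_outputs is not part of METABOLIC) that reports which of the outputs common to both programs are absent from an output directory; it does not look for the METABOLIC-C-only entries.

```shell
# Print the outputs shared by METABOLIC-G and METABOLIC-C that are absent
# from output directory $1, one per line.
# (missing_outputs is an illustrative helper, not part of METABOLIC.)
missing_outputs() {
    for item in Each_HMM_Amino_Acid_Sequence intermediate_files \
        KEGG_identifier_result METABOLIC_Figures METABOLIC_Figures_Input \
        METABOLIC_result_each_spreadsheet METABOLIC_result.xlsx; do
        [ -e "$1/$item" ] || printf '%s\n' "$item"
    done
}

# Example call (the output folder name is hypothetical):
# missing_outputs ./METABOLIC_out
```

No output means every shared file and folder from the table is in place.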

Output Files Detailed:

METABOLIC result table (METABOLIC_result.xlsx)

This spreadsheet has 6 sheets:

"HMMHitNum" = Presence or absence of custom HMM profiles within each genome, the number of times the HMM profile was identified within a genome, and the scaffold on which the HMM profile was found. The sheet provides a presence/absence indicator, the number of times a protein was identified for a given genome, and the ORF(s) that represent the identified protein.
"FunctionHit" = Presence or absence of sets of proteins which were identified and displayed as separate proteins in the sheet titled "HMMHitNum". For each genome, the functions are identified as "Present" or "Absence".
"KEGGModuleHit" = Annotation of each genome with modules from the KEGG database organized by metabolic category. For each genome, the functions are identified as "Present" or "Absence".
"KEGGModuleStepHit" = Presence or absence of modules from the KEGG database within each genome separated into the steps that make up the module. For each genome, the functions are identified as "Present" or "Absence".
"dbCAN2Hit" = The dbCAN2 annotation results against all genomes (CAZy numbers and hits). For each genome, there are two distinct columns, which show the number of times a CAZy was identified and what ORF(s) represent the protein.
"MEROPSHit" = The MEROPS peptidase searching result (MEROPS peptidase numbers and hits). For each genome, there are two distinct columns, which show the number of times a peptidase was identified and what ORF(s) represent the protein.

In all cases, if you scroll down you will see what the "Gn00X" column names refer to (they are based on the fasta file names of the genomes you provided).

Each HMM Profile Hit Amino Acid Sequence Collection (Each_HMM_Amino_Acid_Sequence/)

A collection of all amino acid sequences extracted from the input genome .faa files that were identified as matches to the custom HMM profiles provided by METABOLIC.

KEGG identifier results (KEGG_identifier_result/)

The KEGG identifier search results: KEGG identifier numbers and hits for each genome, which can be used to visualize the pathways in KEGG Mapper.

All figures generated by METABOLIC-G.pl and METABOLIC-C.pl (METABOLIC_Figures/)

Both METABOLIC-G.pl and METABOLIC-C.pl will generate a folder titled Nutrient_Cycling_Diagrams/ within the METABOLIC_Figures/ directory, which will contain figures that represent nutrient cycling pathways for Sulfur, Nitrogen, Carbon, and other select pathways found within each genome. METABOLIC-C.pl also has the ability to generate overall community nutrient cycling pathways.

Although the Nutrient_Cycling_Diagrams/ directory is generated by both METABOLIC-G.pl and METABOLIC-C.pl, the files contained within the directory will be dependent on which script is used.

For both programs, METABOLIC-G.pl and METABOLIC-C.pl, the Nutrient_Cycling_Diagrams/ directory will contain the following files:

  [GenomeName].draw_sulfur_cycle_single.PDF
  [GenomeName].draw_nitrogen_cycle_single.PDF
  [GenomeName].draw_other_cycle_single.PDF
  [GenomeName].draw_carbon_cycle_single.PDF

A red arrow designates the presence of a pathway step and a black arrow its absence. Note that the width of the arrows does not have any significance.

If you run METABOLIC-C.pl, the software will also calculate relative gene abundances, which will allow for generation of summary diagrams for pathways at a community scale:

  draw_sulfur_cycle_total.PDF
  draw_other_cycle_total.PDF
  draw_nitrogen_cycle_total.PDF
  draw_carbon_cycle_total.PDF

Note that the width of the arrows does not have any significance.

> A set of figures representing metabolic handoffs within the community is generated only by METABOLIC-C.pl:

For the Sequential transformation diagram, we summarize and visualize the number of genomes and the genome coverage (relative abundance) of the microorganisms putatively involved in the sequential transformation of both important inorganic elements and organic compounds.

The resulting files are Sequential_transformation_01.pdf and Sequential_transformation_02.pdf.

> A figure representing energy flow in the community is generated only by METABOLIC-C.pl:

For the Metabolic energy flow diagram, a Sankey diagram is generated, representing the fractions of each function that are contributed by the various microbial groups in a given community.

The resulting file is Metabolic_energy_flow.pdf.

> METABOLIC-C.pl generates figures representing metabolic connections between the different reactions found within the community:

For the Metabolic network diagrams, diagrams representing the metabolic connections of biogeochemical cycling steps are generated at both the phylum level and the whole-community level.

The resulting files are placed in the directory Metabolic_network/.

For the MN-score result, a table showing the MN-score (Metabolic Networking score) is generated ("MN-score_result.txt"). The first column gives the MN-score for each function. The rest of the table gives the percentage that each phylum contributes to the corresponding function. An example is given in MN-score_table_example.jpg.

The resulting files are placed in the directory MN-score_result/.

Notice:

If you use metatranscriptomic reads instead of metagenomic reads in METABOLIC-C, the gene coverage result is replaced by transcript coverage [normalized as Reads Per Kilobase of transcript, per Million mapped reads (RPKM)], and all community analyses are based on the transcript coverage instead. An additional result file, "All_gene_collections_transcript_coverage.txt", is generated in the output directory.
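
For readers unfamiliar with RPKM, the sketch below works through the standard definition on made-up numbers. rpkm is a hypothetical helper, and whether METABOLIC-C computes the value in exactly this way is an assumption; only the general definition of the unit is shown.

```shell
# RPKM = reads mapped to the gene * 1e9 / (gene length in bp * total mapped reads)
# $1 = reads mapped to the gene, $2 = gene length (bp), $3 = total mapped reads
# (rpkm is an illustrative helper, not part of METABOLIC.)
rpkm() {
    awk -v r="$1" -v len="$2" -v total="$3" \
        'BEGIN { printf "%.2f\n", r * 1e9 / (len * total) }'
}

# Made-up example: 500 reads on a 2000 bp gene, 10 million mapped reads.
rpkm 500 2000 10000000   # prints 25.00
```

Dividing by both the gene length and the total number of mapped reads is what makes values comparable between genes of different lengths and between samples of different sequencing depth.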

Copyright

METABOLIC: METabolic And BiogeOchemistry anaLyses In miCrobes (C) 2019
Zhichao Zhou, zczhou2017@gmail.com
Patricia Tran, ptran5@wisc.edu
Karthik Anantharaman, karthik@bact.wisc.edu
Anantharaman Microbiome Laboratory
Department of Bacteriology, University of Wisconsin, Madison

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
