Mg-Traits is a command line application programmed in BASH, AWK, and R, dedicated to the computation of functional traits at the metagenome level (i.e., functional aggregated traits), ranging from GC variance and amino acid composition to functional diversity and average genome size. It takes as input a preprocessed (unassembled) metagenomic sample and outputs the computed metagenomic traits organized in different tables and grouped in separate folders according to the type of data source (see Fig. 1).
Mg-Traits allows the systematic computation of a comprehensive set of metagenomic functional traits, which can be used to generate a functional and taxonomic fingerprint and reveal the predominant life-history strategies and ecological processes in a microbial community. Mg-Traits contributes to improving the exploitation of metagenomic data and facilitates comparative and quantitative studies. Considering the high genomic plasticity of microorganisms and their capacity to rapidly adapt to changing environmental conditions, Mg-Traits constitutes a valuable tool to monitor environmental systems.
Mg-Traits is simple to run! On Linux, you can get started with the commands below. Make sure the Docker runtime is active when you run them. Note that the first run downloads a Docker image, which may take a few minutes.
wget https://raw.githubusercontent.com/new-atlantis-labs/Mg-Traits/main/run_mg_traits.sh
chmod +x run_mg_traits.sh
./run_mg_traits.sh . . --help
Congratulations, you can now use Mg-Traits!
Looking to build Mg-Traits locally? Follow these steps. This route is only recommended for those looking to develop on top of Mg-Traits. The NewAtlantis container registry is recommended for most use cases.
First clone the repository and enter it.
git clone https://github.com/new-atlantis-labs/Mg-Traits.git
cd Mg-Traits
Then navigate into the folder cont_env and build the Docker image from the Dockerfile:
cd cont_env
docker build -t mg-traits-local:1.0 .
Usage: ./mg_traits.sh <input file> <output dir> <options>
--help                    print this help
--caz_subfam_annot t|f    annotate CAZyme subfamilies (default f)
--clean t|f               remove intermediate files (i.e., *.info, *.ffn, *.faa, *.hout) (default f)
--confidence NUM          confidence value to run rdp bayes classifier (from 0 to 100; default 50)
--evalue_acn NUM          evalue to filter reads for ACN computation (default 1e-15)
--evalue_div NUM          evalue to filter reads for diversity estimation (default 1e-15)
--evalue_res NUM          evalue to annotate ResFam with hmmsearch (default 1e-15)
--evalue_caz_fam NUM      evalue to annotate CAZyme families with hmmsearch (default 1e-15)
--evalue_caz_subfam NUM   evalue to annotate CAZyme subfamilies with hmmsearch (default 1e-15)
--evalue_hyd NUM          evalue to annotate Hyd with hmmsearch (default 1e-15)
--evalue_ncy NUM          evalue to annotate NCycle with diamond (default 1e-15)
--evalue_pcy NUM          evalue to annotate PCycle with diamond (default 1e-15)
--evalue_pls NUM          evalue to annotate Plastic DB with diamond (default 1e-15)
--nslots NUM              number of threads used (default 12)
--max_length NUM          maximum read length used to trim reads (from the 3' end) for AGS computation (default 180)
--min_length NUM          minimum read length used to estimate taxonomic diversity (default 100)
--overwrite t|f           overwrite previous directory (default f)
--ref_db CHAR             reference database to run NBC (default silva_nr99_v138_train_set.fa.gz)
--sample_name CHAR        sample name (default metagenomex)
--train_file_name CHAR    train file name to run FragGeneScan, see FragGeneScan help for options (default illumina_5)
--verbose t|f             reduced verbose (default t)
--verbose_all t|f         complete verbose (default f)
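To illustrate how these options combine when processing several samples, here is a minimal sketch that prints one command per FASTA file found in a directory. It assumes that run_mg_traits.sh accepts the same positional arguments and options as mg_traits.sh, which the --help call above suggests but does not confirm. The helper name mg_traits_batch, the .fasta extension, and the one-directory-per-sample layout are hypothetical, and paths are assumed not to contain single quotes.

```shell
#!/usr/bin/env bash
# Sketch: print one Mg-Traits command per FASTA file in a directory.
# Assumption: run_mg_traits.sh takes <input file> <output dir> <options>
# in the same way as mg_traits.sh. mg_traits_batch is a hypothetical name.
mg_traits_batch() {
  local in_dir="$1" out_dir="$2" nslots="${3:-12}"
  local fasta sample
  for fasta in "$in_dir"/*.fasta; do
    [ -e "$fasta" ] || continue            # skip the literal glob if nothing matched
    sample="$(basename "$fasta" .fasta)"   # derive the sample name from the file name
    # Only print the command; nothing is executed here.
    printf "./run_mg_traits.sh '%s' '%s' --sample_name '%s' --nslots %s --clean t\n" \
      "$fasta" "$out_dir/$sample" "$sample" "$nslots"
  done
}
```

Printing rather than executing makes it easy to review the commands first; they could then be run with, for example, mg_traits_batch reads results 8 | bash.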
<input file>: FASTA file used to compute mg-traits.
<output dir>: Output directory to store all computed mg-traits.
All files including computed traits have the format (tab separated): <sample name> <trait> <value>
This allows a straightforward concatenation of any specific trait computed in different samples.
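As a sketch of that concatenation, the helper below merges one trait table across samples. It assumes each sample was written to its own directory under a common parent (for example results/<sample>/nuc/<sample>_gc_stats.tsv); that layout and the name collect_trait are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: merge one trait table across samples into a single long table.
# Assumption: outputs live in <parent>/<sample>/<folder>/<sample><suffix>.
# collect_trait is a hypothetical helper name.
collect_trait() {
  local parent="$1" folder="$2" suffix="$3"
  # Each file already has the columns <sample name> <trait> <value>, so
  # plain concatenation is enough. "awk 1" is used instead of cat so that
  # a file lacking a final newline does not glue two rows together.
  awk 1 "$parent"/*/"$folder"/*"$suffix"
}
```

For example, collect_trait results nuc _gc_stats.tsv > all_gc_stats.tsv would gather the GC statistics of every sample under results into one file.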
The computed traits are organized in 13 different folders, as shown below.
For each folder, we added a comment specifying the traits that are included.
.
├── acn  # Average 16S rRNA gene copy number (ACN)
│   ├── sample_acn.tsv
│   ├── sample_smrna.blast
│   ├── sample_smrna.fa
│   └── sample_smrna.log
├── ags  # Average genome size (AGS)
│   ├── sample_ags.tsv
│   ├── sample_single_cogs_count.tsv
│   └── sample_uout.csv
├── bgc  # Biosynthetic Gene Cluster domains (uproc)
│   ├── sample_bgc_annot.tsv
│   ├── sample_bgc_stats.tsv
│   └── sample.uout
├── caz  # Carbohydrate active enzymes (CAZymes families and subfamilies) (hmmsearch)
│   ├── sample_caz_fam_annot.tsv
│   ├── sample_caz_fam.domtblout
│   ├── sample_caz_fam.hout
│   ├── sample_caz_fam_stats.tsv
│   ├── sample_caz_sub_annot.tsv
│   ├── sample_caz_sub.domtblout
│   ├── sample_caz_sub.hout
│   └── sample_caz_sub_stats.tsv
├── fun  # Pfam (uproc)
│   ├── sample_fun_annot.tsv
│   ├── sample_fun_stats.tsv
│   └── sample.uout
├── hyd  # Hydrocarbon degradation enzymes (hmmsearch)
│   ├── sample.domtblout
│   ├── sample.hout
│   ├── sample_hyd_annot.tsv
│   └── sample_hyd_stats.tsv
├── ncy  # Nitrogen cycling genes (diamond)
│   ├── sample.blout
│   ├── sample_ncy_annot.tsv
│   └── sample_ncy_stats.tsv
├── nuc  # Nucleotide composition
│   ├── sample.compseq
│   ├── sample_gc_stats.tsv
│   ├── sample.info.gz
│   └── sample_nuc_comp
├── orf  # Open Reading Frames (FragGeneScanRs)
│   ├── sample_aa_comp.tsv
│   ├── sample_codon_comp.tsv
│   ├── sample.cusp
│   ├── sample.faa.gz
│   ├── sample.ffn.gz
│   └── sample_orf_stats.tsv
├── pcy  # Phosphorus cycling genes (diamond)
│   ├── sample.blout
│   ├── sample_pcy_annot.tsv
│   └── sample_pcy_stats.tsv
├── pls  # Plastic degradation enzymes (diamond)
│   ├── sample.blout
│   ├── sample_pls_annot.tsv
│   └── sample_pls_stats.tsv
├── res  # Antibiotic resistance genes (hmmsearch)
│   ├── sample.domtblout
│   ├── sample.hout
│   ├── sample_res_annot.tsv
│   └── sample_res_stats.tsv
└── tax  # Taxonomy (naive bayes classifier)
    ├── sample_centroids.fasta
    ├── sample_div.tsv
    ├── sample_sample2otu2abund2taxa.tsv
    ├── sample_subseq.fasta
    └── sample.uclust
The acn and ags outputs are explained here.
Functional composition (i.e., bgc, caz, fun, hyd, ncy, pcy, pls, and res):
The *_stats.tsv files contain the Shannon diversity, richness, and percentage of ORFs annotated.
The *_annot.tsv files are the gene count tables.
The *.domtblout and *.hout files are the hmmsearch outputs, and the *.uout and *.blout files are the UProC and diamond outputs, respectively.
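To make the diversity and richness values concrete, the sketch below computes the Shannon index H = -sum(p_i * ln p_i) and the richness (number of entries with a non-zero count) from a three-column counts table. It assumes the table follows the <sample name> <trait> <value> layout with no header line and that the natural logarithm is used; the README does not state the logarithm base, so the pipeline's own values may differ. The name shannon_richness is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: Shannon diversity and richness from a counts table.
# Assumptions: tab-separated columns <sample name> <trait> <count>, no
# header, natural logarithm. shannon_richness is a hypothetical name.
shannon_richness() {
  awk -F'\t' '
    ($3 + 0) > 0 { count[$2] += $3; total += $3 }   # keep positive counts only
    END {
      if (total == 0) { print "shannon\tNA"; print "richness\t0"; exit }
      for (g in count) {
        p = count[g] / total      # relative abundance of this entry
        h -= p * log(p)           # accumulate -p * ln(p)
        n++                       # richness: number of distinct entries
      }
      printf "shannon\t%.4f\nrichness\t%d\n", h, n
    }' "$1"
}
```

A table with two entries of equal count gives H = ln 2, which is a quick sanity check when comparing against a *_stats.tsv file.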
Nucleotide composition (i.e., nuc):
*.compseq is the compseq (EMBOSS) output.
*_nuc_comp is the tetranucleotide tab formatted output.
*.info.gz is the infoseq (EMBOSS) output.
*_gc_stats.tsv contains the GC mean and variance.
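As an illustration of what the GC mean and variance summarize, the sketch below computes them directly from a FASTA file with awk. Mg-Traits derives these values from the infoseq (EMBOSS) output, so this is an independent approximation, not the pipeline's code. It assumes the statistics are taken over per-read GC fractions, uses the population variance, and ignores N and other ambiguity codes. The function name gc_stats and the row labels GC_mean and GC_variance are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: mean and variance of the per-read GC fraction of a FASTA file.
# Assumptions: population variance over reads; only A, C, G, T are counted.
# gc_stats and the output labels are hypothetical names.
gc_stats() {
  awk '
    function close_read() {
      # Fold the read collected so far into the running sums.
      if (acgt > 0) { f = gc / acgt; sum += f; sumsq += f * f; n++ }
      gc = 0; acgt = 0
    }
    /^>/ { close_read(); next }          # a header line ends the previous read
    {
      seq = toupper($0)
      g = gsub(/[GC]/, "", seq)          # gsub returns the number of replacements
      a = gsub(/[AT]/, "", seq)
      gc += g; acgt += g + a             # sequences may span several lines
    }
    END {
      close_read()
      if (n == 0) exit 1                 # no reads with countable bases
      mean = sum / n
      printf "GC_mean\t%.4f\nGC_variance\t%.4f\n", mean, sumsq / n - mean * mean
    }' "$1"
}
```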
Open Reading Frames annotation (i.e., orf):
*_aa_comp.tsv is the amino acid composition.
*_codon_comp.tsv is the codon composition.
*.cusp is the cusp (EMBOSS) output.
*.faa.gz are the ORF amino acid sequences.
*.ffn.gz are the ORF nucleotide sequences.
Taxonomic annotation (i.e., tax):
*_centroids.fasta are the cluster centroid sequences.
*_div.tsv are diversity and richness estimates.
*_sample2otu2abund2taxa.tsv is the taxonomic annotation output.
*_subseq.fasta are the 16S rRNA gene sequences extracted from the reads.
*.uclust is the uclust output.
Figure 1. Mg-Traits pipeline. The metagenomic traits computed by the Mg-Traits pipeline are divided into four groups. The first includes the metagenomic traits computed at the nucleotide level: GC content, GC variance, and tetranucleotide frequency. The second group includes the traits obtained from the open reading frame (ORF) sequence data: ORFs to base pairs (BPs) ratio, codon frequency, amino acid frequency, and acidic to basic amino acid ratio. The third group is based on the functional annotation of the ORF amino acid sequences. For this, we use Pfam and another seven specialized databases: Biosynthetic Gene Cluster (BGC) domains, Resfams, CANT-HYD, NCyc, PCyc, PlasticDB, and CAZymes. For each reference database, we compute the composition, diversity, richness, and percentage of annotated genes. Additionally, this group includes the percentage of transcription factors (TFs) and the average genome size (AGS). Lastly, the fourth group includes the taxonomy-related metagenomic traits: average copy number of 16S rRNA genes (ACN), taxonomic composition, diversity, and richness.
├── LICENSE
├── README.md  <- The top-level README for developers using this project.
├── cont_env
│   ├── Dockerfile
│   ├── resources
│   │   ├── Pfam_v28.0_acc.txt
│   │   ├── PlasticDB.fasta.gz
│   │   ├── TF.txt
│   │   └── all_cog_lengths.tsv
│   └── software
│       └── mg_traits
│           ├── conf.sh
│           ├── funs.sh
│           ├── mg_traits.sh
│           ├── modules
│           │   ├── module10_pcy_mg_traits.sh
│           │   ├── module11_pls_mg_traits.sh
│           │   ├── module1_nuc_mg_traits.sh
│           │   ├── module2_orf_mg_traits.sh
│           │   ├── module3_fun_mg_traits.sh
│           │   ├── module4_tax_mg_traits.sh
│           │   ├── module5_res_mg_traits.sh
│           │   ├── module6_bgc_mg_traits.sh
│           │   ├── module7_caz_mg_traits.sh
│           │   ├── module8_hyd_mg_traits.sh
│           │   └── module9_ncy_mg_traits.sh
│           └── toolbox
│               ├── acn.sh
│               ├── ags.sh
│               ├── taxa_annot_DADA2.R
│               └── taxa_annot_rRDP.R
├── figures
│   ├── Mg-Traits2.png
│   └── Mg_Traits-ENG.png
└── run_mg_traits.sh
Mg-Traits utilizes the following tools:
AGS and ACN tools
BBTools
DADA2
diamond
EMBOSS
FragGeneScanRs
HMMER
R
seqtk
SortMeRNA
tidyverse
UProC
VSEARCH
and databases:
BGC domains
CANT-HYD
dbCAN and dbCAN-sub
NCyc
PCyc
Pfam (UProC format)
PlasticDB
Resfams
Silva SSU nr99 (DADA2 format)
Pereira-Flores E, Barberan A, Glöckner FO, Fernandez-Guerra A (2021) Mg-Traits pipeline: advancing functional trait-based approaches in metagenomics. ARPHA Conference Abstracts 4: e64908. https://doi.org/10.3897/aca.4.e64908
Please reach out with any comments, concerns, or discussion regarding Mg-Traits. It is primarily maintained by Emiliano Perea for NewAtlantis Labs.