new-atlantis-labs/Mg-Traits

🌊 Mg-Traits: Metagenomic Functional Trait Analysis

{{Talk about NewAtlantis}}

Mg-Traits is a command-line application written in BASH, AWK, and R, dedicated to the computation of functional traits at the metagenome level (i.e., functional aggregated traits), ranging from GC variance and amino acid composition to functional diversity and average genome size. It takes a preprocessed (unassembled) metagenomic sample as input and outputs the computed metagenomic traits, organized in different tables and grouped in separate folders according to the type of data source (see Fig. 1).

Mg-Traits allows the systematic computation of a comprehensive set of metagenomic functional traits, which can be used to generate a functional and taxonomic fingerprint and reveal the predominant life-history strategies and ecological processes in a microbial community. Mg-Traits contributes to improving the exploitation of metagenomic data and facilitates comparative and quantitative studies. Considering the high genomic plasticity of microorganisms and their capacity to rapidly adapt to changing environmental conditions, Mg-Traits constitutes a valuable tool to monitor environmental systems.

⚙️ Getting Started

Mg-Traits is simple to run! On Linux, you can get started with the commands below. Please note that the first time you run this script it will download a Docker image, which may take some time.

Make sure the Docker runtime is active when you run these commands.

wget https://raw.githubusercontent.com/new-atlantis-labs/Mg-Traits/main/run_mg_traits.sh

chmod +x run_mg_traits.sh

./run_mg_traits.sh . . --help

Congratulations, you can now use Mg-Traits!

🧑‍💻 Developers

Looking to build Mg-Traits locally? Follow these steps. This route is only recommended for those looking to develop on top of Mg-Traits. The NewAtlantis container registry is recommended for most use cases.

First clone the repository and enter it.

git clone https://github.com/new-atlantis-labs/Mg-Traits.git

cd Mg-Traits

Then navigate into the cont_env folder and build the Docker image from the Dockerfile:

cd cont_env

docker build -t mg-traits-local:1.0 .

🛠️ Usage

Usage: ./mg_traits.sh <input file> <output dir> <options>
--help                          print this help
--caz_subfam_annot t|f          annotate CAZyme subfamilies (default f)
--clean t|f                     remove intermediate files (i.e., *.info, *.ffn, *.faa, *.hout) (default f)
--confidence NUM                confidence value to run rdp bayes classifier (from 0 to 100; default 50)
--evalue_acn NUM                evalue to filter reads for ACN computation (default 1e-15)
--evalue_div NUM                evalue to filter reads for diversity estimation (default 1e-15)
--evalue_res NUM                evalue to annotate ResFam with hmmsearch (default 1e-15)
--evalue_caz_fam NUM            evalue to annotate CAZyme families with hmmsearch (default 1e-15)
--evalue_caz_subfam NUM         evalue to annotate CAZyme subfamilies with hmmsearch (default 1e-15)
--evalue_hyd NUM                evalue to annotate Hyd with hmmsearch (default 1e-15)
--evalue_ncy NUM                evalue to annotate NCycle with diamond (default 1e-15)
--evalue_pcy NUM                evalue to annotate PCycle with diamond (default 1e-15)
--evalue_pls NUM                evalue to annotate Plastic DB with diamond (default 1e-15)
--nslots NUM                    number of threads used (default 12)
--max_length NUM                maximum read length used to trim reads (from the 3' end) for AGS computation (default 180)
--min_length NUM                minimum read length used to estimate taxonomic diversity (default 100)
--overwrite t|f                 overwrite previous directory (default f)
--ref_db CHAR                   reference database to run NBC (default silva_nr99_v138_train_set.fa.gz) 
--sample_name CHAR              sample name (default metagenomex)
--train_file_name CHAR          train file name to run FragGeneScan, see FragGeneScan help for options (default illumina_5)
--verbose t|f                   reduced verbose (default t)
--verbose_all t|f               complete verbose (default f)

<input file>: FASTA file used to compute mg-traits.
<output dir>: Output directory to store all computed mg-traits.
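
As a rough illustration of the usage above, the following sketch prints one command per FASTA file so that several samples can be processed in a row. It assumes that run_mg_traits.sh accepts the same <input file> <output dir> <options> arguments as mg_traits.sh. The samples/ and results/ paths and the mg_traits_cmd helper are hypothetical names chosen for this example.

```shell
#!/usr/bin/env bash
# Sketch: build one Mg-Traits command per FASTA file (dry run).
# Assumption: run_mg_traits.sh takes <input file> <output dir> <options>,
# like mg_traits.sh in the usage text above.

# Print the command for one sample (hypothetical helper).
mg_traits_cmd() {
  local fasta="$1" outdir="$2"
  local name
  name="$(basename "${fasta%.*}")"   # sample name = file name without extension
  printf './run_mg_traits.sh %s %s --sample_name %s --nslots 4 --clean t\n' \
    "$fasta" "$outdir/$name" "$name"
}

# Dry run: print the commands only; nothing is executed here.
for f in samples/*.fasta; do
  [ -e "$f" ] || continue            # skip if the glob matched nothing
  mg_traits_cmd "$f" results
done
```

Printing the commands first makes it easy to review them before running; once they look right, the output can be piped to bash.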

🚀 Output

All files including computed traits have the format (tab separated): <sample name> <trait> <value>
This allows a straightforward concatenation of any specific trait computed in different samples.
The computed traits are organized in 13 different folders, as shown below. For each folder, we added a comment specifying the traits that are included.

.
├── acn # Average 16S rRNA gene copy number (ACN)
│   ├── sample_acn.tsv
│   ├── sample_smrna.blast
│   ├── sample_smrna.fa
│   └── sample_smrna.log
├── ags # Average genome size (AGS)
│   ├── sample_ags.tsv
│   ├── sample_single_cogs_count.tsv
│   └── sample_uout.csv
├── bgc # Biosynthetic Gene Cluster domains (uproc)
│   ├── sample_bgc_annot.tsv
│   ├── sample_bgc_stats.tsv
│   └── sample.uout
├── caz # Carbohydrate active enzymes (CAZymes families and subfamilies) (hmmsearch)
│   ├── sample_caz_fam_annot.tsv
│   ├── sample_caz_fam.domtblout
│   ├── sample_caz_fam.hout
│   ├── sample_caz_fam_stats.tsv
│   ├── sample_caz_sub_annot.tsv
│   ├── sample_caz_sub.domtblout
│   ├── sample_caz_sub.hout
│   └── sample_caz_sub_stats.tsv
├── fun # Pfam (uproc)
│   ├── sample_fun_annot.tsv
│   ├── sample_fun_stats.tsv
│   └── sample.uout
├── hyd # Hydrocarbon degradation enzymes (hmmsearch)
│   ├── sample.domtblout
│   ├── sample.hout
│   ├── sample_hyd_annot.tsv
│   └── sample_hyd_stats.tsv
├── ncy # Nitrogen cycling genes (diamond)
│   ├── sample.blout
│   ├── sample_ncy_annot.tsv
│   └── sample_ncy_stats.tsv
├── nuc # Nucleotide composition
│   ├── sample.compseq
│   ├── sample_gc_stats.tsv
│   ├── sample.info.gz
│   └── sample_nuc_comp
├── orf # Open Reading Frames (FragGeneScanRs)
│   ├── sample_aa_comp.tsv
│   ├── sample_codon_comp.tsv
│   ├── sample.cusp
│   ├── sample.faa.gz
│   ├── sample.ffn.gz
│   └── sample_orf_stats.tsv
├── pcy # Phosphorus cycling genes (diamond)
│   ├── sample.blout
│   ├── sample_pcy_annot.tsv
│   └── sample_pcy_stats.tsv
├── pls # Plastic degradation enzymes (diamond)
│   ├── sample.blout
│   ├── sample_pls_annot.tsv
│   └── sample_pls_stats.tsv
├── res # Antibiotic resistance genes (hmmsearch)
│   ├── sample.domtblout
│   ├── sample.hout
│   ├── sample_res_annot.tsv
│   └── sample_res_stats.tsv
└── tax # Taxonomy (naive bayes classifier)
    ├── sample_centroids.fasta
    ├── sample_div.tsv
    ├── sample_sample2otu2abund2taxa.tsv
    ├── sample_subseq.fasta
    └── sample.uclust

The acn and ags outputs are explained here.
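
Because every trait table shares the same three-column layout, merging one trait across samples can be sketched with standard tools. The example below is an illustration under stated assumptions, not part of the pipeline: it assumes each sample was written to its own directory under a common parent (for example results/<sample>/), that the files have no header line, and that GNU find, sort, and xargs are available. The collect_trait and long_to_wide helpers are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: merge one trait across samples.
# Assumptions: one output directory per sample under a common parent,
# tab-separated rows of <sample name> <trait> <value>, no header line.

# Concatenate every file whose name ends with the given suffix.
collect_trait() {
  local results_dir="$1" suffix="$2"
  find "$results_dir" -type f -name "*${suffix}" -print0 |
    sort -z |
    xargs -0 -r cat
}

# Pivot long rows (sample, trait, value) into one row per sample.
long_to_wide() {
  awk -F'\t' -v OFS='\t' '
    { if (!($1 in s)) { s[$1]; so[++ns] = $1 }   # remember sample order
      if (!($2 in t)) { t[$2]; to[++nt] = $2 }   # remember trait order
      v[$1 SUBSEP $2] = $3 }
    END {
      printf "sample"
      for (j = 1; j <= nt; j++) printf "%s%s", OFS, to[j]
      print ""
      for (i = 1; i <= ns; i++) {
        printf "%s", so[i]
        for (j = 1; j <= nt; j++) {
          k = so[i] SUBSEP to[j]
          printf "%s%s", OFS, (k in v ? v[k] : "NA")   # NA marks a missing trait
        }
        print ""
      }
    }'
}
```

For instance, collect_trait results _gc_stats.tsv | long_to_wide > gc_stats_all.tsv would produce one row per sample and one column per trait found in the GC statistics files.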

Functional composition (i.e., bgc, caz, fun, hyd, ncy, pcy, pls, and res):
The *_stats.tsv files contain the Shannon diversity, richness, and percentage of ORFs annotated.
The *_annot.tsv files are the gene count tables.
The *.domtblout and *.hout files are the hmmsearch outputs, and the *.uout and *.blout files are the UProC and diamond outputs, respectively.
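
To make the diversity statistics concrete, here is a minimal sketch of how Shannon diversity (H = -sum(p_i * ln p_i)) and richness (number of features with a non-zero count) can be derived from a gene count table. It assumes the *_annot.tsv files follow the general <sample name> <feature> <count> layout and uses the natural logarithm; the pipeline's own implementation may differ in both respects. The shannon_richness helper is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: Shannon diversity and richness from a gene count table.
# Assumptions: tab-separated input with the count in column 3
# (<sample name> <feature> <count>); natural logarithm for Shannon.

shannon_richness() {
  awk -F'\t' -v OFS='\t' '
    $3 + 0 > 0 { n[$2] += $3; total += $3 }   # keep positive counts only
    END {
      for (f in n) {
        p = n[f] / total       # relative abundance of feature f
        h -= p * log(p)        # H = -sum(p * ln p)
        rich++
      }
      printf "shannon%s%.4f\nrichness%s%d\n", OFS, h, OFS, rich
    }' "$@"
}
```

The function reads a file given as an argument or, without arguments, standard input, for example shannon_richness res/sample_res_annot.tsv.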

Nucleotide composition (i.e., nuc):
*.compseq is the compseq (EMBOSS) output.
*_nuc_comp is the tetranucleotide composition in tab-separated format.
*.info.gz is the compressed infoseq (EMBOSS) output.
*_gc_stats.tsv contains the GC mean and variance.
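
As an illustration of what the GC statistics summarize, the sketch below computes a per-read GC fraction directly from a FASTA file and then the mean and variance across reads. This is not the pipeline's method, which works from the infoseq output; it assumes uncompressed FASTA input, the population variance, and a GC fraction of (G + C) divided by the full read length. The gc_mean_var helper and the output labels are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: per-read GC fraction, then mean and variance across reads.
# Assumptions: plain (uncompressed) FASTA input; population variance;
# GC fraction = (G + C) / read length, with N counted in the length.

gc_mean_var() {
  awk '
    # Close the current read and add its GC fraction to the running sums.
    function flush_read(   gc) {
      if (len > 0) {
        gc = gcn / len
        n++; sum += gc; sumsq += gc * gc
      }
      len = 0; gcn = 0
    }
    /^>/ { flush_read(); next }      # a header line starts a new read
    {
      seq = toupper($0)
      len += length(seq)
      gcn += gsub(/[GC]/, "", seq)   # gsub returns the number of matches
    }
    END {
      flush_read()
      if (n == 0) exit 1             # no reads found
      mean = sum / n
      printf "gc_mean\t%.4f\ngc_var\t%.6f\n", mean, sumsq / n - mean * mean
    }' "$@"
}
```

Called as gc_mean_var sample.fasta, it prints two tab-separated lines, one for the mean and one for the variance.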

Open Reading Frames annotation (i.e., orf):
*_aa_comp.tsv is the amino acid composition.
*_codon_comp.tsv is the codon composition.
*.cusp is the cusp (EMBOSS) output.
*.faa.gz are the ORF amino acid sequences.
*.ffn.gz are the ORF nucleotide sequences.

Taxonomic annotation (i.e., tax):
*_centroids.fasta are the cluster centroid sequences.
*_div.tsv are diversity and richness estimates.
*_sample2otu2abund2taxa.tsv is the taxonomic annotation output.
*_subseq.fasta are the 16S rRNA gene sequences extracted from the reads.
*.uclust is the uclust output.

📈 Workflow description

Figure 1

Figure 1. Mg-Traits pipeline. The metagenomic traits computed by the Mg-Traits pipeline are divided into four groups. The first includes the metagenomic traits computed at the nucleotide level: GC content, GC variance, and tetranucleotide frequency. The second group includes the traits obtained from the open reading frame (ORF) sequence data: ORFs to base pairs (BPs) ratio, codon frequency, amino acid frequency, and acidic to basic amino acid ratio. The third group is based on the functional annotation of the ORF amino acid sequences. For this, we use Pfam and another seven specialized databases: Biosynthetic Gene Cluster (BGC) domains, Resfams, CANT-HYD, NCyc, PCyc, PlasticDB, and CAZymes. For each reference database, we compute the composition, diversity, richness, and percentage of annotated genes. Additionally, this group includes the percentage of transcription factors (TFs) and the average genome size (AGS). Lastly, the fourth group includes the taxonomy-related metagenomic traits: average copy number of 16S rRNA genes (ACN), taxonomic composition, diversity, and richness.

🗂 Project Organization

├── LICENSE
├── README.md                                   <- The top-level README for developers using this project.
├── cont_env
│   ├── Dockerfile
│   ├── resources
│   │   ├── Pfam_v28.0_acc.txt
│   │   ├── PlasticDB.fasta.gz
│   │   ├── TF.txt
│   │   └── all_cog_lengths.tsv
│   └── software
│       └── mg_traits
│           ├── conf.sh
│           ├── funs.sh
│           ├── mg_traits.sh
│           ├── modules
│           │   ├── module10_pcy_mg_traits.sh
│           │   ├── module11_pls_mg_traits.sh
│           │   ├── module1_nuc_mg_traits.sh
│           │   ├── module2_orf_mg_traits.sh
│           │   ├── module3_fun_mg_traits.sh
│           │   ├── module4_tax_mg_traits.sh
│           │   ├── module5_res_mg_traits.sh
│           │   ├── module6_bgc_mg_traits.sh
│           │   ├── module7_caz_mg_traits.sh
│           │   ├── module8_hyd_mg_traits.sh
│           │   └── module9_ncy_mg_traits.sh
│           └── toolbox
│               ├── acn.sh
│               ├── ags.sh
│               ├── taxa_annot_DADA2.R
│               └── taxa_annot_rRDP.R
├── figures
│   ├── Mg-Traits2.png
│   └── Mg_Traits-ENG.png
└── run_mg_traits.sh

🚗 Dependencies

Mg-Traits utilizes the following tools:
AGS and ACN tools
BBTools
DADA2
diamond
EMBOSS
FragGeneScanRs
HMMER
R
seqtk
SortMeRNA
tidyverse
UProC
VSEARCH

and databases:
BGC domains
CANT-HYD
dbCAN and dbCAN-sub
NCyc
PCyc
Pfam (UProC format)
PlasticDB
Resfams
Silva SSU nr99 (DADA2 format)

📝 Please Cite

Pereira-Flores E, Barberan A, Glöckner FO, Fernandez-Guerra A (2021) Mg-Traits pipeline: advancing functional trait-based approaches in metagenomics. ARPHA Conference Abstracts 4: e64908. https://doi.org/10.3897/aca.4.e64908

📲 Contact

Please reach out with any comments, concerns, or discussion regarding Mg-Traits. It is primarily maintained by Emiliano Perea for NewAtlantis Labs.

Discord Twitter Email ORCiD
