
metaGOflow overview

Haris Zafeiropoulos edited this page Jun 30, 2023 · 14 revisions

Welcome to the metaGOflow wiki!

metaGOflow supports:

  • the fast inference of taxonomic profiles from shotgun metagenomics data based on rRNA genes and their mOTUs
  • the functional annotation of the raw reads
  • their assembly using the MEGAHIT algorithm

[Figure: metaGOflow workflow diagram]

Input

metaGOflow's main input files are:

  • forward and reverse .fastq files of shotgun metagenomics data, which can be either local or retrieved through an ENA run accession number, and
  • the config.yml file, where the user provides all the necessary parameter values for the workflow to run.

metaGOflow arguments

metaGOflow takes only a short list of command-line arguments, which control how the workflow is run. You must specify the raw data to be used. To fetch private ENA data, add the -p flag. If you are using Singularity, also add the -s flag.

Pipeline parameters:
  -f                  Forward reads fastq file path (mandatory if and only if -e is not used).
  -r                  Reverse reads fastq file path (mandatory if and only if -e is not used).
  -e                  ENA run accession number. Its raw data will be fetched and then analysed (if used, -f and -r should not be set).
  -d                  Output directory name (mandatory).
  -n                  Name of run and prefix to output files (mandatory).
  -s                  Run workflow using Singularity (Docker is the default container technology). Works as a flag, i.e. by adding -s to your command, Singularity is going to be used.
  -p                  Use ENA private data. Works as flag.
  -b                  Keep tmp folder. Works as flag. 

Resources:
  -m                  Memory to use with toil --defaultMemory. (optional, default ${MEMORY})
  -c                  Number of cpus to use with toil --defaultCores. (optional, default ${NUM_CORES})
  -l                  Limit number of jobs to schedule. (optional, default ${LIMIT_QUEUE})

Here is an example of running metaGOflow on public ENA data, using Singularity and without keeping the tmp folder.

./run_wf.sh -e ERR599171 -d TARA_OCEANS_SAMPLE -n ERR599171 -s 
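
The argument rules listed above (either -e, or both -f and -r; -d and -n always required) can be checked before a run is submitted. The following is a minimal Python sketch, not part of metaGOflow: the function name build_run_wf_command and its parameter names are hypothetical, and it only assembles the argument list for run_wf.sh without executing it.

```python
from typing import List, Optional


def build_run_wf_command(
    output_dir: str,
    run_name: str,
    ena_accession: Optional[str] = None,
    forward_reads: Optional[str] = None,
    reverse_reads: Optional[str] = None,
    singularity: bool = False,
    private_data: bool = False,
    keep_tmp: bool = False,
) -> List[str]:
    """Return the argument list for run_wf.sh, or raise ValueError.

    Hypothetical helper: it encodes only the rules stated in the usage text.
    """
    if not output_dir or not run_name:
        raise ValueError("-d and -n are mandatory")
    has_local = forward_reads is not None or reverse_reads is not None
    if ena_accession is not None and has_local:
        raise ValueError("use either -e or -f/-r, not both")
    if ena_accession is None and (forward_reads is None or reverse_reads is None):
        raise ValueError("without -e, both -f and -r are required")

    cmd = ["./run_wf.sh"]
    if ena_accession is not None:
        cmd += ["-e", ena_accession]
    else:
        cmd += ["-f", forward_reads, "-r", reverse_reads]
    cmd += ["-d", output_dir, "-n", run_name]
    # The remaining options work as plain flags.
    if singularity:
        cmd.append("-s")
    if private_data:
        cmd.append("-p")
    if keep_tmp:
        cmd.append("-b")
    return cmd


if __name__ == "__main__":
    example = build_run_wf_command(
        "TARA_OCEANS_SAMPLE", "ERR599171",
        ena_accession="ERR599171", singularity=True,
    )
    # prints: ./run_wf.sh -e ERR599171 -d TARA_OCEANS_SAMPLE -n ERR599171 -s
    print(" ".join(example))
```

The returned list can be passed to subprocess.run from the metaGOflow directory if you want to launch the workflow from Python.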

The config.yml file

This file works as an interface between metaGOflow and the user. In this file, you set which steps you want to perform as well as all the arguments for the tools that will be invoked.

We strongly advise users not to rely on the default arguments without first considering their data. For example, the default min_length_required is 130, but your sequences might be shorter, which would cause metaGOflow to fail. Consider your data and your computing environment, especially for the functional annotation step, and fill in the config.yml file accordingly.
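
One way to check whether a min_length_required value suits your data is to look at the read lengths before running the workflow. The sketch below is an illustration only and is not part of metaGOflow. It assumes plain four-line FASTQ records (no wrapped sequences) and that reads shorter than the threshold are the ones discarded; the function names are hypothetical.

```python
import gzip
import io
from typing import Iterator, TextIO


def open_fastq(path: str) -> TextIO:
    """Open a FASTQ file as text, transparently handling .gz files."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_lengths(handle: TextIO) -> Iterator[int]:
    """Yield the sequence length of each record in a 4-line-per-record FASTQ stream."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:  # the second line of each record holds the bases
            yield len(line.strip())


def fraction_below(handle: TextIO, min_length: int) -> float:
    """Return the fraction of reads that are shorter than min_length."""
    total = 0
    short = 0
    for length in read_lengths(handle):
        total += 1
        if length < min_length:
            short += 1
    if total == 0:
        raise ValueError("no reads found")
    return short / total


if __name__ == "__main__":
    # Two toy reads of length 8 and 4; with a threshold of 6, half are too short.
    demo = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n")
    print(fraction_below(demo, min_length=6))  # prints 0.5
```

For a real file, pass open_fastq("your_reads_1.fastq.gz") (a placeholder path) instead of the demo stream. A fraction close to 1.0 for the value in your config.yml suggests the threshold is too high for your reads.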

Output

metaGOflow will return a .zip file that is a compressed RO-Crate. This is an example case of the .zip content from a complete run of the workflow:


├── config.yml
├── ERR599171.yml
├── results
│   ├── ERR599171_1.fastq.trimmed.fasta
│   ├── ERR599171_1.fastq.trimmed.qc_summary
│   ├── ERR599171_2.fastq.trimmed.fasta
│   ├── ERR599171_2.fastq.trimmed.qc_summary
│   ├── ERR599171.merged_CDS.faa
│   ├── ERR599171.merged_CDS.ffn
│   ├── ERR599171.merged.cmsearch.all.tblout.deoverlapped
│   ├── ERR599171.merged.fasta
│   ├── ERR599171.merged.motus.tsv
│   ├── ERR599171.merged.qc_summary
│   ├── ERR599171.merged.unfiltered_fasta
│   ├── fastp.html
│   ├── final.contigs.fa
│   ├── functional-annotation
│   │   ├── ERR599171.merged_CDS.I5.tsv.gz
│   │   ├── ERR599171.merged.hmm.tsv.gz
│   │   ├── ERR599171.merged.summary.go
│   │   ├── ERR599171.merged.summary.go_slim
│   │   ├── ERR599171.merged.summary.ips
│   │   ├── ERR599171.merged.summary.ko
│   │   ├── ERR599171.merged.summary.pfam
│   │   ├── ERR599171.merged.emapper.summary.eggnog
│   │   └── stats
│   │       ├── go.stats
│   │       ├── interproscan.stats
│   │       ├── ko.stats
│   │       ├── orf.stats
│   │       └── pfam.stats
│   ├── RNA-counts
│   ├── sequence-categorisation
│   │   ├── 5_8S.fa.gz
│   │   ├── alpha_tmRNA.RF01849.fasta.gz
│   │   ├── Bacteria_large_SRP.RF01854.fasta.gz
│   │   ├── Bacteria_small_SRP.RF00169.fasta.gz
│   │   ├── cyano_tmRNA.RF01851.fasta.gz
│   │   ├── LSU_rRNA_archaea.RF02540.fa.gz
│   │   ├── LSU_rRNA_bacteria.RF02541.fa.gz
│   │   ├── LSU_rRNA_eukarya.RF02543.fa.gz
│   │   ├── RNaseP_bact_a.RF00010.fasta.gz
│   │   ├── SSU_rRNA_archaea.RF01959.fa.gz
│   │   ├── SSU_rRNA_bacteria.RF00177.fa.gz
│   │   ├── SSU_rRNA_eukarya.RF01960.fa.gz
│   │   ├── tmRNA.RF00023.fasta.gz
│   │   ├── tRNA.RF00005.fasta.gz
│   │   └── tRNA-Sec.RF01852.fasta.gz
│   └── taxonomy-summary
│       ├── LSU
│       │   ├── ERR599171.merged_LSU.fasta.mseq.gz
│       │   ├── ERR599171.merged_LSU.fasta.mseq_hdf5.biom
│       │   ├── ERR599171.merged_LSU.fasta.mseq_json.biom
│       │   ├── ERR599171.merged_LSU.fasta.mseq.tsv
│       │   ├── ERR599171.merged_LSU.fasta.mseq.txt
│       │   └── krona.html
│       └── SSU
│           ├── ERR599171.merged_SSU.fasta.mseq.gz
│           ├── ERR599171.merged_SSU.fasta.mseq_hdf5.biom
│           ├── ERR599171.merged_SSU.fasta.mseq_json.biom
│           ├── ERR599171.merged_SSU.fasta.mseq.tsv
│           ├── ERR599171.merged_SSU.fasta.mseq.txt
│           └── krona.html
└── ro-crate-metadata.json
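
To compare your own output with the listing above, you can summarise the content of the .zip file without extracting it. This is a minimal sketch using Python's standard zipfile module. The function names are hypothetical, and it assumes the archive contains a results folder as shown above, possibly nested under one top-level directory.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath
from typing import Dict, Iterable


def count_by_folder(member_names: Iterable[str]) -> Dict[str, int]:
    """Count files per first-level folder under results/, skipping directory entries."""
    counts: Counter = Counter()
    for name in member_names:
        if name.endswith("/"):
            continue  # zip directory entry, not a file
        parts = PurePosixPath(name).parts
        if "results" not in parts:
            folder = "(top level)"
        else:
            after = parts[parts.index("results") + 1:]
            # Files directly inside results/ are grouped under "results".
            folder = "results" if len(after) <= 1 else "results/" + after[0]
        counts[folder] += 1
    return dict(counts)


def summarise_archive(zip_path: str) -> Dict[str, int]:
    """Open the compressed RO-Crate and count its files per folder."""
    with zipfile.ZipFile(zip_path) as archive:
        return count_by_folder(archive.namelist())
```

Calling summarise_archive on the path of your .zip file returns a dictionary such as the number of files in results/functional-annotation or results/taxonomy-summary. A missing key indicates that the corresponding step produced no output or was not selected in config.yml.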

Data product                                                Description
├── config.yml                                              metaGOflow configuration file
├── ERR599171.yml
├── results
│   ├── ERR599171_1.fastq.trimmed.fasta                     Filtered .fastq file of the forward (R1) reads
│   ├── ERR599171_1.fastq.trimmed.qc_summary                Summary with statistics of the forward (R1) reads
│   ├── ERR599171_2.fastq.trimmed.fasta                     Filtered .fastq file of the reverse (R2) reads
│   ├── ERR599171_2.fastq.trimmed.qc_summary                Summary with statistics of the reverse (R2) reads
│   ├── ERR599171.merged_CDS.faa                            Amino acid coding sequences
│   ├── ERR599171.merged_CDS.ffn                            Nucleotide coding sequences
│   ├── ERR599171.merged.cmsearch.all.tblout.deoverlapped   Sequence hits against covariance model databases
│   ├── ERR599171.merged.fasta                              Merged filtered sequences
│   ├── ERR599171.merged.motus.tsv                          mOTUs of the merged sequences
│   ├── ERR599171.merged.qc_summary                         Quality control (QC) summary of the merged sequences
│   ├── ERR599171.merged.unfiltered_fasta                   Merged sequences that did not pass the filtering
│   ├── fastp.html                                          fastp analysis of raw sequence data
│   ├── final.contigs.fa                                    FASTA-formatted contig sequences
│   ├── functional-annotation                               Functional annotation results
│   │   ├── ERR599171.merged_CDS.I5.tsv.chunks
│   │   ├── ERR599171.merged_CDS.I5.tsv.gz                  Merged contigs CDS I5 summary
│   │   ├── ERR599171.merged.hmm.tsv.chunks
│   │   ├── ERR599171.merged.hmm.tsv.gz                     Merged contigs HMM summary
│   │   ├── ERR599171.merged.summary.go                     Gene Ontology annotation summary
│   │   ├── ERR599171.merged.summary.go_slim                GO slim annotation summary
│   │   ├── ERR599171.merged.summary.ips                    InterProScan annotation summary
│   │   ├── ERR599171.merged.summary.ko                     KO annotation summary
│   │   ├── ERR599171.merged.summary.pfam                   Pfam annotation summary
│   │   ├── ERR599171.merged.emapper.summary.eggnog         eggNOG annotation summary
│   │   └── stats
│   │       ├── go.stats                                    Gene Ontology (GO) annotation summary statistics
│   │       ├── interproscan.stats                          InterProScan annotation summary statistics
│   │       ├── ko.stats                                    KEGG Orthology (KO) annotation summary statistics
│   │       ├── orf.stats                                   Open Reading Frame (ORF) annotation summary statistics
│   │       └── pfam.stats                                  Pfam annotation summary statistics
│   ├── RNA-counts                                          Numbers of RNAs counted
│   ├── sequence-categorisation                             Sequence categorisation
│   │   ├── 5_8S.fa.gz                                      5.8S ribosomal RNA sequences
│   │   ├── alpha_tmRNA.RF01849.fasta.gz                    Predicted Alphaproteobacteria transfer-messenger RNA (RF01849)
│   │   ├── Bacteria_large_SRP.RF01854.fasta.gz             Predicted Bacterial large signal recognition particle RNA (RF01854)
│   │   ├── Bacteria_small_SRP.RF00169.fasta.gz             Predicted Bacterial small signal recognition particle RNA (RF00169)
│   │   ├── cyano_tmRNA.RF01851.fasta.gz                    Predicted Cyanobacteria transfer-messenger RNA (RF01851)
│   │   ├── LSU_rRNA_archaea.RF02540.fa.gz                  Predicted Archaeal large subunit ribosomal RNA (RF02540)
│   │   ├── LSU_rRNA_bacteria.RF02541.fa.gz                 Predicted Bacterial large subunit ribosomal RNA (RF02541)
│   │   ├── LSU_rRNA_eukarya.RF02543.fa.gz                  Predicted Eukaryotic large subunit ribosomal RNA (RF02543)
│   │   ├── RNaseP_bact_a.RF00010.fasta.gz                  Predicted Bacterial RNase P class A (RF00010)
│   │   ├── SSU_rRNA_archaea.RF01959.fa.gz                  Predicted Archaeal small subunit ribosomal RNA (RF01959)
│   │   ├── SSU_rRNA_bacteria.RF00177.fa.gz                 Predicted Bacterial small subunit ribosomal RNA (RF00177)
│   │   ├── SSU_rRNA_eukarya.RF01960.fa.gz                  Predicted Eukaryotic small subunit ribosomal RNA (RF01960)
│   │   ├── tmRNA.RF00023.fasta.gz                          Predicted transfer-messenger RNA (RF00023)
│   │   ├── tRNA.RF00005.fasta.gz                           Predicted transfer RNA (RF00005)
│   │   └── tRNA-Sec.RF01852.fasta.gz                       Predicted Selenocysteine transfer RNA (RF01852)
│   └── taxonomy-summary
│       ├── LSU
│       │   ├── ERR599171.merged_LSU.fasta.mseq.gz          LSU rRNA sequences used for taxonomic identification
│       │   ├── ERR599171.merged_LSU.fasta.mseq_hdf5.biom   OTUs and taxonomic assignments for LSU rRNA (HDF5-formatted BIOM)
│       │   ├── ERR599171.merged_LSU.fasta.mseq_json.biom   OTUs and taxonomic assignments for LSU rRNA (JSON-formatted BIOM)
│       │   ├── ERR599171.merged_LSU.fasta.mseq.tsv         Tab-separated formatted taxon counts for LSU rRNA sequences
│       │   ├── ERR599171.merged_LSU.fasta.mseq.txt         Text-based taxon counts for LSU rRNA sequences
│       │   └── krona.html                                  Interactive Krona charts for LSU rRNA taxonomic inventory
│       └── SSU
│           ├── ERR599171.merged_SSU.fasta.mseq.gz          SSU rRNA sequences used for taxonomic identification
│           ├── ERR599171.merged_SSU.fasta.mseq_hdf5.biom   OTUs and taxonomic assignments for SSU rRNA (HDF5-formatted BIOM)
│           ├── ERR599171.merged_SSU.fasta.mseq_json.biom   OTUs and taxonomic assignments for SSU rRNA (JSON-formatted BIOM)
│           ├── ERR599171.merged_SSU.fasta.mseq.tsv         Tab-separated formatted taxon counts for SSU rRNA sequences
│           ├── ERR599171.merged_SSU.fasta.mseq.txt         Text-based taxon counts for SSU rRNA sequences
│           └── krona.html                                  Interactive Krona charts for SSU rRNA taxonomic inventory
└── ro-crate-metadata.json                                  JSON-LD file describing the structure of the RO-Crate

The ro-crate-metadata.json file includes metadata about the sample (link to its ENA record) and about the metaGOflow version. A copy of the config.yml file is also included, so one can reproduce the analysis.
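
The metadata can be read straight from the archive. The sketch below relies only on the general RO-Crate layout: a JSON-LD document with an "@graph" list, in which the entity with "@id" ro-crate-metadata.json points to the root dataset through its "about" property. The function names are hypothetical, and the specific properties that metaGOflow writes for the sample and the version are not assumed here.

```python
import json
import zipfile
from pathlib import PurePosixPath
from typing import Any, Dict

METADATA_NAME = "ro-crate-metadata.json"


def load_crate_metadata(zip_path: str) -> Dict[str, Any]:
    """Load ro-crate-metadata.json from the compressed RO-Crate."""
    with zipfile.ZipFile(zip_path) as archive:
        # Match on the file name so that a top-level folder in the archive is tolerated.
        matches = [n for n in archive.namelist() if PurePosixPath(n).name == METADATA_NAME]
        if not matches:
            raise FileNotFoundError(f"{METADATA_NAME} not found in {zip_path}")
        with archive.open(matches[0]) as handle:
            return json.load(handle)


def root_entity(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return the root dataset entity referenced by the metadata descriptor."""
    entities = {entity["@id"]: entity for entity in metadata["@graph"]}
    descriptor = entities[METADATA_NAME]
    return entities[descriptor["about"]["@id"]]
```

Printing root_entity(load_crate_metadata(path)) for your own archive shows the top-level description of the crate. The other entries in "@graph" describe the individual files.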