metaGOflow overview

Haris Zafeiropoulos edited this page May 10, 2023 · 14 revisions

Welcome to the metaGOflow wiki!

metaGOflow supports:

  • the fast inference of taxonomic profiles from shotgun metagenomics data based on rRNA genes and their mOTUs
  • the functional annotation of the raw reads
  • their assembly using the MEGAHIT algorithm

[Figure: metaGOflow workflow diagram]

Input

metaGOflow's main input files are:

  • forward and reverse .fastq files of shotgun metagenomics data, which can either be local or be retrieved through an ENA run accession number, and
  • the config.yml file, where the user provides all the necessary parameter values for the workflow to run.

metaGOflow arguments

metaGOflow takes only a short list of arguments through the CLI, all of which control how the workflow is executed. You need to specify the raw data to be used. To fetch private ENA data, add the -p flag. If you are using Singularity, also add the -s flag.

Pipeline parameters:
  -f                  Forward reads fastq file path (mandatory if and only if -e is not used).
  -r                  Reverse reads fastq file path (mandatory if and only if -e is not used).
  -e                  ENA run accession number. Its raw data will be fetched and then analysed (if used, -f and -r should not be set).
  -d                  Output directory name (mandatory).
  -n                  Name of run and prefix to output files (mandatory).
  -s                  Run workflow using Singularity (Docker is the default container technology). Works as a flag, i.e. adding -s to your command makes the workflow use Singularity.
  -p                  Use ENA private data. Works as flag.
  -b                  Keep tmp folder. Works as flag. 

Resources:
  -m                  Memory to use with toil --defaultMemory. (optional, default ${MEMORY})
  -c                  Number of cpus to use with toil --defaultCores. (optional, default ${NUM_CORES})
  -l                  Limit number of jobs to schedule. (optional, default ${LIMIT_QUEUE})

Here is an example of running metaGOflow on public ENA data with Singularity, without keeping the tmp folder:

./run_wf.sh -e ERR599171 -d TARA_OCEANS_SAMPLE -n ERR599171 -s 
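The rules in the option list above (either -e, or both -f and -r, plus the mandatory -d and -n) can be checked before launching a run. The following is a minimal sketch, not part of metaGOflow: the function name build_command and its keyword arguments are hypothetical, and only the flags documented above are used.

```python
def build_command(out_dir, run_name, ena_accession=None, forward=None,
                  reverse=None, singularity=False, private=False,
                  keep_tmp=False, script="./run_wf.sh"):
    """Build the run_wf.sh argument list, enforcing the documented rules.

    Hypothetical helper: the name and signature are illustrative only.
    """
    has_local = forward is not None or reverse is not None
    # -e and -f/-r are mutually exclusive.
    if ena_accession is not None and has_local:
        raise ValueError("-e cannot be combined with -f/-r")
    # Without -e, both read files are mandatory.
    if ena_accession is None and (forward is None or reverse is None):
        raise ValueError("without -e, both -f and -r are required")

    command = [script]
    if ena_accession is not None:
        command += ["-e", ena_accession]
    else:
        command += ["-f", forward, "-r", reverse]
    command += ["-d", out_dir, "-n", run_name]  # both mandatory
    if singularity:
        command.append("-s")
    if private:
        command.append("-p")
    if keep_tmp:
        command.append("-b")
    return command
```

The returned list can be passed to subprocess.run(..., check=True) from the directory that contains run_wf.sh. Calling build_command("TARA_OCEANS_SAMPLE", "ERR599171", ena_accession="ERR599171", singularity=True) reproduces the example command above.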

The config.yml file

This file works as an interface between metaGOflow and the user. In this file, you set which steps you want to perform as well as all the arguments for the tools that will be invoked.

We strongly advise users not to use the default arguments without first considering their data. For example, the default min_length_required is 130, but your sequences might be shorter, which would cause metaGOflow to fail. Consider your data as well as your computing environment, especially for the functional annotation step, and fill in the config.yml file accordingly.
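One way to check your data against min_length_required before a run is to measure the read lengths in your input files. The sketch below assumes that min_length_required is a minimum read length in bases and that the input is a standard four-lines-per-record FASTQ file, optionally gzipped. The function names are hypothetical and not part of metaGOflow.

```python
import gzip


def read_lengths(fastq_path):
    """Yield the length of each read in a 4-line-per-record FASTQ file."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # The sequence is the second line of every 4-line record.
            if line_number % 4 == 1:
                yield len(line.strip())


def fraction_at_least(fastq_path, min_length_required=130):
    """Return the fraction of reads whose length meets the threshold."""
    total = 0
    passing = 0
    for length in read_lengths(fastq_path):
        total += 1
        if length >= min_length_required:
            passing += 1
    return passing / total if total else 0.0
```

If the returned fraction is close to zero for your files, the threshold in config.yml is probably too high for your data.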

Output

metaGOflow returns a .zip file that is a compressed RO-Crate. Below is an example of the .zip content from a complete run of the workflow:


├── config.yml
├── ERR599171.yml
├── results
│   ├── ERR599171_1.fastq.trimmed.fasta
│   ├── ERR599171_1.fastq.trimmed.qc_summary
│   ├── ERR599171_2.fastq.trimmed.fasta
│   ├── ERR599171_2.fastq.trimmed.qc_summary
│   ├── ERR599171.merged_CDS.faa
│   ├── ERR599171.merged_CDS.ffn
│   ├── ERR599171.merged.cmsearch.all.tblout.deoverlapped
│   ├── ERR599171.merged.fasta
│   ├── ERR599171.merged.motus.tsv
│   ├── ERR599171.merged.qc_summary
│   ├── ERR599171.merged.unfiltered_fasta
│   ├── fastp.html
│   ├── final.contigs.fa
│   ├── functional-annotation
│   │   ├── ERR599171.merged_CDS.I5.tsv.chunks
│   │   ├── ERR599171.merged_CDS.I5.tsv.gz
│   │   ├── ERR599171.merged.hmm.tsv.chunks
│   │   ├── ERR599171.merged.hmm.tsv.gz
│   │   ├── ERR599171.merged.summary.go
│   │   ├── ERR599171.merged.summary.go_slim
│   │   ├── ERR599171.merged.summary.ips
│   │   ├── ERR599171.merged.summary.ko
│   │   ├── ERR599171.merged.summary.pfam
│   │   ├── ERR599171.merged.emapper.summary.eggnog
│   │   └── stats
│   │       ├── go.stats
│   │       ├── interproscan.stats
│   │       ├── ko.stats
│   │       ├── orf.stats
│   │       └── pfam.stats
│   ├── RNA-counts
│   ├── sequence-categorisation
│   │   ├── 5_8S.fa.gz
│   │   ├── alpha_tmRNA.RF01849.fasta.gz
│   │   ├── Bacteria_large_SRP.RF01854.fasta.gz
│   │   ├── Bacteria_small_SRP.RF00169.fasta.gz
│   │   ├── cyano_tmRNA.RF01851.fasta.gz
│   │   ├── LSU_rRNA_archaea.RF02540.fa.gz
│   │   ├── LSU_rRNA_bacteria.RF02541.fa.gz
│   │   ├── LSU_rRNA_eukarya.RF02543.fa.gz
│   │   ├── RNaseP_bact_a.RF00010.fasta.gz
│   │   ├── SSU_rRNA_archaea.RF01959.fa.gz
│   │   ├── SSU_rRNA_bacteria.RF00177.fa.gz
│   │   ├── SSU_rRNA_eukarya.RF01960.fa.gz
│   │   ├── tmRNA.RF00023.fasta.gz
│   │   ├── tRNA.RF00005.fasta.gz
│   │   └── tRNA-Sec.RF01852.fasta.gz
│   └── taxonomy-summary
│       ├── LSU
│       │   ├── ERR599171.merged_LSU.fasta.mseq.gz
│       │   ├── ERR599171.merged_LSU.fasta.mseq_hdf5.biom
│       │   ├── ERR599171.merged_LSU.fasta.mseq_json.biom
│       │   ├── ERR599171.merged_LSU.fasta.mseq.tsv
│       │   ├── ERR599171.merged_LSU.fasta.mseq.txt
│       │   └── krona.html
│       └── SSU
│           ├── ERR599171.merged_SSU.fasta.mseq.gz
│           ├── ERR599171.merged_SSU.fasta.mseq_hdf5.biom
│           ├── ERR599171.merged_SSU.fasta.mseq_json.biom
│           ├── ERR599171.merged_SSU.fasta.mseq.tsv
│           ├── ERR599171.merged_SSU.fasta.mseq.txt
│           └── krona.html
└── ro-crate-metadata.json

The ro-crate-metadata.json file includes metadata about the sample (link to its ENA record) and about the metaGOflow version. A copy of the config.yml file is also included so one can reproduce the analysis.
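The metadata can be inspected without unpacking the archive. The sketch below relies only on the general RO-Crate convention that ro-crate-metadata.json is a JSON-LD document whose entities sit in an "@graph" list. The function names are hypothetical, and the specific entities that metaGOflow writes are not assumed here.

```python
import json
import zipfile

METADATA_NAME = "ro-crate-metadata.json"


def load_crate_metadata(zip_path):
    """Read ro-crate-metadata.json from a zipped RO-Crate."""
    with zipfile.ZipFile(zip_path) as archive:
        # Accept the file at the archive root or inside a top-level folder.
        candidates = [
            name for name in archive.namelist()
            if name == METADATA_NAME or name.endswith("/" + METADATA_NAME)
        ]
        if not candidates:
            raise FileNotFoundError(METADATA_NAME + " not found in " + str(zip_path))
        with archive.open(candidates[0]) as handle:
            return json.load(handle)


def list_entities(metadata):
    """Return (@id, @type) pairs for every entity in the crate's @graph."""
    return [(entity.get("@id"), entity.get("@type"))
            for entity in metadata.get("@graph", [])]
```

Printing list_entities(load_crate_metadata(path_to_zip)) gives a quick overview of what the crate describes before you open individual result files.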