metaGOflow overview
Welcome to the metaGOflow wiki!
metaGOflow supports:
- the fast inference of taxonomic profiles from shotgun metagenomics data based on rRNA genes and their mOTUs
- the functional annotation of the raw reads
- their assembly using the MEGAHIT algorithm
metaGOflow's main input files are:
- forward and reverse .fastq files of shotgun metagenomics data, which can be either local or retrieved through an ENA run accession number, and
- the config.yml file, where the user provides all the necessary parameter values for the workflow to run.
metaGOflow takes only a short list of arguments through the CLI, all of which relate to how the workflow is executed.
You need to specify the raw data to be used.
To fetch private ENA data, add the -p flag.
If you are using Singularity, also add the -s flag.
Pipeline parameters:
-f Forward reads fasta file path (mandatory if and only if -e is not used).
-r Reverse reads fasta file path (mandatory if and only if -e is not used).
-e ENA run accession number. Its raw data will be fetched and then analysed (if used, -f and -r should not be set).
-d Output directory name (mandatory).
-n Name of run and prefix to output files (mandatory).
-s Run workflow using Singularity (Docker is the default container technology). Works as a flag, i.e. by adding -s to your command, Singularity is going to be used.
-p Use ENA private data. Works as a flag.
-b Keep tmp folder. Works as a flag.
Resources:
-m Memory to use with toil --defaultMemory. (optional, default ${MEMORY})
-c Number of cpus to use with toil --defaultCores. (optional, default ${NUM_CORES})
-l Limit number of jobs to schedule. (optional, default ${LIMIT_QUEUE})
Here is an example of running metaGOflow with public ENA data, on a cluster that uses Singularity, without asking for the tmp folder to be kept:
./run_wf.sh -e ERR599171 -d TARA_OCEANS_SAMPLE -n ERR599171 -s
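The argument rules listed above (either -e, or both -f and -r; -d and -n always required) can be sketched as a small Python helper that assembles the command line. This is only an illustration: the function build_run_wf_command and its parameter names are hypothetical and are not part of metaGOflow; only the flags themselves come from the parameter list above.

```python
from typing import List, Optional


def build_run_wf_command(
    out_dir: str,
    run_name: str,
    ena_accession: Optional[str] = None,
    forward: Optional[str] = None,
    reverse: Optional[str] = None,
    singularity: bool = False,
    private: bool = False,
    keep_tmp: bool = False,
) -> List[str]:
    """Return the argument list for ./run_wf.sh (hypothetical helper)."""
    has_local = forward is not None or reverse is not None
    # -e and -f/-r are mutually exclusive ways of naming the raw data.
    if ena_accession is not None and has_local:
        raise ValueError("use either -e or -f/-r, not both")
    if ena_accession is None and (forward is None or reverse is None):
        raise ValueError("without -e, both -f and -r are required")

    cmd = ["./run_wf.sh"]
    if ena_accession is not None:
        cmd += ["-e", ena_accession]
    else:
        cmd += ["-f", forward, "-r", reverse]
    cmd += ["-d", out_dir, "-n", run_name]  # both mandatory
    if singularity:
        cmd.append("-s")
    if private:
        cmd.append("-p")
    if keep_tmp:
        cmd.append("-b")
    return cmd


if __name__ == "__main__":
    # Rebuilds the example command shown above.
    print(" ".join(build_run_wf_command(
        "TARA_OCEANS_SAMPLE", "ERR599171",
        ena_accession="ERR599171", singularity=True)))
```

The returned list can be passed to subprocess.run if you prefer to launch the workflow from a script rather than from the shell.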
The config.yml file works as an interface between metaGOflow and the user.
In this file, you set which steps you want to perform as well as all the arguments for the tools that will be invoked.
We strongly advise users not to use the default arguments without first considering their data.
For example, the default min_length_required is 130; if your sequences are shorter than that, metaGOflow will fail.
You need to consider your data first, as well as your computing environment (especially for the functional annotation step), and fill in the config.yml file accordingly.
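One way to check your data against min_length_required before a run is to measure the read lengths in your .fastq files. The sketch below is not part of metaGOflow: the function names are hypothetical, it assumes standard four-line FASTQ records (sequences not wrapped over several lines), and it only reports raw read lengths, so reads may end up even shorter after trimming.

```python
import gzip
from typing import Iterator


def read_lengths(fastq_path: str) -> Iterator[int]:
    """Yield the length of every read in a four-line-per-record FASTQ file."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # the sequence is the 2nd line of each record
                yield len(line.strip())


def fraction_shorter_than(fastq_path: str, min_length: int = 130) -> float:
    """Return the fraction of reads shorter than min_length (0.0 for an empty file)."""
    total = 0
    short = 0
    for length in read_lengths(fastq_path):
        total += 1
        if length < min_length:
            short += 1
    return short / total if total else 0.0
```

If a large fraction of reads falls below the threshold, lowering min_length_required in config.yml is probably worth considering before launching the workflow.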
metaGOflow will return a .zip file that is a compressed RO-Crate.
This is an example of the .zip content from a complete run of the workflow:
├── config.yml
├── ERR599171.yml
├── results
│   ├── ERR599171_1.fastq.trimmed.fasta
│   ├── ERR599171_1.fastq.trimmed.qc_summary
│   ├── ERR599171_2.fastq.trimmed.fasta
│   ├── ERR599171_2.fastq.trimmed.qc_summary
│   ├── ERR599171.merged_CDS.faa
│   ├── ERR599171.merged_CDS.ffn
│   ├── ERR599171.merged.cmsearch.all.tblout.deoverlapped
│   ├── ERR599171.merged.fasta
│   ├── ERR599171.merged.motus.tsv
│   ├── ERR599171.merged.qc_summary
│   ├── ERR599171.merged.unfiltered_fasta
│   ├── fastp.html
│   ├── final.contigs.fa
│   ├── functional-annotation
│   │   ├── ERR599171.merged_CDS.I5.tsv.chunks
│   │   ├── ERR599171.merged_CDS.I5.tsv.gz
│   │   ├── ERR599171.merged.hmm.tsv.chunks
│   │   ├── ERR599171.merged.hmm.tsv.gz
│   │   ├── ERR599171.merged.summary.go
│   │   ├── ERR599171.merged.summary.go_slim
│   │   ├── ERR599171.merged.summary.ips
│   │   ├── ERR599171.merged.summary.ko
│   │   ├── ERR599171.merged.summary.pfam
│   │   ├── ERR599171.merged.emapper.summary.eggnog
│   │   └── stats
│   │       ├── go.stats
│   │       ├── interproscan.stats
│   │       ├── ko.stats
│   │       ├── orf.stats
│   │       └── pfam.stats
│   ├── RNA-counts
│   ├── sequence-categorisation
│   │   ├── 5_8S.fa.gz
│   │   ├── alpha_tmRNA.RF01849.fasta.gz
│   │   ├── Bacteria_large_SRP.RF01854.fasta.gz
│   │   ├── Bacteria_small_SRP.RF00169.fasta.gz
│   │   ├── cyano_tmRNA.RF01851.fasta.gz
│   │   ├── LSU_rRNA_archaea.RF02540.fa.gz
│   │   ├── LSU_rRNA_bacteria.RF02541.fa.gz
│   │   ├── LSU_rRNA_eukarya.RF02543.fa.gz
│   │   ├── RNaseP_bact_a.RF00010.fasta.gz
│   │   ├── SSU_rRNA_archaea.RF01959.fa.gz
│   │   ├── SSU_rRNA_bacteria.RF00177.fa.gz
│   │   ├── SSU_rRNA_eukarya.RF01960.fa.gz
│   │   ├── tmRNA.RF00023.fasta.gz
│   │   ├── tRNA.RF00005.fasta.gz
│   │   └── tRNA-Sec.RF01852.fasta.gz
│   └── taxonomy-summary
│       ├── LSU
│       │   ├── ERR599171.merged_LSU.fasta.mseq.gz
│       │   ├── ERR599171.merged_LSU.fasta.mseq_hdf5.biom
│       │   ├── ERR599171.merged_LSU.fasta.mseq_json.biom
│       │   ├── ERR599171.merged_LSU.fasta.mseq.tsv
│       │   ├── ERR599171.merged_LSU.fasta.mseq.txt
│       │   └── krona.html
│       └── SSU
│           ├── ERR599171.merged_SSU.fasta.mseq.gz
│           ├── ERR599171.merged_SSU.fasta.mseq_hdf5.biom
│           ├── ERR599171.merged_SSU.fasta.mseq_json.biom
│           ├── ERR599171.merged_SSU.fasta.mseq.tsv
│           ├── ERR599171.merged_SSU.fasta.mseq.txt
│           └── krona.html
└── ro-crate-metadata.json
The ro-crate-metadata.json file includes metadata about the sample (a link to its ENA record) and about the metaGOflow version.
A copy of the config.yml file is also included so that the analysis can be reproduced.
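To look inside the returned archive without unpacking it, something like the following sketch could be used. It relies only on the Python standard library; the function name inspect_crate is hypothetical. Since the listing above does not show whether the archive wraps its content in a top-level folder, the metadata file is located by its base name rather than by a fixed path.

```python
import json
import zipfile
from typing import List, Tuple


def inspect_crate(zip_path: str) -> Tuple[dict, List[str]]:
    """Return (parsed ro-crate-metadata.json, sorted list of archive members)."""
    with zipfile.ZipFile(zip_path) as archive:
        names = sorted(archive.namelist())
        # Match on the base name so the file is found with or without
        # a top-level folder inside the archive.
        matches = [n for n in names if n.split("/")[-1] == "ro-crate-metadata.json"]
        if not matches:
            raise FileNotFoundError("ro-crate-metadata.json not found in " + zip_path)
        # Prefer the copy closest to the archive root.
        metadata_name = min(matches, key=lambda n: n.count("/"))
        with archive.open(metadata_name) as handle:
            metadata = json.load(handle)
    return metadata, names
```

RO-Crate metadata is JSON-LD, so the entities describing the run are normally found under the "@graph" key of the returned dictionary; which entities metaGOflow writes there is best checked on a real output file.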
Anything unclear or inaccurate? Please open an issue or email Dr. Haris Zafeiropoulos (haris.zafeiropoulos@kuleuven.be).
For questions about EMO BON protocols, samples, or analyses, you may contact the Observation, Data and Service Development Officer of EMBRC, Dr. Ioulia Santi (ioulia.santi@embrc.eu).