Snakefile config
The config.yaml file contains a list of parameters that are read by the Snakefile. Instead of editing the Snakefile whenever you want to change a parameter, just create a new copy of the config.yaml file. Now that's what I call reproducibility.
The config.yaml file looks something like this:
path:
    root: /path/to/your/project/folder/on/the/cluster
    scratch: $SCRATCH_FOLDER_VARIABLE_SPECIFIC_TO_YOUR_CLUSTER
folder:
    data: dataset
    logs: logs
    assemblies: assemblies
    ...
scripts:
    kallisto2concoct: kallisto2concoct.py
    prepRoary: prepareRoaryInput.R
    binFilter: binFilter.py
    ...
cores:
    fastp: 4
    megahit: 48
    crossMap: 24
    ...
params:
    cutfasta: 10000
    assemblyPreset: meta-sensitive
    assemblyMin: 1000
    ...
envs:
    metagem: metagem
    metawrap: metawrap
    prokkaroary: prokkaroary
Let's now look at each category in this config file.
In the path category, the root path is set automatically by the metaGEM.sh parser to the folder you are submitting jobs from.
This is where folders will be created to store the generated files:
~/cluster_login_home/
|-project_X/
|--root/
|---logs
|---dataset
|---qfiltered
|---assemblies
...
The scratch path is cluster specific, and you will likely need to consult the wiki for your institution's cluster to determine how it should be set. Generally there is a directory intended for high I/O jobs, usually exposed through a variable called something like $SCRATCHDIR, $TMPDIR, or $TMP. The Snakefile assumes that this variable points to a unique location for each job submission. Do not set the scratch path to one fixed directory if you are submitting jobs in parallel: multiple jobs would then copy and read files in the same temporary directory, which can cause errors.
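To illustrate why a per-job scratch location matters, here is a minimal Python sketch that creates a fresh working directory under a scratch base for every job. The helper name `make_job_scratch` is hypothetical and not part of metaGEM; on most clusters the scheduler already provides the per-job directory for you.

```python
import os
import tempfile

def make_job_scratch(base_dir, job_name):
    """Create and return a unique scratch directory for one job."""
    os.makedirs(base_dir, exist_ok=True)
    # mkdtemp always creates a new directory, so parallel jobs never share one
    return tempfile.mkdtemp(prefix=job_name + "_", dir=base_dir)

if __name__ == "__main__":
    # Demo inside a throwaway base directory (assumption: stands in for scratch)
    with tempfile.TemporaryDirectory() as base:
        first = make_job_scratch(base, "megahit")
        second = make_job_scratch(base, "megahit")
        print(first != second)
```

Two jobs with the same name still receive different directories, which is the property the Snakefile relies on.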
The folder category is simply a list of all the subfolders that are used or created throughout the metaGEM workflow. You can generate these folders by running:
metaGEM.sh -t createFolders
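What this task amounts to can be sketched in Python. This is not the metaGEM.sh implementation: `create_folders` is a hypothetical helper, and the dictionary in the demo only mirrors the folder entries shown in the config.yaml excerpt.

```python
import tempfile
from pathlib import Path

def create_folders(root, folders):
    """Create each subfolder named in the folder category under root."""
    created = []
    for name in folders.values():
        target = Path(root) / name
        target.mkdir(parents=True, exist_ok=True)  # safe to run repeatedly
        created.append(target)
    return created

if __name__ == "__main__":
    folder_config = {"data": "dataset", "logs": "logs", "assemblies": "assemblies"}
    # A temporary directory stands in for the project root in this demo
    with tempfile.TemporaryDirectory() as root:
        for path in create_folders(root, folder_config):
            print(path.name)
```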
The scripts category lists the scripts and other important files in the scripts folder that are used throughout the metaGEM workflow.
The cores category lists the number of cores you wish to allocate to each task or job. Note that these values are read by the Snakefile, while the number of cores requested through the metaGEM.sh parser or the cluster_config.json file is read by the cluster workload manager. Make sure that these values match when submitting jobs.
Let's say that we want to submit an assembly job with megahit, and the config.yaml file has 48 cores allocated to megahit by default, as shown at the top of this page. If we were to run the following code:
metaGEM.sh -t megahit -j 1 -c 2 -m 12 -h 2
Then the metaGEM.sh parser would request 1 job with 2 cores and 12 GB RAM, with a max runtime of 2 hours. Once the job starts, the Snakefile rule megahit will start running. The megahit call is implemented like so:
megahit -t {config[cores][megahit]} \
--presets {config[params][assemblyPreset]} \
--verbose \
--min-contig-len {config[params][assemblyMin]} \
-1 $(basename {input.R1}) \
-2 $(basename {input.R2}) \
-o tmp;
As you can see from the -t flag of the megahit call, the job looks into the config.yaml file to see what the value of megahit cores is set to. As mentioned, by default this is set to 48 cores, but the job submission only requested 2 cores. In such a case, since the job only has access to the resources it requested, only 2 cores will be used instead of the desired 48.
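The placeholder lookup can be imitated with plain Python string formatting, which resolves `{config[cores][megahit]}` as nested dictionary access. This is only an illustration: Snakemake applies its own formatting to shell commands, and the dictionary below is a hand-written stand-in for the parsed config.yaml excerpt rather than the real file.

```python
# Hand-written stand-in for the parsed config.yaml (assumption, excerpt only)
config = {
    "cores": {"fastp": 4, "megahit": 48, "crossMap": 24},
    "params": {"cutfasta": 10000, "assemblyPreset": "meta-sensitive", "assemblyMin": 1000},
}

# Shortened version of the megahit call, using the same placeholder syntax
template = (
    "megahit -t {config[cores][megahit]} "
    "--presets {config[params][assemblyPreset]} "
    "--min-contig-len {config[params][assemblyMin]}"
)

command = template.format(config=config)
print(command)
# megahit -t 48 --presets meta-sensitive --min-contig-len 1000
```

The rendered command asks megahit for 48 threads regardless of how many cores the scheduler actually granted, which is why the two settings need to agree.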
Even worse would be to request 48 cores for a certain task via a metaGEM.sh call while the config.yaml file only specifies 2 cores for that task. In this case you would be wasting 46 cores, and your cluster administrators will be very upset with you.
The moral of the story is to make sure that when you use the metaGEM.sh parser for a particular task, you request the number of cores that is specified for that task in the config.yaml file.
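A small pre-submission check along these lines could catch a mismatch before the job is queued. This is a sketch under stated assumptions: `check_cores` is a hypothetical helper, not something metaGEM provides, and the config dictionary is assumed to have the same shape as the cores category above.

```python
def check_cores(config, task, requested):
    """Compare cores requested from the scheduler with cores set in the config."""
    configured = config["cores"][task]
    if requested < configured:
        return (f"{task}: requested {requested} cores but the config asks for "
                f"{configured}; the tool will be limited to {requested}")
    if requested > configured:
        return (f"{task}: requested {requested} cores but the config uses only "
                f"{configured}; {requested - configured} cores would sit idle")
    return f"{task}: requested and configured cores match ({configured})"

if __name__ == "__main__":
    config = {"cores": {"fastp": 4, "megahit": 48, "crossMap": 24}}
    print(check_cores(config, "megahit", 2))
    print(check_cores(config, "megahit", 48))
```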
The params category contains a list of the parameters used by the various individual tools.
The envs category simply contains a list of the conda environment names used within the Snakefile rules. For example, you will notice that most rules in the Snakefile start by activating a certain environment:
set +u;source activate {config[envs][metagem]};set -u;
- Quality filter reads with fastp
- Assembly with megahit
- Draft bin sets with CONCOCT, MaxBin2, and MetaBAT2
- Refine & reassemble bins with metaWRAP
- Taxonomic assignment with GTDB-tk
- Relative abundances with bwa
- Reconstruct & evaluate genome-scale metabolic models with CarveMe and memote
- Species metabolic coupling analysis with SMETANA