Running pipeline
MetaRefSGB takes a set of mandatory and optional arguments to organise your genomes into species-, genus-, and family-level genome bins. You can inspect the whole set of arguments by typing the following command in your terminal:
MetaRefSGB --help
It shows the list of available arguments with a brief explanation of each:
NAME
MetaRefSGB -- organise genomes into species-level genome bins
VERSION
1.0 (20211020)
SYNOPSIS
MetaRefSGB [--work-dir=directory] [--label=value] [--release=value] [--mags=file] [--references=file]
[--input-dir=directory] [--extension=value] [--db=directory] [--nproc=num] [--xargs-nproc=num]
[--mash-threshold=num] [--checkm-completeness=num] [--checkm-contamination=num]
DESCRIPTION
MetaRefSGB is a scalable framework for organising genomes into species-level genome bins.
Please visit the official Wiki for additional details:
https://github.com/SegataLab/MetaRefSGB/wiki
The following options are available:
--work-dir=directory
Path to the working directory in which results will be located.
--label=value
Label for the new release.
--release=value
Label of the reference release.
Use --release=none to build a release from scratch.
--mags=file
Path to the file with the list of input MAGs.
--references=file
Path to the file with the list of input Reference Genomes.
--remove-genomes=file
Path to the file with the list of genomes that must be removed from MetaRefSGB.
--move-to-mags=file
Path to the file with the list of reference genomes that must be considered as MAGs.
--move-to-references=file
Path to the file with the list of MAGs that must be considered as reference genomes.
--reassign-genomes=file
Path to the file with the list of genomes that must be reassigned to different SGBs.
--genome=value
MetaRefSGB Unique Genome Identifier.
Must be used in conjunction with --inspect only.
--sample=value
Sample ID.
Must be used in conjunction with --inspect only.
--dataset=value
Dataset ID.
Must be used in conjunction with --inspect only.
--cluster=value
Cluster ID (SGB, GGB, or FGB).
Must be used in conjunction with --inspect only.
--file=file
Path to a one-column file with a list of MetaRefSGB Unique Genome Identifiers, samples, or datasets.
Must be used in conjunction with --inspect only.
--schema=value
MetaRefSGB Data Model schema (MAG, genome, or metadata).
Must be used in conjunction with --inspect only.
--output=file
Path to the file with the output of the --inspect command.
Must be used in conjunction with --inspect only.
--inspect
Retrieve information about genomes, samples, datasets, or clusters
--metadata=file
Path to the file with metadata about metagenomic samples.
Must be used in conjunction with --validate-input only.
--input-dir=directory
Path to the folder with input genomes.
--extension=value
File extension of the input genomes.
--supported-extensions
Print the list of supported file extensions for input genomes.
--db=directory
Directory with the MetaRefSGB framework and databases.
--nproc=num
Max nproc for parallel instructions.
--xargs-nproc=num
Max parallel xargs jobs for MASH disting.
--mash-threshold=num
Filter threshold on the MASH distance.
--checkm-completeness=num
Filter threshold on CheckM completeness.
--checkm-contamination=num
Filter threshold on CheckM contamination.
--default
Automatically set --nproc=8, --xargs-nproc=1, --mash-threshold=0.001, --checkm-completeness=50.0, and --checkm-contamination=5.0.
Remember to always use this flag before one of the above arguments, otherwise it will overwrite them with their default values.
--validate-input
Used in conjunction with --mags, --references, and --metadata arguments. Check whether input files are properly formatted.
Input GCAs will be tested against the RefSeq exclusion criteria to check whether they are Reference Genomes or MAGs.
Data will be validated against the MetaRefSGB Data Model (MDM).
https://github.com/SegataLab/MetaRefSGB/wiki/MDM-Schema
--retrieve-taxa=file
Automatically retrieve Reference Genomes taxonomic labels and NCBI taxa IDs.
https://github.com/SegataLab/MetaRefSGB/wiki/Running-pipeline
--skip-filter
Skip the filtering process and insert all the input genomes into the clustering configuration.
--use-filter=file
Skip the filtering process and use a precompiled list of genomes as the result of the filter.
--skip-checkm
Skip the quality-control process with CheckM.
--use-checkm=file
Skip CheckM and use a precompiled CheckM output log file.
--resolve-dependencies
Automatically check for external software dependencies and install required python modules.
EXIT STATUS
MetaRefSGB exits with one of the following values:
0 The pipeline has been correctly applied.
>0 An error occurred.
EXAMPLES
Try running MetaRefSGB by typing:
$ MetaRefSGB --work-dir=~/myrelease --label=Test --release=Jan21 --mags=~/MAGs.txt --references=~/genomes.txt
--input-dir=~/mygenomes --extension=fna --db=~/db --default
To expand the --default flag and explicitly set --nproc, --xargs-nproc, --mash-threshold, --checkm-completeness, and --checkm-contamination:
$ MetaRefSGB --work-dir=~/myrelease --label=Test --release=Jan21 --mags=~/MAGs.txt --references=~/genomes.txt
--input-dir=~/mygenomes --extension=fna --db=~/db --nproc=8 --xargs-nproc=1 --mash-threshold=0.001
--checkm-completeness=50.0 --checkm-contamination=5.0
To explicitly change the value of just one of the above arguments, remember to always put the --default flag before specifying any of them.
Otherwise, it will overwrite the explicitly assigned arguments with their default values:
$ MetaRefSGB --work-dir=~/myrelease --label=Test --release=Jan21 --mags=~/MAGs.txt --references=~/genomes.txt
--input-dir=~/mygenomes --extension=fna --db=~/db --default --nproc=16
In order to validate input data, try runnning:
$ MetaRefSGB --mags=~/MAGs.txt --references=~/genomes.txt --metadata=~/metadata.txt --validate-input
To validate just one input data:
$ MetaRefSGB --mags=~/MAGs.txt --validate-input
To retrieve taxonomic labels and NCBI taxa IDs of the input Reference Genomes:
$ MetaRefSGB --retrieve-taxa=~/genomes.txt
To automatically check for external software dependencies and resolve required python modules:
$ MetaRefSGB --resolve-dependencies
To retrieve informations about genomes, samples, datasets, or clusters into the Jan21 release:
$ MetaRefSGB --inspect --genome=M1663737656 --db=~/db --release=Jan21
$ MetaRefSGB --inspect --sample=833 --db=~/db --release=Jan21
$ MetaRefSGB --inspect --dataset=AsnicarF_2020 --db=~/db --release=Jan21
$ MetaRefSGB --inspect --cluster=SGB5075 --db=~/db --release=Jan21
To search for multiple genomes, samples, datasets, or clusters with a single run:
$ MetaRefSGB --inspect --file=~/mygenomes.txt --db=~/db --release=Jan21
In order to redirect the output of the --inspect command to a file:
$ MetaRefSGB --inspect --genome=M1663737656 --db=~/db --release=Jan21 --output=~/M1663737656.json
To inspect the MetaRefSGB Data Model schemas (MAG, genome, or metadata):
$ MetaRefSGB --inspect --schema=MAG
BUGS
If you encounter a problem while running MetaRefSGB, you may want to have a look at known issues or open a new one:
https://github.com/SegataLab/MetaRefSGB/issues
CREDITS
Please, consider to credit MetaRefSGB by citing:
TBA
Remember to star the MetaRefSGB repository on Github and follow the @cibiocm lab activity on Twitter!
https://github.com/SegataLab/MetaRefSGB
You can also use the special character ? to expand the help of a specific argument, for example:
MetaRefSGB --work-dir=?
This will output the following message:
MetaRefSGB helper: --work-dir=directory
The --work-dir is a folder in which MetaRefSGB will put all the pipeline intermediate outputs
up to the generation of the new clustering configuration.
It must be empty at the beginning, otherwise the pipeline will try to resume a potentially interrupted
run if some required intermediate results exist.
Both relative and absolute paths are allowed.
Remember to escape the question mark character if your shell would otherwise try to interpret it as a wildcard:
MetaRefSGB --work-dir=\?
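Why the escape matters can be seen with a small sketch that uses echo in place of MetaRefSGB. The temporary directory and the file name --work-dir=x are illustrative assumptions, chosen only so that the unquoted pattern has something to match:

```shell
# Show how an unquoted "?" can be expanded by the shell before the program sees it.
# The directory argument and the file name "--work-dir=x" are illustrative only.
show_glob() (
    cd "$1" || exit 1
    : > ./--work-dir=x        # a file whose name happens to match the pattern
    echo --work-dir=?         # unquoted: the shell replaces it with the file name
    echo --work-dir=\?        # escaped: passed through unchanged
    echo '--work-dir=?'       # quoted: passed through unchanged
)

show_glob "$(mktemp -d)"
```

Single quotes work as well as the backslash. Some shells, such as zsh, report an error instead of passing the pattern through when nothing matches it.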
Before running MetaRefSGB, you should organise your genomes. Both MAG and Reference Genome files must be located in the same folder and must have the same file extension. You can make the file extension of your genome files uniform by typing the following commands in your terminal:
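INPUTS_DIR=~/mygenomes
CURRENT_EXTENSION="fa"
NEW_EXTENSION="fna"
# Rename every file ending in CURRENT_EXTENSION so that it ends in NEW_EXTENSION.
# -L follows symbolic links; -exec copes with file names that contain spaces.
find -L "${INPUTS_DIR}" -type f -iname "*.${CURRENT_EXTENSION}" \
  -exec sh -c 'ext=$1; shift; for INPUT in "$@"; do mv "$INPUT" "${INPUT%.*}.$ext"; done' \
  sh "${NEW_EXTENSION}" {} +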
Before running this code, assign the path to the folder with your input genomes to the INPUTS_DIR variable, and the current and new file extensions to the CURRENT_EXTENSION and NEW_EXTENSION variables respectively.
Making the file extension of the genome files uniform is mandatory: the CheckM step of the pipeline, which estimates the quality of your input genomes, needs it to run properly.
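To see whether the extensions in a folder are already uniform, a minimal sketch such as the following can help. The function name list_extensions is an assumption made for this example and is not part of MetaRefSGB:

```shell
# List the distinct file extensions found under a folder, with their counts.
# A single output line means the extensions are already uniform.
# list_extensions is a hypothetical helper, not a MetaRefSGB command.
list_extensions() {
    find -L "$1" -type f -name '*.*' \
        | sed 's/.*\.//' \
        | sort \
        | uniq -c
}

# Example call (the path is a placeholder):
# list_extensions ~/mygenomes
```

Only the text after the last dot is counted as the extension, so a file named like AsnicarF_2017__MV_FEI1_t1Q14__bin.2.fa is reported under fa.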
The --mags and --references arguments are both mandatory and must point to the MAGs and Reference Genomes definition files. These files must be properly structured before running the pipeline.
Both of them must contain a column with the genome names (without their file extension). The Reference Genomes definition file must also contain two additional columns, one with the taxonomy labels and one with the NCBI taxa IDs.
The first line of both files is the header and must start with the # character.
Here is an example of a MAGs definition file to pass with the --mags argument:
# mag_id
AsnicarF_2017__MV_FEI1_t1Q14__bin.2
AsnicarF_2017__MV_FEI1_t1Q14__bin.4
AsnicarF_2017__MV_FEI1_t1Q14__bin.6
...
And here is an example of a Reference Genomes definition file to pass with the --references argument:
# genome_id taxonomy taxonomy_id
GCA_000003135 k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_longum|t__Bifidobacterium_longum_subsp_longum_ATCC_55813 2|201174|1760|85004|31953|1678|216816|548480
GCA_000003645 k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillaceae|g__Bacillus|s__Bacillus_cereus|t__Bacillus_cereus_m1293 2|1239|91061|1385|186817|1386|1396|526973
GCA_000003925 k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillaceae|g__Bacillus|s__Bacillus_mycoides|t__Bacillus_mycoides_DSM_2048 2|1239|91061|1385|186817|1386|1405|526997
...
Other columns in addition to the mandatory ones will not be considered.
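Since the definition files list genome names without their file extension, it can be useful to confirm that every listed genome has a matching file in the input folder. The sketch below assumes that the genome name is the first whitespace-separated column, as in the examples above. The function name missing_genomes is hypothetical:

```shell
# Print the IDs listed in a definition file that have no matching genome file.
# Usage: missing_genomes <definition-file> <input-dir> <extension>
# Assumption: the genome name is the first whitespace-separated column.
missing_genomes() {
    # Skip the "#" header line and read the first column of every other line.
    grep -v '^#' "$1" | awk 'NF { print $1 }' | while read -r id; do
        [ -f "$2/$id.$3" ] || printf '%s\n' "$id"
    done
}

# Example call (paths are placeholders):
# missing_genomes ~/MAGs.txt ~/mygenomes fna
```

An empty output means that every listed genome was found.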
In order to automatically retrieve the taxonomic labels and the NCBI taxa IDs of the input reference genomes, you can run the following command in your terminal:
MetaRefSGB --retrieve-taxa=~/genomes.txt
It accepts either a plain uncompressed file or a BZ2-compressed file as input, but it always produces a BZ2-compressed output file with the prefix corrected_.
Before running MetaRefSGB, you may want to finally check if both the MAGs and Reference Genomes definition files are properly formatted by typing:
MetaRefSGB --mags=~/MAGs.txt --references=~/genomes.txt --validate-input
Please note that you may need to change the paths specified with the --retrieve-taxa, --mags, and --references arguments so that they point to the files on your file system.
This will also validate your input data against the MetaRefSGB Data Model (MDM). Have a look at the models area on the GitHub repository or the dedicated wiki page for additional information about MDM.
This may result in a long list of errors if your input does not respect the MDM specifications. If you are building a private release, you can ignore them, but make sure that your inputs contain the minimum required columns before running the pipeline (mag_id for the MAGs definition file and genome_id, taxonomy, and taxonomy_id for the Reference Genomes definition file, as shown in the examples above).
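For a private release, the minimum required columns mentioned above can be checked with a lightweight sketch like the one below. It is not a replacement for --validate-input. The function name has_columns is an assumption, and the header is assumed to be whitespace-separated as in the examples:

```shell
# Succeed only if the first line starts with "#" and names every required column.
# Usage: has_columns <definition-file> <column>...
# has_columns is a hypothetical helper, not a MetaRefSGB command.
has_columns() {
    file="$1"; shift
    header=$(head -n 1 "$file")
    case "$header" in
        '#'*) ;;
        *) echo "header must start with #" >&2; return 1 ;;
    esac
    for col in "$@"; do
        # Split the header into one word per line and look for an exact match.
        printf '%s\n' "$header" | tr -s '# \t' '\n' | grep -qxF "$col" || {
            echo "missing column: $col" >&2; return 1; }
    done
}

# Example calls (paths are placeholders):
# has_columns ~/MAGs.txt mag_id
# has_columns ~/genomes.txt genome_id taxonomy taxonomy_id
```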
In MetaRefSGB, new releases are always incremental. This means that a new release is always generated starting from the clustering configuration of a previously built release, which serves as the reference and is updated by adding a new set of MAGs and/or Reference Genomes. You can choose the release that best fits your needs by looking at the releases area of the repository.
We strongly recommend using the latest publicly available MetaRefSGB release.
Now that you have organised your input genomes and correctly formatted both the MAGs and Reference Genomes definition files, you can finally run the MetaRefSGB pipeline by typing the following command in your terminal:
MetaRefSGB --work-dir=~/myrelease \
--label=Test \
--release=Jan21 \
--mags=~/MAGs.txt \
--references=~/genomes.txt \
--input-dir=~/mygenomes \
--extension=fna \
--db=~/db \
--default
In this specific example, we selected Jan21 as the reference release. Input genomes are all located under the ~/mygenomes folder and they all have the same fna file extension. The database directory specified with the --db argument can initially be empty and will be populated with data related to the MetaRefSGB release specified with the --release argument. The working directory specified with the --work-dir argument can also be empty and will be populated while processing the new release.
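According to the --work-dir=? help message, the pipeline tries to resume a potentially interrupted run when the working directory already contains intermediate results. If you intend to start a fresh run, a quick emptiness check such as the following sketch may help. The function name is_empty_dir is an assumption made for this example:

```shell
# Succeed only if the given path is an existing directory with no entries,
# hidden files included. is_empty_dir is a hypothetical helper.
is_empty_dir() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Example call (the path is a placeholder):
# is_empty_dir ~/myrelease || echo "work dir is missing or not empty" >&2
```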
Note that the --default flag is required to set the optional arguments to their default values. However, you can also expand it by explicitly setting the optional arguments, as in the following example:
MetaRefSGB --work-dir=~/myrelease \
--label=Test \
--release=Jan21 \
--mags=~/MAGs.txt \
--references=~/genomes.txt \
--input-dir=~/mygenomes \
--extension=fna \
--db=~/db \
--nproc=8 \
--xargs-nproc=1 \
--mash-threshold=0.001 \
--checkm-completeness=50.0 \
--checkm-contamination=5.0
If you want to explicitly change the value of just one of the optional arguments, you can write something like the following line:
MetaRefSGB --work-dir=~/myrelease \
--label=Test \
--release=Jan21 \
--mags=~/MAGs.txt \
--references=~/genomes.txt \
--input-dir=~/mygenomes \
--extension=fna \
--db=~/db \
--default \
--nproc=10
Remember to always use the --default flag if you want to avoid setting every optional argument explicitly. Also remember to always put the --default flag before the optional arguments, otherwise it will overwrite the explicitly assigned values with the defaults.
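The ordering rule is what one would expect from a parser that reads arguments from left to right and lets later ones overwrite earlier ones. The sketch below is not MetaRefSGB's actual parser. It is a hypothetical illustration limited to --nproc, using the default value of 8 listed in the help:

```shell
# Hypothetical left-to-right parser: each argument overwrites earlier values.
# This only illustrates the ordering rule; it is not MetaRefSGB's own code.
parse_nproc() {
    nproc_value=""
    for arg in "$@"; do
        case "$arg" in
            --default)  nproc_value=8 ;;                    # default from the help
            --nproc=*)  nproc_value="${arg#--nproc=}" ;;    # explicit value
        esac
    done
    echo "$nproc_value"
}

parse_nproc --default --nproc=16   # prints 16
parse_nproc --nproc=16 --default   # prints 8
```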
Warning!
Be careful when explicitly setting the --xargs-nproc argument. It is used in conjunction with the --nproc argument to heavily parallelise the mash dist operations. In this case, --nproc is used to parallelise each single MASH instance, while --xargs-nproc determines how many MASH processes run in parallel. Thus, the total number of spawned processes equals --xargs-nproc * --nproc.
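As a worked example of this product, the sketch below compares --nproc * --xargs-nproc with the number of available cores. The function name fits_cores is hypothetical, and getconf _NPROCESSORS_ONLN is assumed to be available on your system (it is on common Linux and macOS installations):

```shell
# Report the total number of processes and succeed only if it fits the cores.
# Usage: fits_cores <nproc> <xargs-nproc> <available-cores>
# fits_cores is a hypothetical helper, not a MetaRefSGB command.
fits_cores() {
    total=$(( $1 * $2 ))
    echo "total processes: $total"
    [ "$total" -le "$3" ]
}

# With the default values (--nproc=8, --xargs-nproc=1) the total is 8.
fits_cores 8 1 "$(getconf _NPROCESSORS_ONLN)" \
    || echo "warning: more processes than available cores" >&2
```

For instance, --nproc=8 with --xargs-nproc=4 results in 32 processes, which oversubscribes a 16-core machine.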