Running pipeline

Fabio Cumbo edited this page Oct 20, 2021 · 9 revisions

MetaRefSGB takes a set of mandatory and optional arguments to organise your genomes into species-, genus-, and family-level genome bins. You can inspect the full list of arguments by typing the following command in your terminal:

MetaRefSGB --help

It shows the list of available arguments with a brief explanation of their meaning:

NAME
	MetaRefSGB -- organise genomes into species-level genome bins

VERSION
	1.0 (20211020)

SYNOPSIS
	MetaRefSGB [--work-dir=directory] [--label=value] [--release=value] [--mags=file] [--references=file]
	           [--input-dir=directory] [--extension=value] [--db=directory] [--nproc=num] [--xargs-nproc=num]
	           [--mash-threshold=num] [--checkm-completeness=num] [--checkm-contamination=num]

DESCRIPTION
	MetaRefSGB is a scalable framework for organising genomes into species-level genome bins.
	Please visit the official Wiki for additional details:
	https://github.com/SegataLab/MetaRefSGB/wiki

	The following options are available:

	--work-dir=directory
		Path to the working directory in which results will be located.

	--label=value
		Label for the new release.

	--release=value
		Label of the reference release.
		Use --release=none to build a release from scratch.

	--mags=file
		Path to the file with the list of input MAGs.

	--references=file
		Path to the file with the list of input Reference Genomes.

	--remove-genomes=file
		Path to the file with the list of genomes that must be removed from MetaRefSGB.

	--move-to-mags=file
		Path to the file with the list of reference genomes that must be considered as MAGs.

	--move-to-references=file
		Path to the file with the list of MAGs that must be considered as reference genomes.

	--reassign-genomes=file
		Path to the file with the list of genomes that must be reassigned to different SGBs.

	--genome=value
		MetaRefSGB Unique Genome Identifier.
		Must be used in conjunction with --inspect only.

	--sample=value
		Sample ID.
		Must be used in conjunction with --inspect only.

	--dataset=value
		Dataset ID.
		Must be used in conjunction with --inspect only.

	--cluster=value
		Cluster ID (SGB, GGB, or FGB).
		Must be used in conjunction with --inspect only.

	--file=file
		Path to a one-column file with a list of MetaRefSGB Unique Genome Identifiers, samples, or datasets.
		Must be used in conjunction with --inspect only.

	--schema=value
		MetaRefSGB Data Model schema (MAG, genome, or metadata).
		Must be used in conjunction with --inspect only.

	--output=file
		Path to the file with the output of the --inspect command.
		Must be used in conjunction with --inspect only.

	--inspect
		Retrieve information about genomes, samples, datasets, or clusters

	--metadata=file
		Path to the file with metadata about metagenomic samples.
		Must be used in conjunction with --validate-input only.

	--input-dir=directory
		Path to the folder with input genomes.

	--extension=value
		File extension of the input genomes.

	--supported-extensions
		Print the list of supported file extensions for input genomes.

	--db=directory
		Directory with the MetaRefSGB framework and databases.

	--nproc=num
		Max nproc for parallel instructions.

	--xargs-nproc=num
		Max parallel xargs jobs for MASH disting.

	--mash-threshold=num
		Filter threshold on the MASH distance.

	--checkm-completeness=num
		Filter threshold on CheckM completeness.

	--checkm-contamination=num
		Filter threshold on CheckM contamination.

	--default
		Automatically set --nproc=8, --xargs-nproc=1, --mash-threshold=0.001, --checkm-completeness=50.0, and --checkm-contamination=5.0.
		Remember to always use this flag before one of the above arguments, otherwise it will overwrite them with their default values.

	--validate-input
		Used in conjunction with --mags, --references, and --metadata arguments. Check whether input files are properly formatted.
		Input GCAs will be tested against the RefSeq exclusion criteria to check whether they are Reference Genomes or MAGs.
		Data will be validated against the MetaRefSGB Data Model (MDM).
		https://github.com/SegataLab/MetaRefSGB/wiki/MDM-Schema

	--retrieve-taxa=file
		Automatically retrieve Reference Genomes taxonomic labels and NCBI taxa IDs.
		https://github.com/SegataLab/MetaRefSGB/wiki/Running-pipeline

	--skip-filter
		Skip the filtering process and insert all the input genomes into the clustering configuration.

	--use-filter=file
		Skip the filtering process and use a precompiled list of genomes as the result of the filter.

	--skip-checkm
		Skip the quality-control process with CheckM.

	--use-checkm=file
		Skip CheckM and use a precompiled CheckM output log file.

	--resolve-dependencies
		Automatically check for external software dependencies and install required python modules.

EXIT STATUS
	MetaRefSGB exits with one of the following values:

	0	The pipeline has been correctly applied.
	>0	An error occurred.

EXAMPLES
	Try running MetaRefSGB by typing:

		$ MetaRefSGB --work-dir=~/myrelease --label=Test --release=Jan21 --mags=~/MAGs.txt --references=~/genomes.txt
		             --input-dir=~/mygenomes --extension=fna --db=~/db --default

	To expand the --default flag and explicitly set --nproc, --xargs-nproc, --mash-threshold, --checkm-completeness, and --checkm-contamination:

		$ MetaRefSGB --work-dir=~/myrelease --label=Test --release=Jan21 --mags=~/MAGs.txt --references=~/genomes.txt
		             --input-dir=~/mygenomes --extension=fna --db=~/db --nproc=8 --xargs-nproc=1 --mash-threshold=0.001
		             --checkm-completeness=50.0 --checkm-contamination=5.0

	To explicitly change the value of just one of the above arguments, remember to always put the --default flag before specifying any of them.
	Otherwise, it will overwrite the explicitly assigned arguments with their default values:

		$ MetaRefSGB --work-dir=~/myrelease --label=Test --release=Jan21 --mags=~/MAGs.txt --references=~/genomes.txt
		             --input-dir=~/mygenomes --extension=fna --db=~/db --default --nproc=16

	In order to validate input data, try runnning:

		$ MetaRefSGB --mags=~/MAGs.txt --references=~/genomes.txt --metadata=~/metadata.txt --validate-input

	To validate just one input data:

		$ MetaRefSGB --mags=~/MAGs.txt --validate-input

	To retrieve taxonomic labels and NCBI taxa IDs of the input Reference Genomes:

		$ MetaRefSGB --retrieve-taxa=~/genomes.txt

	To automatically check for external software dependencies and resolve required python modules:

		$ MetaRefSGB --resolve-dependencies

	To retrieve informations about genomes, samples, datasets, or clusters into the Jan21 release:

		$ MetaRefSGB --inspect --genome=M1663737656 --db=~/db --release=Jan21

		$ MetaRefSGB --inspect --sample=833 --db=~/db --release=Jan21

		$ MetaRefSGB --inspect --dataset=AsnicarF_2020 --db=~/db --release=Jan21

		$ MetaRefSGB --inspect --cluster=SGB5075 --db=~/db --release=Jan21

	To search for multiple genomes, samples, datasets, or clusters with a single run:

		$ MetaRefSGB --inspect --file=~/mygenomes.txt --db=~/db --release=Jan21

	In order to redirect the output of the --inspect command to a file:

		$ MetaRefSGB --inspect --genome=M1663737656 --db=~/db --release=Jan21 --output=~/M1663737656.json

	To inspect the MetaRefSGB Data Model schemas (MAG, genome, or metadata):

		$ MetaRefSGB --inspect --schema=MAG

BUGS
	If you encounter a problem while running MetaRefSGB, you may want to have a look at known issues or open a new one:
	https://github.com/SegataLab/MetaRefSGB/issues

CREDITS
	Please, consider to credit MetaRefSGB by citing:
	TBA

	Remember to star the MetaRefSGB repository on Github and follow the @cibiocm lab activity on Twitter!
	https://github.com/SegataLab/MetaRefSGB

You can also pass the special character ? as the value of a specific argument to expand its help, for example:

MetaRefSGB --work-dir=?

This will output the following message:

MetaRefSGB helper: --work-dir=directory

    The --work-dir is a folder in which MetaRefSGB will put all the pipeline intermediate outputs
    up to the generation of the new clustering configuration.

    It must be empty at the beginning, otherwise the pipeline will try to resume a potentially interrupted
    run if some required intermediate results exist.

    Both relative and absolute paths are allowed.

Remember to escape the question mark if your shell tries to interpret it as a wildcard:

MetaRefSGB --work-dir=\?
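
The escaping is needed because of the shell's filename expansion: an unquoted ? matches any single character of an existing file name. The sketch below reproduces the effect in a throwaway directory (the file name --work-dir=x is made up for the demonstration and MetaRefSGB is not called at all) and shows that single quotes protect the argument just as well as the backslash.

```shell
# Demonstration only: everything happens inside a temporary directory.
glob_demo() (
    demo_dir=$(mktemp -d) || exit 1
    cd "$demo_dir" || exit 1
    # Create a file whose name happens to match the pattern --work-dir=?
    : > ./--work-dir=x
    printf 'unquoted: %s\n' --work-dir=?      # expanded by the shell
    printf 'escaped:  %s\n' --work-dir=\?     # passed through literally
    printf 'quoted:   %s\n' '--work-dir=?'    # passed through literally
    cd / && rm -rf "$demo_dir"
)

glob_demo
```

In bash and other POSIX shells, the first line prints the name of the matching file instead of the literal question mark, while the other two keep it intact.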

Organise your genomes

Before running MetaRefSGB, you should organise your genomes. Both MAG and Reference Genome files must be located in the same folder and must have the same file extension. You can make the extension of your genome files uniform by typing the following commands in your terminal:

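INPUTS_DIR=~/mygenomes
CURRENT_EXTENSION="fa"
NEW_EXTENSION="fna"
# Follow symlinks (-L) and rename each matching file, replacing its last extension.
# Passing the path as a positional argument keeps names with spaces or quotes safe.
find -L "${INPUTS_DIR}" \
        -type f -iname "*.${CURRENT_EXTENSION}" \
        -exec sh -c 'mv -- "$1" "${1%.*}.$2"' sh {} "${NEW_EXTENSION}" \;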

Before running this code, set INPUTS_DIR to the path of the folder with your input genomes, and set CURRENT_EXTENSION and NEW_EXTENSION to the current and the new file extension respectively.

Making the genome file extension uniform is mandatory: the CheckM step of the pipeline, which estimates the quality of your input genomes, relies on it.
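
To confirm that the folder really contains a single extension before launching the pipeline, a quick check along these lines may help. This is a minimal sketch and not part of MetaRefSGB: the function names list_extensions and has_uniform_extension are hypothetical, and files without a dot in their name are ignored.

```shell
# Print the distinct file extensions found under a directory, one per line.
list_extensions() {
    find -L "$1" -type f -name '*.*' | sed 's/.*\.//' | sort -u
}

# Succeed only when all the files found share a single extension.
has_uniform_extension() {
    count=$(list_extensions "$1" | wc -l)
    [ "$((count))" -eq 1 ]
}

INPUTS_DIR="${INPUTS_DIR:-$HOME/mygenomes}"
if [ -d "$INPUTS_DIR" ]; then
    if has_uniform_extension "$INPUTS_DIR"; then
        echo "All files in $INPUTS_DIR share one extension"
    else
        echo "Mixed extensions found in $INPUTS_DIR:" >&2
        list_extensions "$INPUTS_DIR" >&2
    fi
fi
```

Any stray file in the folder (a README, a hidden system file) also shows up in the list, which is usually what you want to know before CheckM runs.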

Format your MAGs and Reference Genomes definition files

Arguments --mags and --references are both mandatory and must point to the MAGs and Reference Genomes definition files. They must be properly structured before running the pipeline.

Both files must contain a column with the genome names (without their file extension). The Reference Genomes definition file must also contain two additional columns, one with the taxonomy labels and one with the NCBI taxa IDs.

The first line of each file is the header and must start with the # character.

Here is an example of a MAGs definition file to pass with the --mags argument:

# mag_id
AsnicarF_2017__MV_FEI1_t1Q14__bin.2
AsnicarF_2017__MV_FEI1_t1Q14__bin.4
AsnicarF_2017__MV_FEI1_t1Q14__bin.6
...

And here is an example of a Reference Genomes definition file to pass with the --references argument:

# genome_id	taxonomy	taxonomy_id
GCA_000003135	k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_longum|t__Bifidobacterium_longum_subsp_longum_ATCC_55813	2|201174|1760|85004|31953|1678|216816|548480
GCA_000003645	k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillaceae|g__Bacillus|s__Bacillus_cereus|t__Bacillus_cereus_m1293	2|1239|91061|1385|186817|1386|1396|526973
GCA_000003925	k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillaceae|g__Bacillus|s__Bacillus_mycoides|t__Bacillus_mycoides_DSM_2048	2|1239|91061|1385|186817|1386|1405|526997
...

Other columns in addition to the mandatory ones will not be considered.
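
Since the MAGs definition file only needs a header and one genome name per line, it can be derived from the file names in the input folder. The following is a sketch under stated assumptions: make_mags_list is a hypothetical helper, and the optional third argument is a file name pattern used to tell MAGs apart from Reference Genomes stored in the same folder (the pattern '*__bin.*' below simply mirrors the names in the example above and may not fit your data).

```shell
# Hypothetical helper: print a MAGs definition file built from file names.
# $1 = folder with the genomes
# $2 = file extension (without the dot)
# $3 = optional file name pattern, without extension (default: all files)
make_mags_list() {
    dir=$1
    ext=$2
    pattern=${3:-*}
    printf '# mag_id\n'
    find -L "$dir" -type f -name "${pattern}.${ext}" |
        sed -e 's|.*/||' -e "s|\.${ext}\$||" | sort
}
```

For instance, `make_mags_list ~/mygenomes fna '*__bin.*' > ~/MAGs.txt` would write the header followed by the sorted genome names with the fna extension removed.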

In order to automatically retrieve the taxonomic labels and the NCBI taxa IDs of the input reference genomes, you can run the following command in your terminal:

MetaRefSGB --retrieve-taxa=~/genomes.txt

It accepts either a plain (uncompressed) text file or a BZ2-compressed file as input, but it always produces a BZ2-compressed output file whose name starts with the prefix corrected_.

Before running MetaRefSGB, you may want to finally check if both the MAGs and Reference Genomes definition files are properly formatted by typing:

MetaRefSGB --mags=~/MAGs.txt --references=~/genomes.txt --validate-input

Please note that you will probably need to change the paths passed to the --retrieve-taxa, --mags, and --references arguments so that they match the files on your file system.

This will also validate your input data against the MetaRefSGB Data Model (MDM). Have a look at the models area on the GitHub repository or the dedicated wiki page for additional information about MDM.

This may produce a long list of errors if your input does not comply with the MDM specification. If you are building a private release, you can ignore them, but make sure that your inputs contain the minimum required columns before running the pipeline (mag_id for the MAGs definition file and genome_id, taxonomy, and taxonomy_id for the Reference Genomes definition file, as shown in the examples above).
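
For a private release, the minimum requirements described above (a header starting with # and the mandatory tab-separated columns) can be checked independently of the MDM validation. The sketch below is not part of MetaRefSGB: check_definition_file is a hypothetical helper that only counts columns and does not inspect their content.

```shell
# Hypothetical pre-check: verify the header line and the minimum number of
# tab-separated columns of a definition file.
# $1 = definition file
# $2 = minimum number of columns (1 for MAGs, 3 for Reference Genomes)
check_definition_file() {
    awk -F '\t' -v min="$2" '
        NR == 1 && $0 !~ /^#/ {
            print "line 1: the header must start with #"; bad = 1
        }
        NR > 1 && NF < min {
            print "line " NR ": expected at least " min " columns, found " NF
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Running `check_definition_file ~/MAGs.txt 1` and `check_definition_file ~/genomes.txt 3` returns a non-zero exit status and lists the offending lines when something is missing.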

Choose a reference release

In MetaRefSGB, new releases are always incremental: each one starts from the clustering configuration of a previously built reference release, which is then updated by adding a new set of MAGs and/or Reference Genomes. You can choose the release that best fits your needs from the releases area of the repository.

We strongly recommend using the latest publicly available MetaRefSGB release.

Running MetaRefSGB

Now that you have organised your input genomes and correctly formatted both the MAGs and Reference Genomes definition files, you can finally run the MetaRefSGB pipeline by typing the following command in your terminal:

MetaRefSGB --work-dir=~/myrelease \
           --label=Test \
           --release=Jan21 \
           --mags=~/MAGs.txt \
           --references=~/genomes.txt \
           --input-dir=~/mygenomes \
           --extension=fna \
           --db=~/db \
           --default

In this specific example, we selected Jan21 as the reference release. The input genomes are all located in the ~/mygenomes folder and share the fna file extension. The database directory specified with the --db argument can initially be empty; it will be populated with the data of the MetaRefSGB release specified with the --release argument. The working directory specified with the --work-dir argument should be empty for a fresh run (if it already holds intermediate results, the pipeline tries to resume an interrupted run) and will be populated while the new release is processed.
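
Because the definition files list genome names without extension while the genomes themselves live in the input folder, it can be worth confirming that every listed name has a matching file before starting a long run. This is a minimal sketch, not a MetaRefSGB feature: missing_genomes is a hypothetical helper, and it assumes that the genome files sit directly in the input folder and are named after the ID followed by the extension.

```shell
# Hypothetical pre-flight check: print the IDs of a definition file that have
# no matching "<id>.<extension>" file in the input folder.
# $1 = definition file, $2 = input folder, $3 = file extension
missing_genomes() {
    grep -v '^#' "$1" | cut -f 1 | while IFS= read -r id; do
        [ -n "$id" ] || continue                  # skip empty lines
        [ -e "$2/$id.$3" ] || printf '%s\n' "$id"
    done
}
```

An empty output from `missing_genomes ~/genomes.txt ~/mygenomes fna` means that every Reference Genome listed has a corresponding file; the same call works for the MAGs definition file.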

Note that the --default flag is required in order to set the optional arguments to their default values. Alternatively, you can expand it by explicitly setting each optional argument, as in the following example:

MetaRefSGB --work-dir=~/myrelease \
           --label=Test \
           --release=Jan21 \
           --mags=~/MAGs.txt \
           --references=~/genomes.txt \
           --input-dir=~/mygenomes \
           --extension=fna \
           --db=~/db \
           --nproc=8 \
           --xargs-nproc=1 \
           --mash-threshold=0.001 \
           --checkm-completeness=50.0 \
           --checkm-contamination=5.0

If you want to explicitly change the value of just one of the optional arguments, you can write something like the following line:

MetaRefSGB --work-dir=~/myrelease \
           --label=Test \
           --release=Jan21 \
           --mags=~/MAGs.txt \
           --references=~/genomes.txt \
           --input-dir=~/mygenomes \
           --extension=fna \
           --db=~/db \
           --default \
           --nproc=10

Remember to use the --default flag whenever you do not want to set every optional argument yourself, since it assigns them their default values. Also remember to always put the --default flag before any explicitly set optional argument; otherwise it overwrites the explicitly assigned values with the defaults.
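
The ordering rule can be pictured as arguments being applied from left to right, with the last assignment winning. The snippet below is only an illustrative model of that behaviour, not the actual MetaRefSGB parser: resolve_nproc is a hypothetical function that tracks --nproc alone, using the default of 8 reported in the help.

```shell
# Illustrative model only: arguments are applied left to right and --default
# resets --nproc to its default value (8) whenever it is encountered.
resolve_nproc() {
    nproc_value=""
    for arg in "$@"; do
        case $arg in
            --default)  nproc_value=8 ;;
            --nproc=*)  nproc_value=${arg#--nproc=} ;;
        esac
    done
    printf '%s\n' "$nproc_value"
}

resolve_nproc --default --nproc=16   # the explicit value survives
resolve_nproc --nproc=16 --default   # the explicit value is overwritten
```

The first call prints 16 and the second prints 8, which is why --default belongs before the arguments you want to customise.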


Warning!

Be careful when explicitly setting the --xargs-nproc argument. It is used together with the --nproc argument to heavily parallelise the mash dist operations: --nproc parallelises each single MASH instance, while --xargs-nproc determines how many MASH processes run in parallel. The total number of spawned processes is therefore equal to --xargs-nproc * --nproc. For example, --nproc=8 with --xargs-nproc=4 spawns 32 processes.
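
The product above can be compared against the number of cores on your machine before choosing the two values. This is a small sketch with hypothetical helper names (total_mash_processes and max_xargs_nproc); it assumes that staying at or below the number of online cores is the goal, which may not hold on shared or scheduled systems.

```shell
# Total number of processes spawned during the mash dist step.
# $1 = --nproc value, $2 = --xargs-nproc value
total_mash_processes() {
    echo $(( $1 * $2 ))
}

# Largest --xargs-nproc that keeps the total within a given number of cores.
# $1 = available cores, $2 = --nproc value
max_xargs_nproc() {
    limit=$(( $1 / $2 ))
    [ "$limit" -ge 1 ] || limit=1
    echo "$limit"
}

# Number of online cores; falls back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "cores available: $cores"
echo "--nproc=8 --xargs-nproc=4 spawns $(total_mash_processes 8 4) processes"
echo "with --nproc=8, keep --xargs-nproc at or below $(max_xargs_nproc "$cores" 8)"
```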