Core ELEment Bias Removal In Metagenome Binned ORthologs
A pipeline written in Snakemake to automatically generate pangenomes from metagenome-assembled genomes (MAGs).
- Snakemake
- MMseqs2
- Bakta
- Biopython
- CheckM
- Pandas
- Rust toolchain
- Panaroo
NOTE: Conda is used to call different environments and dependencies (see Snakemake file).
Install the required packages using conda/mamba:
git clone https://github.com/bacpop/CELEBRIMBOR.git
cd CELEBRIMBOR
mamba env create -f environment.yml
mamba activate celebrimbor
Download the required bakta database file:
bakta_db download --output /path/to/database
You can also use the light bakta database if using a suitable version of bakta:
bakta_db download --output /path/to/database --type light
Install cgt (this installs the cgt_bacpop executable in the ./bin directory):
cargo install cgt_bacpop --root .
Or to build from source:
git clone https://github.com/bacpop/cgt.git
cd cgt
cargo install --path "."
An alternative, if you are having trouble with the above, is to use the CELEBRIMBOR docker container. If you are comfortable running commands inside docker containers and mounting your external files, the whole pipeline is in the container available by running:
docker pull samhorsfield96/celebrimbor:main
To run within the container, use the command below, replacing <path to output dir> and <path to fasta dir> with absolute paths and changing other parameters as required:
docker run -v <path to output dir>:/output -v <path to fasta dir>:/data samhorsfield96/celebrimbor:main snakemake --cores 4 --config genome_fasta=/data output_dir=/output bakta_db=bakta_db/db-light cgt_exe=cgt_bacpop cgt_breaks=0.05,0.95 cgt_error=0.05 clustering_method=panaroo panaroo_stringency=moderate
Note: ensure that the clustering_method and panaroo_stringency parameters are not in quotes.
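As a rough sketch of how the container command above could be assembled programmatically, the Python helper below resolves the two host directories to absolute paths and appends the config values as bare key=value pairs. The function name docker_command and the directory names results and mags are hypothetical, chosen only for illustration; the image name, mount points and config keys are taken from the command above.

```python
import os
import shlex


def docker_command(output_dir, fasta_dir, config, cores=4,
                   image="samhorsfield96/celebrimbor:main"):
    """Build the argument list for running the pipeline inside the container."""
    # Volume mounts need absolute host paths, so resolve them first.
    out_abs = os.path.abspath(output_dir)
    fasta_abs = os.path.abspath(fasta_dir)
    args = [
        "docker", "run",
        "-v", f"{out_abs}:/output",
        "-v", f"{fasta_abs}:/data",
        image,
        "snakemake", "--cores", str(cores), "--config",
        "genome_fasta=/data", "output_dir=/output",
    ]
    # Config values are appended as bare key=value pairs, with no quotes added.
    args += [f"{key}={value}" for key, value in config.items()]
    return args


# Hypothetical host directories; the config values mirror the example command.
config = {
    "bakta_db": "bakta_db/db-light",
    "cgt_exe": "cgt_bacpop",
    "cgt_breaks": "0.05,0.95",
    "cgt_error": 0.05,
    "clustering_method": "panaroo",
    "panaroo_stringency": "moderate",
}
command = docker_command("results", "mags", config)
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be passed directly to subprocess.run.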
Update config.yaml to specify workflow and directory paths.
- core: gene frequency cutoff for core genes; any gene above this frequency is annotated as a core gene.
- output_dir: path to output directory. Does not need to exist prior to running.
- genome_fasta: path to directory containing FASTA files (must have .fasta extension).
- bakta_db: path to bakta db downloaded above.
- cgt_exe: path to cgt executable.
- cgt_breaks: frequency for rare/core gene cutoff, e.g. 0.1,0.9, meaning genes predicted at <0.1 frequency will be rare, 0.1<=x<0.9 will be middle and >=0.9 will be core.
- cgt_error: sets false assignment rate of gene to particular frequency compartment.
- clustering_method: choice of either mmseqs2 (for speed) or panaroo (for accuracy).
- panaroo_stringency: stringency of Panaroo quality control measures. One of strict, moderate or sensitive.
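To make the cgt_breaks intervals concrete, here is a minimal Python sketch that maps a gene frequency to a compartment using the thresholds described above. It is a plain threshold lookup, not the probabilistic assignment performed by cgt, and the function names parse_breaks and compartment are hypothetical.

```python
def parse_breaks(cgt_breaks):
    """Split a 'rare,core' string such as '0.1,0.9' into two floats."""
    rare_max, core_min = (float(value) for value in cgt_breaks.split(","))
    if not 0.0 <= rare_max <= core_min <= 1.0:
        raise ValueError(f"invalid breaks: {cgt_breaks}")
    return rare_max, core_min


def compartment(frequency, cgt_breaks="0.1,0.9"):
    """Return 'rare', 'middle' or 'core' for a gene frequency in [0, 1]."""
    rare_max, core_min = parse_breaks(cgt_breaks)
    if frequency < rare_max:
        return "rare"      # x < 0.1 with the default breaks
    if frequency < core_min:
        return "middle"    # 0.1 <= x < 0.9
    return "core"          # x >= 0.9


for freq in (0.05, 0.1, 0.5, 0.9):
    print(freq, compartment(freq))
```

Note that both boundaries are inclusive on the upper compartment: a frequency of exactly 0.1 is middle and exactly 0.9 is core.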
Run snakemake (must be in same directory as Snakemake file):
snakemake --cores <cores>
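Because input files must have the .fasta extension, it can help to check the input directory before starting a run. The sketch below is one way to do that in Python; split_by_extension is a hypothetical helper, and it assumes an exact, lower-case .fasta suffix match, which may be stricter or looser than what the workflow itself does.

```python
import tempfile
from pathlib import Path


def split_by_extension(fasta_dir):
    """Return (accepted, other): file names with and without a .fasta suffix."""
    accepted, other = [], []
    for path in sorted(Path(fasta_dir).iterdir()):
        if not path.is_file():
            continue  # ignore subdirectories
        if path.suffix == ".fasta":
            accepted.append(path.name)
        else:
            other.append(path.name)
    return accepted, other


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("mag1.fasta", "mag2.fa", "mag3.fna"):
        (Path(tmp) / name).write_text(">contig1\nACGT\n")
    accepted, other = split_by_extension(tmp)

print("accepted:", accepted)
print("need renaming:", other)
```

Files listed under "need renaming" would have to be renamed to end in .fasta before they can be used as input.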
To test running of the workflow, download this repository, replace path/to with actual paths, and run:
snakemake --cores 1 --config genome_fasta=test/fasta output_dir=test_output bakta_db=path/to/bakta_db/db-light cgt_exe=path/to/cgt_bacpop cgt_breaks=0.05,0.95 cgt_error=0.05 clustering_method=panaroo panaroo_stringency=moderate
This test directory contains simulated MAGs from Kallonen et al.
The output directory test_output will contain:
- annotated directory, containing gene annotations from bakta.
- mmseqs2 or panaroo directory, containing gene clusters from mmseqs2 or Panaroo respectively.
- presence_absence_matrix.txt, a tab-separated file describing the presence/absence of genes (rows) in each genome (columns).
- pangenome_summary.tsv, a tab-separated file detailing gene annotations, frequencies and pre-adjustment frequency compartments in the pangenome.
- checkm_out.tsv, a summary file generated by CheckM describing genome completeness and contamination.
- cgt_output.txt, a summary file detailing the observed frequency and adjusted frequency compartment of each gene in the pangenome.
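As an illustration of how a genes-by-genomes presence/absence table relates to observed gene frequencies, the Python sketch below computes per-gene frequencies from a small made-up table. The exact layout of presence_absence_matrix.txt is an assumption here (a header row, gene identifier in the first column, 0/1 calls in the rest), so check a real output file before adapting it; gene_frequencies is a hypothetical helper.

```python
import csv
import io

# Made-up example in the assumed layout: genes as rows, genomes as columns.
example = (
    "Gene\tmag1\tmag2\tmag3\tmag4\n"
    "geneA\t1\t1\t1\t1\n"
    "geneB\t1\t0\t1\t0\n"
    "geneC\t0\t0\t0\t1\n"
)


def gene_frequencies(handle):
    """Return {gene: fraction of genomes in which the gene is present}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n_genomes = len(header) - 1  # first column holds the gene identifier
    frequencies = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, calls = row[0], row[1:]
        frequencies[gene] = sum(int(call) for call in calls) / n_genomes
    return frequencies


freqs = gene_frequencies(io.StringIO(example))
for gene, freq in freqs.items():
    print(gene, freq)
```

To run it on a real file, pass an open file handle instead of the io.StringIO object.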
This workflow annotates genes in metagenome-assembled genomes (MAGs) and uses a probabilistic model to assign each gene to a gene frequency compartment based on its observed frequency and the completeness of the genomes.
- Predict genes in all FASTA files in given directory using bakta
- Cluster genes using mmseqs2 or Panaroo and generate a gene presence/absence matrix
- Generate a pangenome summary of observed gene frequencies
- Calculate genome completeness using CheckM
- Probabilistically assign each gene family as core|middle|rare using cgt
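To show why genome completeness matters for the final assignment step, here is a toy Python calculation. It is not the cgt model; it assumes, purely for illustration, that each MAG retains any given gene with probability equal to its completeness, independently of whether the gene is truly present. The completeness values and the function name expected_observed_frequency are made up, and completeness is written as a fraction rather than a percentage.

```python
def expected_observed_frequency(true_frequency, completeness):
    """Expected observed gene frequency under the toy retention assumption.

    Each genome i carries the gene with probability true_frequency and
    retains it with probability completeness[i], so the expectation is
    true_frequency multiplied by the mean completeness.
    """
    mean_completeness = sum(completeness) / len(completeness)
    return true_frequency * mean_completeness


# Made-up completeness values for three MAGs, as fractions.
completeness = [0.9, 0.8, 0.7]
core_cutoff = 0.95  # upper break from cgt_breaks=0.05,0.95

# A gene truly present in every genome (true frequency 1.0).
observed = expected_observed_frequency(1.0, completeness)
print(round(observed, 3), observed >= core_cutoff)
```

Under these assumptions a gene present in every genome is expected to be observed in only about 80% of the MAGs, below the 0.95 cutoff, so a naive threshold would not call it core. This is the kind of bias the probabilistic assignment step is meant to correct.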
When using CELEBRIMBOR, please cite: