This repository is currently under active development. New features and documentation are coming soon.
CACY is a command-line tool for the phylogenetic and taxonomic analysis of closely related organisms.
For phylogenetic analysis, the tool uses alignment-free methods to construct phylogenetic trees from the amino acid sequences of core genes. Given a list of proteomes in FASTA format from various species, it clusters all the proteins and selects those that belong to the core genome. The pipeline then feeds the selected sequences into alignment-free methods to generate the phylogenetic tree (or splits). For taxonomic analysis, the tool calculates the pairwise Average Nucleotide Identity (ANI) or Percentage Of Conserved Proteins (POCP) values, and then reports strict Operational Taxonomic Units (OTUs) using a graph-based algorithm.
You can find a more detailed explanation of the tool on ReadTheDocs.
CACY is installed from source into a conda environment:
git clone https://github.com/garrison-chen/CACY.git && cd CACY
conda env create --file=environments.yaml
conda activate cacy
Next, run the following commands to install the additional dependency (SANS):
git clone https://github.com/gi-bielefeld/sans.git
cd sans
make
CACY is designed to run several workflows, each of which chains together different modules. The modules can also be run individually. Each workflow takes a list of closely related strains (proteomes or genomes) as input:
Workflow: easy-core-phylo
Modules: cluster > distribute > extract > phylo
Run this workflow if you want to construct a phylogenetic tree from the core genes. The workflow clusters all the input proteins and selects those that belong to the core genome. It then feeds them into the alignment-free methods to efficiently generate the phylogenetic trees.
python CACY.py easy-core-phylo -i input_directory -o output_directory -c clustering_option -f threshold
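The core-gene selection step can be illustrated with a minimal sketch. It assumes that a cluster counts as "core" when it contains proteins from at least a given fraction of the input proteomes; whether this matches the meaning of CACY's threshold option is an assumption, and the function name, data layout, and example clusters below are hypothetical.

```python
def select_core_clusters(cluster_members, n_proteomes, threshold=1.0):
    """Return the IDs of clusters present in at least `threshold` of the proteomes.

    cluster_members maps a cluster ID to a list of (proteome, protein) pairs.
    This layout is hypothetical and is not CACY's actual file format.
    """
    core = []
    for cluster_id, members in cluster_members.items():
        # Count each proteome once, even if it contributes several paralogs.
        proteomes = {proteome for proteome, _ in members}
        if len(proteomes) / n_proteomes >= threshold:
            core.append(cluster_id)
    return sorted(core)


# Made-up clusters over three proteomes: A, B, and C.
clusters = {
    "c1": [("A", "A_001"), ("B", "B_004"), ("C", "C_010")],
    "c2": [("A", "A_002"), ("B", "B_007")],
    "c3": [("C", "C_003")],
    "c4": [("A", "A_005"), ("A", "A_006"), ("B", "B_001"), ("C", "C_002")],
}

strict_core = select_core_clusters(clusters, n_proteomes=3)
relaxed_core = select_core_clusters(clusters, n_proteomes=3, threshold=0.6)
print(strict_core, relaxed_core)
```

With a threshold of 1.0 only clusters found in every proteome are kept; lowering it admits clusters that are missing from a few proteomes.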
Workflow: easy-compare-sotu
Modules: compare > sotu
Run this workflow if you want to calculate the pairwise ANI or POCP values and report strict OTUs. This workflow uses fastANI and POCP to calculate the pairwise ANI and POCP values and stores the results in a PHYLIP-formatted lower-triangle matrix. This matrix is then converted to an adjacency matrix according to a user-defined cutoff. Next, the workflow turns the adjacency matrix into an undirected graph and applies the Bron-Kerbosch algorithm, using the solver from NetworkX, to find all the maximal cliques, which are reported as the strict OTU groups.
python CACY.py easy-compare-sotu -i input_directory -o output_directory -c clustering_option
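The matrix-to-cliques steps of this workflow can be sketched in plain Python. The sketch below parses a lower-triangle PHYLIP matrix, applies a cutoff to build an adjacency map, and enumerates the maximal cliques with a hand-written Bron-Kerbosch (with pivoting). It is an illustration only: CACY itself is described as using the NetworkX solver, and the function names, the example matrix, and the cutoff of 95 are assumptions made for this example.

```python
def read_lower_triangle(text):
    """Parse a PHYLIP lower-triangle matrix into (names, {(a, b): value})."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    n = int(lines[0])
    names, values = [], {}
    for i, ln in enumerate(lines[1:n + 1]):
        fields = ln.split()
        names.append(fields[0])
        # Row i holds the values against the i previously listed strains.
        for j, field in enumerate(fields[1:i + 1]):
            values[(names[j], names[i])] = float(field)
    return names, values


def adjacency_from_similarity(names, values, cutoff):
    """Connect two strains when their similarity reaches the cutoff."""
    adj = {name: set() for name in names}
    for (a, b), value in values.items():
        if value >= cutoff:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with Bron-Kerbosch and pivoting."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Made-up similarity values for four hypothetical strains.
PHYLIP_TEXT = """4
s1
s2 97.0
s3 96.0 95.5
s4 80.0 96.2 79.0
"""

names, values = read_lower_triangle(PHYLIP_TEXT)
adjacency = adjacency_from_similarity(names, values, cutoff=95.0)
sotus = maximal_cliques(adjacency)
print(sotus)
```

In this example s2 ends up in two groups, because maximal cliques may overlap: s2 is similar to s4 as well as to s1 and s3, while s4 is not similar to the other two.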
Workflow: easy-compare-phylo
Modules: compare > phylo
Run this workflow if you want to construct phylogenetic trees from the pairwise ANI or POCP values. Like the previous workflow, this one uses fastANI and POCP to calculate the pairwise ANI and POCP values. The workflow then applies the neighbour-joining algorithm to construct the phylogenetic trees.
python CACY.py easy-compare-phylo -i input_directory -o output_directory -m similarity_metric
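As a rough illustration of the tree-building step, the following sketch runs textbook neighbour joining on a small similarity table and returns an unrooted tree in Newick format. It assumes that distances are taken as 100 minus the percent similarity, which may differ from the conversion CACY uses; the function names and the similarity values are made up for the example.

```python
from itertools import combinations


def similarity_to_distance(names, similarity):
    """Turn percent similarities keyed by (a, b) into a symmetric distance table.

    Assumes distance = 100 - similarity; CACY's own conversion may differ.
    """
    dist = {name: {} for name in names}
    for (a, b), value in similarity.items():
        dist[a][b] = dist[b][a] = 100.0 - value
    return dist


def neighbour_joining(names, dist):
    """Build an unrooted Newick tree by textbook neighbour joining."""
    if len(names) < 3:
        raise ValueError("neighbour joining needs at least three taxa")
    d = {a: dict(dist[a]) for a in names}  # working copy, extended with new nodes
    nodes = list(names)
    while len(nodes) > 3:
        n = len(nodes)
        # Total distance from each node to all the others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Join the pair that minimises the Q criterion.
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:.3f},{b}:{len_b:.3f})"  # Newick label doubles as node key
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    # Resolve the last three nodes around a single central node.
    x, y, z = nodes
    len_x = 0.5 * (d[x][y] + d[x][z] - d[y][z])
    len_y = 0.5 * (d[x][y] + d[y][z] - d[x][z])
    len_z = 0.5 * (d[x][z] + d[y][z] - d[x][y])
    return f"({x}:{len_x:.3f},{y}:{len_y:.3f},{z}:{len_z:.3f});"


# Made-up pairwise similarities (percent) for five hypothetical strains.
taxa = ["a", "b", "c", "d", "e"]
similarity = {
    ("a", "b"): 95.0, ("a", "c"): 91.0, ("a", "d"): 91.0, ("a", "e"): 92.0,
    ("b", "c"): 90.0, ("b", "d"): 90.0, ("b", "e"): 91.0,
    ("c", "d"): 92.0, ("c", "e"): 93.0, ("d", "e"): 97.0,
}

tree = neighbour_joining(taxa, similarity_to_distance(taxa, similarity))
print(tree)
```

The returned string can be opened in any Newick-compatible tree viewer. The tree is unrooted, so the central trifurcation carries no evolutionary direction.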
Run the following workflow if you want to identify the cutoffs that separate a specific taxon.
python CACY.py easy-todo
The full usage is shown below:
CACY (Core genes Alignment-free phylogeny and Capture of taxonomY relationship), V1.0.0, Mar 2025
WORKFLOW:
[easy-core-phylo] [cluster] > [distribute] > [extract] > [phylo]
[easy-compare-sotu] [compare] > [sotu]
COMMANDS (core modules):
[cluster] Perform clustering on the input amino acid sequences
[distribute] Create the universal gene frequency distribution U-shape plot
[extract] Select and extract the core-genes amino acid sequences from each proteome
[compare] Calculate the pairwise similarities among the given strains using POCP or ANI
[sotu] Report the strict OTU (sOTU) groups
[phylo] Construct the Phylogenetic tree
[hgt] Detect the horizontal gene transfer
COMMANDS (auxiliary modules):
[taxon-search] search for the NCBI taxon id/name
[download] download the NCBI RefSeq data
[annotate] perform genome annotation
Usage: python CACY.py COMMANDS/WORKFLOW [OPTIONS]
Possible [OPTIONS] for COMMANDS/WORKFLOW can be seen with syntax: python CACY.py COMMANDS/WORKFLOW --help
| easy-core-phylo step | easy-compare-phylo step | easy-compare-sotu step | Module | Description | Input | Output |
|---|---|---|---|---|---|---|
| 1 | | | cluster | Perform clustering on the input amino acid sequences | amino acid sequences | protein clusters |
| 2 | | | distribute | Create the universal gene frequency distribution U-shape plot | protein clusters | gene frequency distribution plot |
| 3 | | | extract | Select and extract the core-gene amino acid sequences from each proteome | amino acid sequences, protein clusters | selected amino acid sequences from core genes |
| | 1 | 1 | compare | Calculate the pairwise similarities among the given strains using POCP or ANI | proteomes/genomes | pairwise similarity matrix |
| | | 2 | sotu | Report the strict OTU (sOTU) groups | pairwise similarity matrix | strict OTU groups |
| 4 | 2 | | phylo | Construct the phylogenetic tree | amino acid sequences | phylogenetic tree/splits |
| | | | hgt | Detect the horizontal gene transfer | amino acid sequences | HGT donors |
| | | | taxon-search | Search for the NCBI taxon id/name | organism’s name/NCBI taxon id | organism’s name/NCBI taxon id |
| | | | download | Download the NCBI RefSeq data | organism’s name/NCBI taxon id | NCBI genomes (with proteomes) |
| | | | annotate | Perform genome annotation | genome/nucleotide sequences | annotated sequences |