kalininalab/CACY

CACY: Core genes Alignment-free phylogeny and Capture of taxonomY relationship

Attention

This repository is currently under active development. New features and documentation are coming soon.

Summary

CACY is a command-line tool for the phylogenetic and taxonomic analysis of closely related organisms.

For phylogenetic analysis, the tool uses alignment-free methods to construct phylogenetic trees from the amino acid sequences of core genes. Given a list of proteomes in FASTA format from various species, it clusters all the proteins and selects those that belong to the core genome. The pipeline then feeds them into alignment-free methods to generate the phylogenetic tree (or splits). For taxonomic analysis, the tool calculates pairwise Average Nucleotide Identity (ANI) or Percentage Of Conserved Proteins (POCP) values and then reports strict Operational Taxonomic Units (OTUs) using a graph-based algorithm.

You can find a more detailed explanation of the tool on ReadTheDocs.

Workflow

Installation

CACY can be installed from source into a conda environment:

git clone https://github.com/garrison-chen/CACY.git && cd CACY
conda env create --file=environments.yaml
conda activate cacy

Next, run the following commands to build the additional dependency (sans):

git clone https://github.com/gi-bielefeld/sans.git
cd sans
make

Usage

CACY is designed to run several workflows, each of which chains different modules; the modules can also be run individually. Each workflow takes a list of closely related strains (proteomes or genomes) as input:

Workflow: easy-core-phylo
Modules: cluster > distribute > extract > phylo
Run this workflow if you want to construct a phylogenetic tree using the core genes. This workflow clusters all the proteins from the input proteomes and selects those that belong to the core genome. It then feeds them into the alignment-free methods to efficiently generate the phylogenetic trees.

python CACY.py easy-core-phylo -i input_directory -o output_directory -c clustering_option -f threshold
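
The core-gene selection step can be pictured with a minimal sketch. It is an illustration, not CACY's own code. It assumes that the threshold is the minimum fraction of genomes in which a cluster must occur, and that clusters with more than one protein from the same genome are skipped. The names `genome_frequency`, `select_core`, and the `single_copy` flag are hypothetical.

```python
def genome_frequency(clusters, n_genomes):
    """Fraction of genomes that contribute at least one protein to each cluster."""
    return {
        cid: len({genome for genome, _ in members}) / n_genomes
        for cid, members in clusters.items()
    }


def select_core(clusters, n_genomes, threshold=1.0, single_copy=True):
    """Keep clusters found in at least `threshold` of the genomes (assumed meaning)."""
    freq = genome_frequency(clusters, n_genomes)
    core = {}
    for cid, members in clusters.items():
        if freq[cid] < threshold:
            continue  # cluster is not widespread enough to count as core
        genomes = [genome for genome, _ in members]
        if single_copy and len(genomes) != len(set(genomes)):
            continue  # a genome contributes several proteins (paralogs): skip
        core[cid] = members
    return core


# Toy input: cluster id -> list of (genome, protein id) pairs.
clusters = {
    "c1": [("A", "A_p1"), ("B", "B_p7"), ("C", "C_p3")],
    "c2": [("A", "A_p2"), ("B", "B_p9")],
    "c3": [("A", "A_p4"), ("A", "A_p5"), ("B", "B_p2"), ("C", "C_p8")],
    "c4": [("C", "C_p1")],
}
core = select_core(clusters, n_genomes=3, threshold=1.0)
print(sorted(core))
```

The values returned by `genome_frequency` are the quantity a gene frequency distribution plot would histogram: many clusters near 0 and near 1 give the U shape.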



Workflow: easy-compare-sotu
Modules: compare > sotu
Run this workflow if you want to calculate the pairwise ANI or POCP values and report strict OTUs. This workflow uses fastANI and POCP to calculate pairwise ANI and POCP values and stores the results in a PHYLIP-formatted lower-triangle matrix. This matrix is then converted to an adjacency matrix according to a user-defined cutoff. Next, the workflow turns the adjacency matrix into an undirected graph and applies the Bron-Kerbosch algorithm, using the solver from NetworkX, to calculate all the maximal cliques, which are reported as the strict OTU groups.

python CACY.py easy-compare-sotu -i input_directory -o output_directory -c clustering_option 
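
The matrix-to-cliques steps described above can be sketched as follows. CACY itself is stated to rely on the NetworkX solver; this dependency-free version only illustrates the idea with a plain Bron-Kerbosch recursion with pivoting. The function names, the toy similarity values, and the choice of `>=` for the cutoff comparison are assumptions.

```python
def read_lower_triangle(text):
    """Parse a PHYLIP-style lower-triangle matrix into names and pairwise values."""
    rows = [line.split() for line in text.strip().splitlines()]
    n = int(rows[0][0])
    names, sim = [], {}
    for row in rows[1:n + 1]:
        name, values = row[0], [float(v) for v in row[1:]]
        # Row i lists the values against the i previously seen names.
        for other, value in zip(names, values):
            sim[frozenset((name, other))] = value
        names.append(name)
    return names, sim


def build_adjacency(names, sim, cutoff):
    """Connect two strains when their similarity reaches the cutoff (assumed >=)."""
    adj = {name: set() for name in names}
    for pair, value in sim.items():
        if value >= cutoff:
            a, b = tuple(pair)
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with Bron-Kerbosch and pivoting."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))  # r cannot be extended: it is maximal
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


matrix = """4
s1
s2 97.2
s3 96.1 95.4
s4 80.3 81.0 96.5
"""
names, sim = read_lower_triangle(matrix)
groups = maximal_cliques(build_adjacency(names, sim, cutoff=95.0))
print(groups)  # [['s1', 's2', 's3'], ['s3', 's4']]
```

In this toy case `s3` appears in two groups, because maximal cliques may overlap: every member of a group is above the cutoff with every other member, but a strain can satisfy that in more than one group.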



Workflow: easy-compare-phylo
Modules: compare > phylo
Run this workflow if you want to construct phylogenetic trees using the pairwise ANI or POCP values. Like the previous workflow, this one uses fastANI and POCP to calculate pairwise ANI and POCP values. It then applies the neighbour-joining algorithm to construct the phylogenetic trees.

python CACY.py easy-compare-phylo -i input_directory -o output_directory -m similarity_metric 
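
For readers unfamiliar with neighbour-joining, here is a compact sketch of the textbook algorithm on a small distance matrix. It is not taken from CACY. In particular, converting a percentage similarity to a distance as `100 - value` is an assumption made for illustration, and the names `to_distance` and `neighbor_joining` are hypothetical.

```python
def to_distance(sim):
    """Turn percentage similarities into distances (assumed: 100 - value)."""
    return {pair: 100.0 - value for pair, value in sim.items()}


def neighbor_joining(names, dist):
    """Build an unrooted tree (Newick string) from pairwise distances.

    `dist` maps frozenset({a, b}) to the distance between a and b.
    """
    d = dict(dist)
    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Pick the pair that minimises the Q criterion.
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[frozenset((a, b))] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        dab = d[frozenset((a, b))]
        # Branch lengths from the joined pair to their new parent node.
        len_a = 0.5 * dab + (total[a] - total[b]) / (2 * (n - 2))
        len_b = dab - len_a
        parent = f"({a}:{len_a:.3f},{b}:{len_b:.3f})"
        # Distances from the new parent to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[frozenset((parent, c))] = 0.5 * (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - dab
                )
        nodes = [c for c in nodes if c not in (a, b)] + [parent]
    a, b = nodes
    # The tree is unrooted, so the last edge is written on one side only.
    return f"({a}:{d[frozenset((a, b))]:.3f},{b}:0.000);"


taxa = ["a", "b", "c", "d", "e"]
rows = [[0, 5, 9, 9, 8], [5, 0, 10, 10, 9], [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3], [8, 9, 7, 3, 0]]
distances = {frozenset((taxa[i], taxa[j])): rows[i][j]
             for i in range(5) for j in range(i)}
newick = neighbor_joining(taxa, distances)
print(newick)
```

Internal nodes are labelled with their own Newick substring, which keeps the sketch short at the cost of long dictionary keys; a real implementation would use node objects or indices.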



Run the following workflow if you want to identify the cutoffs that separate a specific taxon.

python CACY.py easy-todo 



The full usage is shown below:

CACY (Core genes Alignment-free phylogeny and Capture of taxonomY relationship), V1.0.0, Mar 2025

WORKFLOW:
[easy-core-phylo]              [cluster] > [distribute] > [extract] > [phylo]
[easy-compare-sotu]            [compare] > [sotu]

COMMANDS (core modules):
[cluster]                      Perform clustering on the input amino acid sequences
[distribute]                   Create the universal gene frequency distribution U-shape plot
[extract]                      Select and extract the core-genes amino acid sequences from each proteome
[compare]                      Calculate the pairwise similarities among the given strains using POCP or ANI
[sotu]                         Report the strict OTU (sOTU) groups
[phylo]                        Construct the Phylogenetic tree
[hgt]                          Detect the horizontal gene transfer

COMMANDS (auxiliary modules):
[taxon-search]                 search for the NCBI taxon id/name
[download]                     download the NCBI RefSeq data
[annotate]                     perform genome annotation

Usage: python CACY.py COMMANDS/WORKFLOW [OPTIONS]
Possible [OPTIONS] for COMMANDS/WORKFLOW can be seen with syntax: python CACY.py COMMANDS/WORKFLOW --help

Description of workflows and modules

The first three columns give the position of each module within a workflow; an empty cell means the module is not part of that workflow.

| easy-core-phylo | easy-compare-phylo | easy-compare-sotu | Module | Description | Input | Output |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | | | cluster | Perform clustering on the input amino acid sequences | amino acid sequences | protein clusters |
| 2 | | | distribute | Create the universal gene frequency distribution U-shape plot | protein clusters | gene frequency distribution plot |
| 3 | | | extract | Select and extract the core-gene amino acid sequences from each proteome | amino acid sequences, protein clusters | selected amino acid sequences from core genes |
| | 1 | 1 | compare | Calculate the pairwise similarities among the given strains using POCP or ANI | proteomes/genomes | pairwise similarity matrix |
| | | 2 | sotu | Report the strict OTU (sOTU) groups | pairwise similarity matrix | strict OTU groups |
| 4 | 2 | | phylo | Construct the phylogenetic tree | amino acid sequences | phylogenetic tree/splits |
| | | | hgt | Detect horizontal gene transfer | amino acid sequences | HGT donors |
| | | | taxon-search | Search for the NCBI taxon ID/name | organism's name/NCBI taxon ID | organism's name/NCBI taxon ID |
| | | | download | Download the NCBI RefSeq data | organism's name/NCBI taxon ID | NCBI genomes (with proteomes) |
| | | | annotate | Perform genome annotation | genome/nucleotide sequences | annotated sequences |
