PhyloClust

A new fast method for clustering phylogenetic trees using K-means and inferring multiple supertrees

About

=> =============================================================================================================================
=> Program   : PhyloClust - 2022
=> Authors   : Nadia Tahiri and Aleksandr Koshkarov (University of Sherbrooke)
=> This program clusters phylogenetic trees using the k-means partitioning algorithm.
=> These trees may have the same or different, but mutually overlapping, sets of leaves (the multiple supertree problem).
=> Phylogenetic trees must be given in the Newick format (program input).
=> The output is a partitioning of the input trees into K clusters.
=> The optimal number of clusters can be determined by the Silhouette (SH), Gap statistic (Gap), or Ball-Hall (BH) cluster
=> validity index, each adapted for tree clustering.
=> A supertree can then be inferred for each cluster of trees.
=> The Robinson and Foulds topological distance is used in the objective function of K-means.
=> The list of the program parameters is specified below.
=> =============================================================================================================================
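
As an illustration of the Robinson and Foulds distance mentioned above, here is a minimal Python sketch. It is not PhyloClust's implementation: it assumes both trees have exactly the same leaf set (so it ignores the overlap penalty α), it treats the trees as unrooted, and it represents a tree as nested tuples of leaf names rather than parsing Newick. The names `leaves`, `bipartitions`, and `rf_distance` are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf names of a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits (bipartitions) induced by internal edges."""
    all_leaves = leaves(tree)
    reference = min(all_leaves)  # used to pick one canonical side of each split
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # A split is informative only if both sides contain at least two leaves.
        if len(below) >= 2 and len(other) >= 2:
            splits.add(below if reference in below else other)
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("this sketch assumes identical leaf sets")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Newick equivalents: ((A,B),(C,D)); and ((A,C),(B,D));
t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
print(rf_distance(t1, t1), rf_distance(t1, t2))
```

Each tree on four leaves has a single informative split, so two trees that disagree on it differ by two splits, while a tree compared with itself gives zero.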

Installation

$ git clone https://github.com/tahiri-lab/PhyloClust
$ make
or
$ make install

To clean the project:
$ make clean

Help

$ make help

Examples

Please execute the following command line:
=> For trees: ./PhyloClust -tree input_file cluster_validity_index α Kmin Kmax

=> input_file: the input file for the program
=> cluster_validity_index: the cluster validity index used in K-means (1 for Silhouette, 2 for Gap statistic, 3 for Ball-Hall)
=> α: the penalty parameter for species overlap in phylogenetic trees (must be between 0 and 1)
=> Kmin: the minimum number of clusters in K-means.
    	- For SH,  Kmin>=2,
    	- For Gap, Kmin>=1,
    	- For BH,  Kmin>=1.
=> Kmax: the maximum number of clusters in K-means.
    	- Kmax must be less than or equal to N-1 (where N is the number of input trees).
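
The constraints listed above can be checked before launching the program. The following Python sketch is one possible way to do so; the function name `check_parameters` is hypothetical and not part of PhyloClust, and the `Kmin <= Kmax` check is an assumption that follows from the meaning of the two parameters rather than a rule stated in this README.

```python
from typing import List

# Smallest allowed Kmin for each cluster validity index code
# (1 = Silhouette, 2 = Gap statistic, 3 = Ball-Hall), as listed above.
MIN_KMIN = {1: 2, 2: 1, 3: 1}


def check_parameters(index: int, alpha: float, kmin: int, kmax: int,
                     n_trees: int) -> List[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if index not in MIN_KMIN:
        problems.append("cluster_validity_index must be 1, 2 or 3")
    elif kmin < MIN_KMIN[index]:
        problems.append(f"Kmin must be >= {MIN_KMIN[index]} for index {index}")
    if not 0 <= alpha <= 1:
        problems.append("alpha must be between 0 and 1")
    if kmax > n_trees - 1:
        problems.append(f"Kmax must be <= N-1 = {n_trees - 1}")
    if kmin > kmax:  # assumption: not stated explicitly in the README
        problems.append("Kmin must not exceed Kmax")
    return problems


# Hypothetical call: Silhouette, alpha = 0.1, K from 3 to 8, 20 input trees.
print(check_parameters(1, 0.1, 3, 8, 20))
```

Returning a list of messages rather than raising on the first error lets a caller report every invalid parameter at once.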

Command line execution examples:
     1) input_file = data/Covid-19_trees.txt, cluster_validity_index = SH, α = 0.1, Kmin = 3, Kmax = 8:
    ./PhyloClust -tree ../data/Covid-19_trees.txt 1 0.1 3 8
     2) input_file = data/all_trees_woese.txt, cluster_validity_index = SH, α = 1, Kmin = 2, Kmax = 10:
    ./PhyloClust -tree ../data/all_trees_woese.txt 1 1 2 10
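
When running several configurations, the command lines above can be assembled from a script. The Python sketch below builds the argument list in the documented order and estimates N, the number of input trees, by counting the semicolons that terminate Newick trees. The helper names `count_newick_trees` and `build_command` are hypothetical, and the semicolon count assumes that no tree label contains a quoted semicolon.

```python
from typing import List


def count_newick_trees(text: str) -> int:
    """Estimate the number of trees: each Newick tree ends with ';'."""
    return text.count(";")


def build_command(input_file: str, index: int, alpha: float,
                  kmin: int, kmax: int) -> List[str]:
    """Argument list in the order: -tree input_file index alpha Kmin Kmax."""
    return ["./PhyloClust", "-tree", input_file,
            str(index), f"{alpha:g}", str(kmin), str(kmax)]


# Reproduces the first example command above.
cmd = build_command("../data/Covid-19_trees.txt", 1, 0.1, 3, 8)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the compiled `PhyloClust` binary.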

Input

=> The input data sets are located in the folder "data".

Output

=> See the folder "output"
The output is in the following files:
1) stat.csv - for the clustering statistics;
2) output.txt - for the cluster content.
