KnowEnG's General Clustering Pipeline
This is the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence, General Clustering Pipeline.
This pipeline clusters a spreadsheet's columns, with various methods:
Options
Method
Parameters
K-means
K Means
kmeans
hierarchical clustering
hierarchical clustering
hclust
Linked hierarchical clustering
hierarchical clustering constraint
link_hclust
Bootstrapped hierarchical clustering
consensus hierarchical clustering
cc_ hclust
Bootstrapped K-means
consensus K Means
cc_kmeans
Bootstrapped Linked hierarchical clustering
consensus linked hierarchical clustering
cc_link_hclust
How to run this pipeline with Our data
1. Clone the General_Clustering_Pipeline Repo
git clone https://github.com/KnowEnG-Research/General_Clustering_Pipeline.git
2. Install the following, for Linux
apt-get install -y python3-pip libfreetype6-dev libxft-dev libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install pyyaml knpackage scipy==0.19.1 numpy==1.11.1 pandas==0.18.1 matplotlib==1.4.2 scikit-learn==0.17.1
3. Change directory to General_Clustering_Pipeline
cd General_Clustering_Pipeline
4. Change directory to test
5. Create a local directory "run_dir" and place all the run files in it
6. Use one of the following "make" commands to select and run a clustering option:
Command
Option
make run_kmeans_binary
Clustering with k-means
make run_kmeans_continuous
make run_hclust_binary
Hierarchical Clustering
make run_hclust_continuous
make run_link_hclust_binary
Hierarchical linkage Clustering
make run_link_hclust_continuous
make run_cc_kmeans_binary
Consensus Clustering with k-means
make run_cc_kmeans_continuous
make run_cc_hclust_binary
Consensus Hierarchical Clustering
make run_cc_hclust_continuous
make run_cc_link_hclust_binary
Consensus Hierarchical linkage Clustering
How to run this pipeline with Your data
Follow steps 1-5 above then do the following:
* Create your run directory
* Change directory to the run directory
* Create your results directory
* Create run_paramters file (YAML Format)
Look for examples of run_parameters in the General_Clustering_Pipeline/data/run_files zTEMPLATE_cc_hclust.yml
* Modify run_paramters file (YAML Format)
Change processing_method to one of: serial, parallel depending on your machine.
processing_method: serial
set the data file targets to the files you want to run, and the parameters as appropriate for your data.
* Run the General Clustering Pipeline:
Update PYTHONPATH enviroment variable
export PYTHONPATH='../src':$PYTHONPATH
python3 ../src/general_clustering.py -run_directory ./run_dir -run_file zTEMPLATE_cc_net_nmf.yml
Description of "run_parameters" file
Key
Value
Comments
method
kmeans ,hclust ,link_hclust ,cc_kmeans , cc_hclust , cc_link_hclust
Choose clustering method
affinity_metric
euclidean , manhattan , jaccard
Choose clustering affinity
linkage_criterion
ward , complete , average
Choose clustering affinity
spreadsheet_name_full_path
directory+spreadsheet_name
Path and file name of user supplied gene sets
results_directory
directory
Directory to save the output files
tmp_directory
./run_dir/tmp
Directory to save the temporary files
number_of_clusters
3
Estimated number of clusters
number_of_bootstraps
4
Number of bootstraps for cc_kmeans, cc_hclust and cc_link_hclust
rows_sampling_fraction
0.8
Select 80% of spreadsheet rows
cols_sampling_fraction
0.8
Select 80% of spreadsheet columns
top_number_of_rows
10
Top number of features to analyze
processing_method
serial or parallel or distribute
Choose processing method
parallelism
number of cores
Set number of cores for speed or memory
threshold
10
Threshold to define categorical data and continuous data in evaluation toolbox
nearest_neighbors
10
Number of Nearest Neighbors in cc_link_hclust method
spreadsheet_name = EXPR_GSE_METABRIC_lymphN_binary.tsv.gz
Description of Output files saved in results directory
Output files of all methods save row by col heatmap variances per row with name row_variance_{method}_{timestamp}_viz.tsv .
variance
row 1
float
...
...
row m
float
Output files of all the methods save row by col heatmap with name row_by_col_heatmp_{method}_{timestamp}_viz.tsv .
col 1
...
col n
row 1
float
...
float
...
...
...
...
row m
float
...
float
Output files of all methods save col to cluster map with name col_labeled_by_cluster_{method}_{timestamp}_viz.tsv .
cluster
col 1
int
...
...
col n
int
Output files of all methods save row scores by cluster with name row_averages_by_cluster_{method}_{timestamp}_viz.tsv .
cluster 1
...
cluster k
row 1
float
...
float
...
...
...
...
row m
float
...
float
Output files of all methods save spreadsheet with top ranked rows per column with name top_row_by_cluster_{method}_{timestamp}_download.tsv .
cluster 1
...
cluster k
row 1
1/0
...
1/0
...
...
...
...
row m
1/0
...
1/0
All methods save silhouette number of clusters and corresponding silhouette score with name silhouette_average_{method}_{timestamp}_viz.tsv.
File Example:
silhouette number of clusters = 3, corresponding silhouette score = 1