pyNBS.consensus_clustering.consensus_hclust_hard
Justin Huang edited this page Jan 27, 2018 · 7 revisions
This function performs the consensus clustering step in the pyNBS algorithm. It requires a list of H factor matrices (`Hlist` below) generated by multiple iterations of the pyNBS algorithm. The method is adapted from the one first described by Monti et al. (2003) and has the following steps:
1. Construct the patient co-clustering matrix (`cc_hard_sim_table`)
   - For each H factor matrix: for each row (patient), find the column with the largest value (the argmax across columns) and assign the patient to that cluster number (hard clustering).
   - Create a patient-by-patient matrix counting the number of times each patient pair has appeared in the same H matrix.
   - Create a patient-by-patient matrix counting the number of times each patient pair has been assigned to the same cluster from (1.1).
   - Perform an element-wise division of the matrix from (1.3) by the matrix from (1.2). The resulting matrix is the patient co-clustering matrix (`cc_hard_sim_table`).
2. Construct the patient linkage map from the patient co-clustering matrix
   - Construct the patient co-clustering distance matrix by taking 1 - `cc_hard_sim_table` (the patient co-clustering matrix from (1.4) above).
   - Call the SciPy linkage function on the resulting patient co-clustering distance matrix from (2.1).
3. Assign patient clusters from the patient linkage map hierarchy
   - Use the SciPy fcluster function on the resulting linkage map from (2.2). This gives the consensus patient cluster assignments.
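The three steps above can be sketched end to end as follows. This is a minimal illustration, not the pyNBS source: the helper names `co_clustering_table` and `consensus_sketch` are hypothetical, treating never-co-occurring pairs as similarity 0 is an assumption, and `criterion="maxclust"` is assumed for cutting the tree into `k` clusters. Note that SciPy treats a 2-D array passed to `linkage` as observation vectors, so the metric is applied between rows of the distance matrix, and SciPy may emit a `ClusterWarning` for such input.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage


def co_clustering_table(Hlist):
    """Step 1 (hypothetical helper): patient co-clustering proportions."""
    # Union of patients over all H matrices (an H matrix may lack some patients).
    patients = sorted(set().union(*[H.index for H in Hlist]))
    pos = {p: i for i, p in enumerate(patients)}
    n = len(patients)
    co_occur = np.zeros((n, n))   # (1.2) times a pair appears in the same H matrix
    co_assign = np.zeros((n, n))  # (1.3) times a pair gets the same hard cluster
    for H in Hlist:
        labels = H.to_numpy().argmax(axis=1)  # (1.1) hard clustering per patient
        idx = np.array([pos[p] for p in H.index])
        block = np.ix_(idx, idx)
        co_occur[block] += 1
        co_assign[block] += labels[:, None] == labels[None, :]
    # (1.4) element-wise division; pairs that never co-occur become NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        cc = co_assign / co_occur
    return pd.DataFrame(cc, index=patients, columns=patients)


def consensus_sketch(Hlist, k, method="average", metric="euclidean"):
    """Steps 1-3 (hypothetical helper), returning (cc table, Z, assignments)."""
    cc = co_clustering_table(Hlist)
    # (2.1) distance matrix; NaN pairs are treated as similarity 0 (assumption).
    dist = 1 - cc.fillna(0)
    # (2.2) rows of the 2-D array are treated as observation vectors by SciPy.
    Z = linkage(dist.to_numpy(), method=method, metric=metric)
    # (3.1) cut the tree into at most k flat clusters (assumed criterion).
    assign = pd.Series(fcluster(Z, k, criterion="maxclust"), index=cc.index)
    return cc, Z, assign


# Toy H matrices (p-by-k, k=2); the third one lacks patient p4.
H1 = pd.DataFrame([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]],
                  index=["p1", "p2", "p3", "p4"])
H2 = pd.DataFrame([[0.3, 0.7], [0.4, 0.6], [0.9, 0.1], [0.6, 0.4]],
                  index=["p1", "p2", "p3", "p4"])
H3 = pd.DataFrame([[0.7, 0.3], [0.2, 0.8], [0.1, 0.9]],
                  index=["p1", "p2", "p3"])

cc, Z, cluster_assign = consensus_sketch([H1, H2, H3], k=2)
print(cc.round(2))
print(cluster_assign)
```

In this toy input, p1 and p2 share a cluster in two of the three H matrices, while p3 and p4 share a cluster in both matrices that contain p4, so the sketch groups the patients as {p1, p2} and {p3, p4}.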
consensus_hclust_hard(
Hlist, k=3, hclust_linkage_method='average', hclust_linkage_metric='euclidean', verbose=True, **save_args
)
- Hlist (required, list): A list of "H matrices" produced by the `mixed_netNMF` or `NBS_single` function. Each H factor matrix must be a pandas DataFrame of size p-by-k, where p is the number of patients and k is the number of clusters.
- k (optional, int, default=3): Number of patient clusters to construct. k must equal the number of columns in each H factor matrix in `Hlist`.
- hclust_linkage_method (optional, str, default='average'): The hierarchical clustering linkage method to use. Other methods are described in the `scipy.cluster.hierarchy.linkage` documentation.
- hclust_linkage_metric (optional, str, default='euclidean'): The distance metric to use when constructing the linkage map of patients to be clustered in each H matrix. Other distance measures are described in the `scipy.spatial.distance.pdist` documentation.
- verbose (optional, bool, default=True): Verbosity flag for reporting on function progress.
- `**save_args` (optional, dict, default=None): Dictionary of strings for saving results.
  - `save_args['outdir']`: A string containing the directory path in which to save the consensus patient co-clustering table and the patient cluster assignments. If this parameter is given within `**save_args`, the function automatically saves the similarity table and cluster assignment map to this location.
  - `save_args['job_name']`: A string containing a file prefix for the results saved in `save_args['outdir']`. If no job name is given, the similarity table file name defaults to `cc_matrix.csv` and the cluster assignment file name defaults to `cluster_assignments.csv`.
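The input requirements above (each entry of `Hlist` is a pandas DataFrame with exactly `k` columns) can be checked before calling the function. The sketch below is only an illustration; `check_Hlist` is a hypothetical helper and is not part of pyNBS.

```python
import pandas as pd


def check_Hlist(Hlist, k=3):
    """Hypothetical pre-flight check (not part of pyNBS) for the documented inputs."""
    if len(Hlist) == 0:
        raise ValueError("Hlist is empty")
    for i, H in enumerate(Hlist):
        # Each H factor matrix must be a pandas DataFrame ...
        if not isinstance(H, pd.DataFrame):
            raise TypeError(f"Hlist[{i}] is not a pandas DataFrame")
        # ... of size p-by-k, so its column count must match k.
        if H.shape[1] != k:
            raise ValueError(
                f"Hlist[{i}] has {H.shape[1]} columns, expected k={k}"
            )


# One toy H matrix with 2 patients and k=3 clusters.
good = [pd.DataFrame([[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]], index=["p1", "p2"])]
check_Hlist(good, k=3)  # passes silently
```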
- cc_hard_sim_table (pandas.DataFrame): The patient-by-patient co-clustering matrix. Each element is the proportion of times a pair of patients was clustered together across all H matrices in `Hlist` in which both patients are present. This matrix is used along with `Z` in the `plot_cc_map` function to plot the co-clustering map.
- Z (numpy.ndarray): Linkage map of patients. This is the result of calling the `scipy.cluster.hierarchy.linkage` function on the distance matrix 1 - `cc_hard_sim_table`. This linkage map is used along with `cc_hard_sim_table` in the `plot_cc_map` function to plot the co-clustering map.
- cluster_assign (pandas.Series): A Series that maps each patient (index) to a single cluster (value), based on cutting the hierarchical tree described by `Z`.
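To illustrate how the three return values relate, the sketch below reorders a toy co-clustering table by the dendrogram leaf order of `Z`, so that patients in the same cluster sit next to each other, as one would want for a heat map. How `plot_cc_map` uses these objects internally is not stated here; the toy table, the use of `leaves_list`, and `criterion="maxclust"` are assumptions made for illustration. SciPy may emit a `ClusterWarning` because a square distance-like matrix is passed as observations.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage

# Toy co-clustering table: a and c co-cluster often, as do b and d.
patients = ["a", "b", "c", "d"]
cc_hard_sim_table = pd.DataFrame(
    [[1.0, 0.1, 0.9, 0.0],
     [0.1, 1.0, 0.2, 0.8],
     [0.9, 0.2, 1.0, 0.1],
     [0.0, 0.8, 0.1, 1.0]],
    index=patients, columns=patients,
)

# Linkage on the distance matrix 1 - cc_hard_sim_table (rows as observations).
Z = linkage((1 - cc_hard_sim_table).to_numpy(), method="average", metric="euclidean")

# Flat cluster labels, indexed by patient (assumed criterion: maxclust).
cluster_assign = pd.Series(fcluster(Z, 2, criterion="maxclust"), index=patients)

# Reorder rows and columns by dendrogram leaf order.
order = cc_hard_sim_table.index[leaves_list(Z)]
cc_ordered = cc_hard_sim_table.loc[order, order]

print(cc_ordered)
print(cluster_assign.value_counts())
```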