
pyNBS.consensus_clustering.consensus_hclust_hard

Justin Huang edited this page Jan 27, 2018 · 7 revisions

This function performs the consensus clustering step of the pyNBS algorithm. It requires a list of H factor matrices (Hlist below) generated by multiple iterations of the pyNBS algorithm. The consensus clustering method is adapted from the one first described by Monti et al. (2003) and has the following steps:

1. Construct patient co-clustering matrix (cc_hard_sim_table)

  1. For each H factor matrix: find the column holding the largest value in each patient's row (the row-wise argmax) and assign the patient to that cluster number (hard clustering).
  2. Create a patient-by-patient matrix counting the number of times each patient pair has appeared in the same H matrix.
  3. Create a patient-by-patient matrix counting the number of times each patient pair has been assigned to the same cluster from (1.1).
  4. Perform an element-wise division of the matrix from (1.3) by the matrix from (1.2). The resulting matrix is the patient co-clustering matrix (cc_hard_sim_table).

2. Construct patient linkage map from patient co-clustering matrix

  1. Construct the patient co-clustering distance matrix by computing 1 - cc_hard_sim_table, where cc_hard_sim_table is the patient co-clustering matrix from (1.4).
  2. Call the SciPy linkage function on the resulting patient co-clustering distance matrix from (2.1).

3. Assign patient clusters from patient linkage map hierarchy

  1. Apply the SciPy fcluster function to the linkage map from (2.2). The result is the consensus patient cluster assignments.
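
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pyNBS source: the name consensus_sketch and the toy H matrices are hypothetical, and two details are assumptions that the steps do not spell out. First, the square distance matrix is passed to linkage as a set of observation vectors, which is what supplying a distance metric implies. Second, fcluster is called with criterion="maxclust" so that k is read as the number of clusters.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage


def consensus_sketch(Hlist, k=3, method="average", metric="euclidean"):
    """Hypothetical re-implementation of steps 1-3; not the pyNBS code."""
    # Union of all patients seen in any H matrix
    patients = sorted(set().union(*[set(H.index) for H in Hlist]))
    n = len(patients)
    co_occur = np.zeros((n, n))  # step 1.2: times a pair shared an H matrix
    co_clust = np.zeros((n, n))  # step 1.3: times a pair shared a cluster

    for H in Hlist:
        # Step 1.1: hard assignment = position of the largest value in each row
        labels = pd.Series(H.values.argmax(axis=1), index=H.index)
        # One-hot membership aligned to the full patient list;
        # patients absent from this H keep an all-zero row
        member = np.zeros((n, H.shape[1]))
        for patient, label in labels.items():
            member[patients.index(patient), label] = 1.0
        present = member.sum(axis=1)
        co_occur += np.outer(present, present)
        co_clust += member @ member.T

    # Step 1.4: element-wise division (NaN for pairs that never co-occur,
    # which this sketch does not handle)
    sim = pd.DataFrame(co_clust / co_occur, index=patients, columns=patients)

    # Step 2: distance matrix, then linkage (assumption: rows are observations)
    dist = 1 - sim
    Z = linkage(dist.values, method=method, metric=metric)

    # Step 3: cut the tree (assumption: criterion="maxclust")
    assign = pd.Series(fcluster(Z, k, criterion="maxclust"), index=patients)
    return sim, Z, assign


# Toy input: two groups of three patients, k=2
patients = ["p1", "p2", "p3", "p4", "p5", "p6"]
H_a = pd.DataFrame(
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]],
    index=patients,
)
H_b = pd.DataFrame(H_a.values[:, ::-1], index=patients)  # cluster labels swapped
H_c = H_a.loc[["p1", "p2", "p4", "p5"]]                  # a patient subsample

sim, Z, assign = consensus_sketch([H_a, H_b, H_c], k=2)
print(sim)
print(assign)
```

Swapping the cluster labels in H_b does not change the outcome, because only whether two patients share a label is counted. Recent SciPy versions may warn that the input looks like an uncondensed distance matrix; under the assumption stated above that is expected.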

Function Call:

consensus_hclust_hard(Hlist, k=3, hclust_linkage_method='average', hclust_linkage_metric='euclidean', verbose=True, **save_args)

Parameters:

  • Hlist (required, list): A list of "H matrices" produced from the mixed_netNMF or NBS_single function. Each H factor matrix must be a pandas DataFrame of size p-by-k, where p is the number of patients and k is the number of clusters.
  • k (optional, int, default=3): Number of patient clusters to construct. k must be equal to the number of columns in each H factor matrix in Hlist.
  • hclust_linkage_method (optional, str, default='average'): The hierarchical clustering linkage method to use. Other methods are described in the scipy.cluster.hierarchy.linkage documentation.
  • hclust_linkage_metric (optional, str, default='euclidean'): The distance metric to use when constructing the linkage map of patients to be clustered in each H matrix. Other distance measures are described in the scipy.spatial.distance.pdist documentation.
  • verbose (optional, bool, default=True): Verbosity flag for reporting on function progress.
  • **save_args (optional, dict, default=None): Dictionary of strings for saving results.
    • save_args['outdir']: A string containing the path of the directory in which to save the consensus patient co-clustering table and the patient cluster assignments. If this parameter is given within **save_args, the function automatically saves the similarity table and cluster assignment map to this location.
    • save_args['job_name']: A string containing a file prefix for the results saved in save_args['outdir']. If this parameter is not given, the similarity table file name defaults to cc_matrix.csv and the cluster assignment file name defaults to cluster_assignments.csv.
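
The requirements on Hlist and k can be checked before calling the function. The helper below, check_hlist, is hypothetical and not part of pyNBS; it is only a sketch of the two constraints stated above (each entry is a pandas DataFrame, and each has exactly k columns).

```python
import pandas as pd


def check_hlist(Hlist, k):
    """Hypothetical pre-flight check; not part of pyNBS."""
    if len(Hlist) == 0:
        raise ValueError("Hlist is empty")
    for i, H in enumerate(Hlist):
        # Each H factor matrix must be a pandas DataFrame of size p-by-k
        if not isinstance(H, pd.DataFrame):
            raise TypeError("Hlist[%d] is not a pandas DataFrame" % i)
        # k must equal the number of columns of every H matrix
        if H.shape[1] != k:
            raise ValueError(
                "Hlist[%d] has %d columns, expected k=%d" % (i, H.shape[1], k)
            )
    return True


# Toy example: one 2-patient, 2-cluster H matrix
good = [pd.DataFrame([[0.9, 0.1], [0.2, 0.8]], index=["p1", "p2"])]
check_hlist(good, k=2)
```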

Returns:

  • cc_hard_sim_table (pandas.DataFrame): The patient-by-patient co-clustering matrix. Each element is the proportion of times a pair of patients was clustered together, out of all H matrices in Hlist in which both patients are present. This matrix is used along with Z in the plot_cc_map function to plot the co-clustering map.
  • Z (numpy.ndarray): Linkage map of patients. This is the result of calling the scipy.cluster.hierarchy.linkage function on the distance matrix: 1-cc_hard_sim_table. This linkage map is used along with cc_hard_sim_table in the plot_cc_map function to plot the co-clustering map.
  • cluster_assign (pandas.Series): A Series that maps each patient to a single cluster number, obtained by cutting the hierarchical tree described by Z.