Changes to the clustering algorithm to support out-of-sample predictions #15
base: master
Conversation
krstopro left a comment
First look.
```diff
- def fit(self, X, y):
+ def fit(self, X, y, random_state=None):
```
I don't see random_state being used anywhere inside fit. Why is it added as an argument?
As discussed 24-02:
Randomness is already handled in the implementations of the abstract method in both _kmeans.py and _kmodes.py, so there is no need to add it as an argument to fit.
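For reference, a minimal sketch of that pattern, assuming a scikit-learn KMeans as the splitter (the parameter values are illustrative): the seed lives in the splitter's constructor, so fit itself never needs a random_state argument.

```python
from sklearn.cluster import KMeans

# Illustrative: the seed is fixed when the k-means splitter is constructed
# (e.g. in the concrete class's __init__), so fit(X, y) itself does not need
# a random_state argument.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict([[0.0], [0.1], [5.0], [5.1]])
print(labels)  # two clusters, reproducible across runs
```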
```diff
+ # Fit the centroids
+ self.centroids_ = self.calc_centroids(X, self.labels_)
```
I wouldn't add a centroids_ field to BiasAwareHierarchicalClustering, since it can be used to implement clustering algorithms that do not use centroids (e.g. DBSCAN).
As discussed 24-02:
We don't want to return centroids as part of _bach.py because it's an abstract class. The best way forward is to implement this in the BiasAwareHierarchicalKMeans and BiasAwareHierarchicalKModes classes. This allows us/others to implement other clustering methods that don't use centroids in the future.
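A rough sketch of that division of responsibilities, assuming scikit-learn's KMeans as the splitter; the constructor and the single-split fit below are placeholders, not the actual HBAC procedure:

```python
from abc import ABC, abstractmethod

import numpy as np
from sklearn.cluster import KMeans


class BiasAwareHierarchicalClustering(ABC):
    # Abstract base (_bach.py): no centroids_ attribute, so it can also back
    # algorithms such as DBSCAN that have no notion of a centroid.
    @abstractmethod
    def _split(self, X):
        """Split X into two clusters and return binary labels."""


class BiasAwareHierarchicalKMeans(BiasAwareHierarchicalClustering):
    # Centroids are an implementation detail of the k-means variant, so they
    # are computed and stored only here.
    def __init__(self, random_state=None):
        self.kmeans = KMeans(n_clusters=2, n_init=10, random_state=random_state)

    def _split(self, X):
        return self.kmeans.fit_predict(X)

    def fit(self, X, y):
        # Placeholder: a single split stands in for the full bias-aware
        # hierarchical procedure that produces self.labels_.
        self.labels_ = self._split(X)
        self.centroids_ = np.stack(
            [X[self.labels_ == k].mean(axis=0) for k in np.unique(self.labels_)]
        )
        return self
```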
To be discussed:
Does a predict function based on the calculated centroids give the same assignments as the HBAC results?
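To make the question concrete, here is a hypothetical nearest-centroid predict (not necessarily what this PR implements), assuming centroids has shape (n_clusters, n_features):

```python
import numpy as np


def predict_nearest_centroid(X, centroids):
    # Hypothetical out-of-sample prediction: assign each sample to the cluster
    # with the closest centroid. Because HBAC forms clusters by recursive
    # splitting rather than by nearest-centroid assignment, these labels are
    # not guaranteed to reproduce labels_ even on the training data.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(distances, axis=1)
```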
```diff
  @abstractmethod
- def _split(self, X):
+ def _split(self, X, random_state=None):
```
random_state also seems redundant over here.
See first thread
| """ | ||
| pass | ||
|
|
||
| def binary_chi_square_test(self, m, labels, k, bonf_correct): |
self is not used anywhere inside the function, so this should be moved out, probably into utils.
Implement in a new post_processing.py in the utils subfolder.
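A possible shape for the module-level version in post_processing.py (a sketch only: the statistics are inferred from the argument names and return values, and the actual implementation in the PR may differ):

```python
import numpy as np
from scipy.stats import chi2_contingency


def binary_chi_square_test(m, labels, k, bonf_correct):
    # Sketch: compare the rate of a binary metric m inside cluster k against
    # the rest of the data with a chi-square test on a 2x2 table.
    m = np.asarray(m).astype(bool)
    in_cluster = np.asarray(labels) == k
    table = np.array([
        [np.sum(m & in_cluster), np.sum(~m & in_cluster)],
        [np.sum(m & ~in_cluster), np.sum(~m & ~in_cluster)],
    ])
    _, p_clust, _, _ = chi2_contingency(table)
    if bonf_correct:
        # Bonferroni correction over the number of clusters being tested.
        p_clust = min(p_clust * len(np.unique(labels)), 1.0)
    p1 = m[in_cluster].mean()   # metric rate inside cluster k
    p0 = m[~in_cluster].mean()  # metric rate outside cluster k
    diff_clust = p1 - p0
    return p_clust, diff_clust, p1, p0
```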
```diff
+ return p_clust, diff_clust, p1, p0

+ def t_test(self, m, labels, k, bonf_correct, alternative='two-sided'):
```
Same here.
Implement in a new post_processing.py in the utils subfolder.
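And a similarly hedged sketch of a module-level t_test, assuming Welch's t-test between the metric inside and outside cluster k; details such as the correction may differ from the PR:

```python
import numpy as np
from scipy.stats import ttest_ind


def t_test(m, labels, k, bonf_correct, alternative='two-sided'):
    # Sketch: compare the mean of metric m inside cluster k against the rest
    # of the data with Welch's t-test.
    m = np.asarray(m, dtype=float)
    in_cluster = np.asarray(labels) == k
    m1, m0 = m[in_cluster], m[~in_cluster]
    _, p_clust = ttest_ind(m1, m0, equal_var=False, alternative=alternative)
    if bonf_correct:
        # Bonferroni correction over the number of clusters being tested.
        p_clust = min(p_clust * len(np.unique(labels)), 1.0)
    diff_clust = m1.mean() - m0.mean()
    return p_clust, diff_clust, m1.mean(), m0.mean()
```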
```diff
+ return p_clust, diff_clust, m1.mean(), m0.mean()

+ def calc_ratio_within_between(self, m, labels):
```
Also here. :)
```diff
  return self.kmeans.fit_predict(X)

+ def calc_centroids(self, X, labels):
```
This should be private (i.e. prefixed with an underscore _), if present in the class at all.
```diff
+ centroids = np.zeros((X.shape[1], len(np.unique(labels))))

+ # iterate over the labels
+ for i, label in enumerate(np.unique(labels)):
```
The number of unique labels should be equal to self.n_clusters_, so np.unique(labels) is unnecessary here.
Also, there is no need for enumerate; you can just write for i in range(self.n_clusters_) or for label in range(self.n_clusters_).
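Roughly what the suggestion looks like as a private method; the loop body (a per-cluster mean) and the column-per-cluster layout are assumptions based on the allocation above:

```python
import numpy as np


def _calc_centroids(self, X, labels):
    # Sketch of the suggested loop: labels are assumed to be the integers
    # 0 .. self.n_clusters_ - 1, so np.unique and enumerate are unnecessary.
    centroids = np.zeros((X.shape[1], self.n_clusters_))
    for label in range(self.n_clusters_):
        # Assumed body: the centroid is the mean of the rows in this cluster.
        centroids[:, label] = X[labels == label].mean(axis=0)
    return centroids
```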
```diff
  return self.kmodes.fit_predict(X)

+ def calc_centroids(self, X, labels):
```
Same here, I would make it private.
The PR also includes some code related to the paper; that part should be ignored/rejected.