Skip to content

Conversation

@fholstege
Copy link
Collaborator

Also includes some code related to the paper. This should be ignored/rejected.

Copy link
Collaborator

@krstopro krstopro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First look.


def fit(self, X, y):

def fit(self, X, y, random_state=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see random_state being used anywhere inside fit. Why is it added as an argument?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed 24-02:
Randomness is already implemented in abstract method both in _kmeans.py and _kmodes.py. So, no need to add it in the fit function.



# Fit the centroids
self.centroids_ = self.calc_centroids(X, self.labels_)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't add centroids_ field to BiasAwareHierarchicalClustering as it can be used to implement clustering algorithms that do not use centroids (e.g. DBSCAN).

Copy link
Contributor

@jfparie jfparie Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed 24-02:
We don't want to return centroids as part of the _bach.py because it's an abstract class. Best way forward is to implement as part of BiasAwareHierarchicalKMeans and BiasAwareHierarchicalKModes classes. This allows us/others to implement other clustering methods that don't use centroids in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be discussed:

Is predict function using calculated centroids the same as HBAC results?


@abstractmethod
def _split(self, X):
def _split(self, X, random_state=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

random_state also seems redundant over here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See first thread

"""
pass

def binary_chi_square_test(self, m, labels, k, bonf_correct):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self is not used anywhere inside the function, so this should be moved out, probably into utils.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement in new post_processing.py in subfolder utils

return p_clust, diff_clust, p1, p0


def t_test(self, m, labels, k, bonf_correct, alternative='two-sided'):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement in new post_processing.py in subfolder utils

return p_clust, diff_clust, m1.mean(), m0.mean()


def calc_ratio_within_between(self, m, labels):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here. :)

return self.kmeans.fit_predict(X)


def calc_centroids(self, X, labels):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be private (i.e. prefixed with an underscore _), if present in the class at all.

centroids = np.zeros((X.shape[1], len(np.unique(labels))))

# iterate over the labels
for i, label in enumerate(np.unique(labels)):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.unique(labels) should be equivalent to self.n_clusters_

Also, no need to use enumerate you can use just for i in range(self.n_clusters_) or for label in range(self.n_clusters_).

return self.kmodes.fit_predict(X)


def calc_centroids(self, X, labels):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, I would make it private.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants