Changes to the clustering algorithm to support out-of-sample predictions #15
base: master
Conversation
krstopro left a comment
First look.
```diff
- def fit(self, X, y):
+ def fit(self, X, y, random_state=None):
```
I don't see random_state being used anywhere inside fit. Why is it added as an argument?
As discussed 24-02:
Randomness is already handled in the implementations of the abstract method in both _kmeans.py and _kmodes.py, so there is no need to add it as an argument to fit.
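For reference, a minimal sketch of that pattern, assuming a scikit-learn KMeans as the splitter (the parameter values are illustrative): the seed lives in the splitter's constructor, so fit itself never needs a random_state argument.

```python
from sklearn.cluster import KMeans

# Illustrative: the seed is fixed when the k-means splitter is constructed
# (e.g. in the concrete class's __init__), so fit(X, y) itself does not need
# a random_state argument.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict([[0.0], [0.1], [5.0], [5.1]])
print(labels)  # two clusters, reproducible across runs
```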
```diff
+ # Fit the centroids
+ self.centroids_ = self.calc_centroids(X, self.labels_)
```
I wouldn't add a centroids_ field to BiasAwareHierarchicalClustering, since it can be used to implement clustering algorithms that do not use centroids (e.g. DBSCAN).
As discussed 24-02:
We don't want to return centroids as part of _bach.py because it's an abstract class. The best way forward is to implement this in the BiasAwareHierarchicalKMeans and BiasAwareHierarchicalKModes classes. This allows us/others to implement other clustering methods that don't use centroids in the future.
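A rough sketch of that division of responsibilities, assuming scikit-learn's KMeans as the splitter; the constructor and the single-split fit below are placeholders, not the actual HBAC procedure:

```python
from abc import ABC, abstractmethod

import numpy as np
from sklearn.cluster import KMeans


class BiasAwareHierarchicalClustering(ABC):
    # Abstract base (_bach.py): no centroids_ attribute, so it can also back
    # algorithms such as DBSCAN that have no notion of a centroid.
    @abstractmethod
    def _split(self, X):
        """Split X into two clusters and return binary labels."""


class BiasAwareHierarchicalKMeans(BiasAwareHierarchicalClustering):
    # Centroids are an implementation detail of the k-means variant, so they
    # are computed and stored only here.
    def __init__(self, random_state=None):
        self.kmeans = KMeans(n_clusters=2, n_init=10, random_state=random_state)

    def _split(self, X):
        return self.kmeans.fit_predict(X)

    def fit(self, X, y):
        # Placeholder: a single split stands in for the full bias-aware
        # hierarchical procedure that produces self.labels_.
        self.labels_ = self._split(X)
        self.centroids_ = np.stack(
            [X[self.labels_ == k].mean(axis=0) for k in np.unique(self.labels_)]
        )
        return self
```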
To be discussed:
Does a predict function based on the calculated centroids give the same assignments as the HBAC results?
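To make the question concrete, here is a hypothetical nearest-centroid predict (not necessarily what this PR implements), assuming centroids has shape (n_clusters, n_features):

```python
import numpy as np


def predict_nearest_centroid(X, centroids):
    # Hypothetical out-of-sample prediction: assign each sample to the cluster
    # with the closest centroid. Because HBAC forms clusters by recursive
    # splitting rather than by nearest-centroid assignment, these labels are
    # not guaranteed to reproduce labels_ even on the training data.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(distances, axis=1)
```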
```diff
  @abstractmethod
- def _split(self, X):
+ def _split(self, X, random_state=None):
```
random_state also seems redundant over here.
See first thread
| """ | ||
| pass | ||
|
|
||
| def binary_chi_square_test(self, m, labels, k, bonf_correct): |
self is not used anywhere inside the function, so this should be moved out, probably into utils.
Implement in a new post_processing.py in the utils subfolder.
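A possible shape for the module-level version in post_processing.py (a sketch only: the statistics are inferred from the argument names and return values, and the actual implementation in the PR may differ):

```python
import numpy as np
from scipy.stats import chi2_contingency


def binary_chi_square_test(m, labels, k, bonf_correct):
    # Sketch: compare the rate of a binary metric m inside cluster k against
    # the rest of the data with a chi-square test on a 2x2 table.
    m = np.asarray(m).astype(bool)
    in_cluster = np.asarray(labels) == k
    table = np.array([
        [np.sum(m & in_cluster), np.sum(~m & in_cluster)],
        [np.sum(m & ~in_cluster), np.sum(~m & ~in_cluster)],
    ])
    _, p_clust, _, _ = chi2_contingency(table)
    if bonf_correct:
        # Bonferroni correction over the number of clusters being tested.
        p_clust = min(p_clust * len(np.unique(labels)), 1.0)
    p1 = m[in_cluster].mean()   # metric rate inside cluster k
    p0 = m[~in_cluster].mean()  # metric rate outside cluster k
    diff_clust = p1 - p0
    return p_clust, diff_clust, p1, p0
```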
```diff
+ return p_clust, diff_clust, p1, p0

+ def t_test(self, m, labels, k, bonf_correct, alternative='two-sided'):
```
Same here.
Implement in a new post_processing.py in the utils subfolder.
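And a similarly hedged sketch of a module-level t_test, assuming Welch's t-test between the metric inside and outside cluster k; details such as the correction may differ from the PR:

```python
import numpy as np
from scipy.stats import ttest_ind


def t_test(m, labels, k, bonf_correct, alternative='two-sided'):
    # Sketch: compare the mean of metric m inside cluster k against the rest
    # of the data with Welch's t-test.
    m = np.asarray(m, dtype=float)
    in_cluster = np.asarray(labels) == k
    m1, m0 = m[in_cluster], m[~in_cluster]
    _, p_clust = ttest_ind(m1, m0, equal_var=False, alternative=alternative)
    if bonf_correct:
        # Bonferroni correction over the number of clusters being tested.
        p_clust = min(p_clust * len(np.unique(labels)), 1.0)
    diff_clust = m1.mean() - m0.mean()
    return p_clust, diff_clust, m1.mean(), m0.mean()
```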
```diff
+ return p_clust, diff_clust, m1.mean(), m0.mean()

+ def calc_ratio_within_between(self, m, labels):
```
Also here. :)
```diff
  return self.kmeans.fit_predict(X)

+ def calc_centroids(self, X, labels):
```
This should be private (i.e. prefixed with an underscore _), if present in the class at all.
```diff
+ centroids = np.zeros((X.shape[1], len(np.unique(labels))))

+ # iterate over the labels
+ for i, label in enumerate(np.unique(labels)):
```
The number of unique labels should be equal to self.n_clusters_, so np.unique(labels) is unnecessary here.
Also, there is no need for enumerate; you can just write for i in range(self.n_clusters_) or for label in range(self.n_clusters_).
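Roughly what the suggestion looks like as a private method; the loop body (a per-cluster mean) and the column-per-cluster layout are assumptions based on the allocation above:

```python
import numpy as np


def _calc_centroids(self, X, labels):
    # Sketch of the suggested loop: labels are assumed to be the integers
    # 0 .. self.n_clusters_ - 1, so np.unique and enumerate are unnecessary.
    centroids = np.zeros((X.shape[1], self.n_clusters_))
    for label in range(self.n_clusters_):
        # Assumed body: the centroid is the mean of the rows in this cluster.
        centroids[:, label] = X[labels == label].mean(axis=0)
    return centroids
```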
```diff
  return self.kmodes.fit_predict(X)

+ def calc_centroids(self, X, labels):
```
Same here, I would make it private.
The PR also includes some code related to the paper; that part should be ignored/rejected.