hdbscan #43

Closed
jwijffels opened this issue Feb 15, 2021 · 2 comments

@jwijffels

I'm trying out top2vec, an algorithm for clustering texts, as implemented by @michalovadek.
This algorithm first applies doc2vec to the texts to get document embeddings, then reduces these embeddings to a lower-dimensional space using uwot::umap, after which dbscan::hdbscan is applied to find clusters.
When trying this on a corpus of approximately 50000 documents, the call to hdbscan fails inside its call to dist when passing a 2D matrix. A reproducible example with some fake data is shown below. Is there a way for hdbscan to handle more rows (possibly related to issue #35)?

> library(dbscan)
> docs_umap <- matrix(rnorm(50000*2), ncol = 2)
> cl <- dbscan::hdbscan(docs_umap, minPts = 15L)
Error in dist(x, method = "euclidean") : 
  negative length vectors are not allowed
> cl <- dbscan::hdbscan(head(docs_umap, 10000), minPts = 15L)
> str(cl$cluster)
 num [1:10000] 0 4 4 4 0 4 4 0 4 4 ...
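For scale, here is a rough back-of-the-envelope sketch of what a full pairwise distance object for 50000 rows would involve. It assumes 8 bytes per stored distance (a double). That the oversized intermediate product is what triggers the "negative length" error is only a guess about the internals of dist, not something confirmed in this thread.

```r
# A `dist` object stores one value per unordered pair of rows: n * (n - 1) / 2.
n <- 50000
n_pairs <- n * (n - 1) / 2            # computed in double precision: 1249975000 pairs

# Approximate memory footprint in GiB, assuming 8 bytes per double.
size_gib <- n_pairs * 8 / 2^30

# The intermediate product n * (n - 1) is larger than the biggest 32-bit integer,
# which is one plausible (unconfirmed) source of a negative length.
exceeds_int <- n * (n - 1) > .Machine$integer.max

n_pairs
size_gib
exceeds_int
```

For comparison, the 10000-row subset that works needs 49995000 distances, which is about 0.37 GiB under the same assumption.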
@mhahsler
Owner

hdbscan needs to compute a minimum spanning tree (MST) on the mutual reachability matrix (which is calculated from the distance matrix). What we would need is a way to go from the data directly to the MST without storing the whole distance/mutual reachability matrix, at least for Euclidean distance. I am not quite sure how to do that... Ideas?
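One possible direction, as a minimal sketch only and not how dbscan is implemented: Prim's algorithm needs just one row of the mutual reachability matrix at a time, so each row can be recomputed from the data on demand. The function names `core_distances` and `mst_mutual_reachability` are hypothetical, and the sketch assumes Euclidean distance, the usual definition max(core(a), core(b), d(a, b)) for mutual reachability, and a core distance taken as the distance to the (minPts - 1)-th nearest other point. Memory stays O(n), but the run time is still O(n^2) distance evaluations.

```r
# Core distance of every point: distance to its k-th nearest other point.
# `tx` is the transposed data (one column per point); one row of distances at a time.
core_distances <- function(tx, k) {
  n <- ncol(tx)
  core <- numeric(n)
  for (i in seq_len(n)) {
    d <- sqrt(colSums((tx - tx[, i])^2))     # distances from point i to all points
    core[i] <- sort(d[-i], partial = k)[k]   # k-th smallest, excluding point i itself
  }
  core
}

# Prim's algorithm on the implicit mutual reachability graph.
# Returns an (n - 1) x 3 matrix of MST edges: from, to, weight.
mst_mutual_reachability <- function(x, minPts) {
  tx <- t(as.matrix(x))
  n <- ncol(tx)
  core <- core_distances(tx, k = minPts - 1)  # assumed convention for the core distance
  in_tree <- logical(n)
  best <- rep(Inf, n)                # cheapest known edge from the tree to each point
  parent <- rep(NA_integer_, n)      # tree endpoint of that cheapest edge
  edges <- matrix(NA_real_, nrow = n - 1, ncol = 3,
                  dimnames = list(NULL, c("from", "to", "weight")))
  current <- 1L
  in_tree[current] <- TRUE
  for (step in seq_len(n - 1)) {
    # One row of the mutual reachability matrix, computed on the fly.
    d <- sqrt(colSums((tx - tx[, current])^2))
    mrd <- pmax(d, core, core[current])
    improved <- !in_tree & mrd < best
    best[improved] <- mrd[improved]
    parent[improved] <- current
    # Pick the cheapest point that is not yet in the tree.
    candidate <- best
    candidate[in_tree] <- Inf
    nxt <- which.min(candidate)
    edges[step, ] <- c(parent[nxt], nxt, best[nxt])
    in_tree[nxt] <- TRUE
    current <- nxt
  }
  edges
}

set.seed(42)
toy <- matrix(rnorm(300 * 2), ncol = 2)
mst_edges <- mst_mutual_reachability(toy, minPts = 15L)
head(mst_edges)
```

Pure R loops like this would be slow for 50000 rows, so this is meant to illustrate the memory trade-off rather than to be used as is. Turning the edge list into the cluster hierarchy that hdbscan builds afterwards is not covered here.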

@jwijffels
Author

Initially I thought this could be covered with a bigmemory backend or even ALTREP, but there are probably smarter ways. I should take a more detailed look at the mutual reachability matrix calculation (https://github.com/mhahsler/dbscan/blob/master/src/mrd.cpp#L6) before I can provide you with ideas.
