hdbscan #43

jwijffels · 2021-02-15T09:18:29Z

I'm trying out an algorithm for clustering texts called top2vec, implemented by @michalovadek.
This algorithm first applies doc2vec to the texts to get document embeddings, then reduces these embeddings to a lower-dimensional space using uwot::umap, and finally applies dbscan::hdbscan to find clusters.
When I try this on a corpus of approximately 50000 documents, hdbscan fails in its internal call to dist when passed a 2D matrix. A reproducible example with fake data is shown below. Is there a way for hdbscan to handle more rows (possibly related to issue #35)?

> library(dbscan)
> docs_umap <- matrix(rnorm(50000*2), ncol = 2)
> cl <- dbscan::hdbscan(docs_umap, minPts = 15L)
Error in dist(x, method = "euclidean") : 
  negative length vectors are not allowed
> cl <- dbscan::hdbscan(head(docs_umap, 10000), minPts = 15L)
> str(cl$cluster)
 num [1:10000] 0 4 4 4 0 4 4 0 4 4 ...
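
To put numbers on the failure above, here is a small back-of-the-envelope sketch in base R. It rests on an assumption that the thread does not confirm: the error message is consistent with the length of the dist result being computed in 32-bit integer arithmetic. The variable names are only for illustration.

```r
# Rough size check for a full pairwise distance object on n rows.
n <- 50000                    # a double in R, so this arithmetic cannot overflow

# dist() stores the lower triangle: n * (n - 1) / 2 doubles of 8 bytes each.
n_pairs <- n * (n - 1) / 2
gib     <- n_pairs * 8 / 2^30
cat(sprintf("pairs: %.0f, memory: %.2f GiB\n", n_pairs, gib))

# Assumption: if the intermediate product n * (n - 1) is formed as a 32-bit
# integer, it exceeds .Machine$integer.max for n = 50000 but not for n = 10000.
overflow_50000 <- n * (n - 1) > .Machine$integer.max
overflow_10000 <- 10000 * 9999 > .Machine$integer.max
print(c(overflow_50000, overflow_10000))
```

Even if the length computation were not an issue, the distance object alone would need roughly 9.3 GiB for 50000 rows, before the mutual reachability matrix is built on top of it.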

mhahsler · 2021-02-16T20:46:54Z

hdbscan needs to compute a minimum spanning tree (MST) on the mutual reachability matrix (which is calculated from the distance matrix). What we would need is a way to go from the data directly to the MST without storing the whole distance/mutual reachability matrix, at least for Euclidean distance. I am not quite sure how to do that... Ideas?
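
One possible direction, offered as a minimal sketch and not as how dbscan implements or plans to implement this: Prim's algorithm only needs the mutual reachability distances from the most recently added point to all other points, so each row can be computed on the fly and discarded. That keeps memory at O(n) while the running time stays O(n^2). The function names `core_distances` and `mrd_mst` are hypothetical, and the core distance convention used here (distance to the (minPts - 1)-th nearest neighbour, not counting the point itself) is an assumption.

```r
# Hypothetical helpers in base R; not part of the dbscan package.

# Core distance of each point: distance to its (minPts - 1)-th nearest
# neighbour, computed one row at a time so only O(n) values are live.
core_distances <- function(x, minPts) {
  tx <- t(x)
  vapply(seq_len(nrow(x)), function(i) {
    d <- sqrt(colSums((tx - x[i, ])^2))   # Euclidean distances from point i
    sort(d, partial = minPts)[minPts]     # position 1 is the zero self-distance
  }, numeric(1))
}

# Prim's algorithm on the implicit mutual reachability graph.
# Returns an (n - 1) x 3 matrix of MST edges: from, to, weight.
mrd_mst <- function(x, minPts) {
  n <- nrow(x)
  tx <- t(x)
  core <- core_distances(x, minPts)
  in_tree <- logical(n)
  best <- rep(Inf, n)             # cheapest known edge from the tree to each point
  parent <- rep(NA_integer_, n)   # tree endpoint of that cheapest edge
  edges <- matrix(NA_real_, nrow = n - 1, ncol = 3,
                  dimnames = list(NULL, c("from", "to", "weight")))
  current <- 1L
  in_tree[current] <- TRUE
  for (k in seq_len(n - 1)) {
    # One row of the mutual reachability matrix, never stored beyond this step.
    d <- sqrt(colSums((tx - x[current, ])^2))
    mrd <- pmax(d, core, core[current])
    improve <- !in_tree & mrd < best
    best[improve] <- mrd[improve]
    parent[improve] <- current
    candidates <- best
    candidates[in_tree] <- Inf    # points already in the tree cannot be picked again
    nxt <- which.min(candidates)
    edges[k, ] <- c(parent[nxt], nxt, best[nxt])
    in_tree[nxt] <- TRUE
    current <- nxt
  }
  edges
}

# Small usage example on fake data.
set.seed(1)
toy <- matrix(rnorm(200 * 2), ncol = 2)
toy_mst <- mrd_mst(toy, minPts = 15L)
```

The quadratic number of distance evaluations is the obvious cost of this approach; for low-dimensional Euclidean data a spatial index could reduce it, but that is beyond what this sketch attempts.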

jwijffels · 2021-02-17T11:01:32Z

Initially I thought that this could be covered with some bigmemory backend or even ALTREP, but there probably exist smarter ways. I should take a more detailed look at the mutual reachability matrix calculation (https://github.com/mhahsler/dbscan/blob/master/src/mrd.cpp#L6) before I can provide you with ideas.

michalovadek mentioned this issue Feb 15, 2021

hdbscan struggles with large matrices michalovadek/top2vecr#2

Open

mhahsler added the question label Apr 27, 2021

mhahsler closed this as completed Mar 23, 2022
