I'm trying out an algorithm for clustering texts called top2vec, implemented by @michalovadek.
This algorithm first applies doc2vec on the texts to get document embeddings, then reduces the dimensionality of these embeddings to a lower-dimensional space using uwot::umap, after which dbscan::hdbscan is applied to find the clusters.
When trying this out on a corpus of approximately 50000 documents, it fails in the call to dist inside hdbscan when passing a 2D matrix. A reproducible example with some fake data is shown below. Is there a way for hdbscan to handle more rows to cluster on (possibly related to issue #35)?
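Something along these lines, with random 2D points standing in for the real umap output (the minPts value is an arbitrary choice for illustration), triggers the failure:

```r
library(dbscan)

## Random 2D points standing in for the 50000 document embeddings
set.seed(42)
x <- matrix(rnorm(50000 * 2), ncol = 2)

## hdbscan() calls dist() on x internally, materialising all
## n * (n - 1) / 2 pairwise distances (~10 GB of doubles for n = 50000),
## which is where the allocation fails
cl <- hdbscan(x, minPts = 15)
```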
hdbscan needs to compute a minimum spanning tree (MST) on the mutual reachability matrix (which is calculated from the distance matrix). What we would need is a way to go from the data directly to the MST without storing the whole distance/mutual reachability matrix, at least for Euclidean distance. I am not quite sure how to do that... Ideas?
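For reference, the mutual reachability distance for minPts = k is mrd(i, j) = max(core_k(i), core_k(j), d(i, j)), where core_k(i) is the distance from i to its k-th nearest neighbour. A naive base-R sketch (illustrative only, not the package's C++ implementation) makes the memory problem explicit:

```r
## Naive mutual reachability matrix, assuming Euclidean distance and
## minPts = k. Illustrative sketch only.
mutual_reachability <- function(x, k) {
  d <- as.matrix(dist(x))          # full n x n matrix: the memory bottleneck
  ## core distance: distance to the k-th nearest neighbour (skip self at [1])
  core <- apply(d, 1, function(row) sort(row)[k + 1])
  ## mrd(i, j) = max(core[i], core[j], d[i, j])
  pmax(d, outer(core, core, pmax))
}
```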
Initially I thought this could be covered with some bigmemory backend or even ALTREP, but there probably exist smarter ways. I should take a closer look at the mutual reachability matrix calculation (https://github.com/mhahsler/dbscan/blob/master/src/mrd.cpp#L6) before I can provide you with ideas.
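In the meantime, at least for Euclidean distance, something like Prim's algorithm could build the MST row by row so that only O(n) extra memory is needed (at the cost of O(n^2) time). A rough base-R sketch of the idea; the function name and edge-list layout are made up for this illustration:

```r
## Prim's algorithm over the mutual reachability graph with O(n) memory:
## one row of Euclidean distances is recomputed per iteration instead of
## storing the full matrix. Illustrative only, not the dbscan internals.
mst_mrd <- function(x, k) {
  n <- nrow(x)
  core <- dbscan::kNN(x, k = k)$dist[, k]   # core distances via k-NN search
  in_tree <- logical(n)
  best <- rep(Inf, n)      # cheapest mrd edge linking each point to the tree
  parent <- integer(n)
  edges <- matrix(NA_real_, n - 1, 3,
                  dimnames = list(NULL, c("from", "to", "mrd")))
  cur <- 1L
  in_tree[cur] <- TRUE
  for (step in seq_len(n - 1)) {
    ## mutual reachability distances from the current point to all others
    d <- sqrt(rowSums(sweep(x, 2, x[cur, ])^2))
    mrd <- pmax(d, core, core[cur])
    ## relax edges from the newly added point
    upd <- !in_tree & mrd < best
    best[upd] <- mrd[upd]
    parent[upd] <- cur
    ## attach the cheapest point not yet in the tree
    nxt <- which.min(replace(best, in_tree, Inf))
    edges[step, ] <- c(parent[nxt], nxt, best[nxt])
    in_tree[nxt] <- TRUE
    cur <- nxt
  }
  edges
}
```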