
Alternative to Static Tagging Text Classification #71

Open
manisnesan opened this issue Mar 10, 2024 · 1 comment

Comments

@manisnesan
Owner

Treat it as an unsupervised problem.

Approach (idea inspired by the topic modelling on user prompts in the Chatbot Arena paper):

To study the prompt diversity, we build a topic modeling pipeline with BERTopic (Grootendorst, 2022). We start with transforming user prompts into representation vectors using OpenAI’s text embedding model (text-embedding-3-small). To mitigate the curse of dimensionality for data clustering, we employ UMAP (Uniform Manifold Approximation and Projection) (McInnes et al., 2020) to reduce the embedding dimension from 1,536 to 5. We then use the hierarchical density-based clustering algorithm, HDBSCAN, to identify topic clusters with minimum cluster size 32. Finally, to obtain topic labels, we sample 10 prompts from each topic cluster and feed into GPT-4-Turbo for topic summarization.
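
A rough sketch of how the steps quoted above could be wired together for our own texts. This is not the paper's code. It assumes the embeddings are already computed elsewhere (for example with text-embedding-3-small) and that the third-party packages `umap-learn` and `hdbscan` are installed. The function names `reduce_and_cluster`, `sample_per_topic` and `build_label_prompt` are hypothetical, and the prompt wording is only a placeholder for whatever is sent to the LLM.

```python
import random
from collections import defaultdict


def reduce_and_cluster(embeddings, n_components=5, min_cluster_size=32, seed=0):
    """Reduce embeddings with UMAP, then cluster them with HDBSCAN.

    Defaults mirror the numbers in the quoted passage (5 dimensions,
    minimum cluster size 32). Returns one integer label per row;
    HDBSCAN marks noise points with the label -1.
    """
    # Imported lazily so the helpers below work without these packages.
    import umap      # pip install umap-learn
    import hdbscan   # pip install hdbscan

    reducer = umap.UMAP(n_components=n_components, random_state=seed)
    reduced = reducer.fit_transform(embeddings)
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)


def sample_per_topic(texts, labels, k=10, seed=0):
    """Pick up to k texts from each cluster, skipping noise (label -1)."""
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for text, label in zip(texts, labels):
        label = int(label)
        if label != -1:
            by_topic[label].append(text)
    return {
        topic: rng.sample(members, min(k, len(members)))
        for topic, members in sorted(by_topic.items())
    }


def build_label_prompt(samples):
    """Build the text sent to an LLM to name one topic (placeholder wording)."""
    bullets = "\n".join(f"- {s}" for s in samples)
    return "Summarize the common topic of these texts in a few words:\n" + bullets
```

Usage would be roughly `labels = reduce_and_cluster(embeddings)`, then `sample_per_topic(texts, labels)`, then one LLM call per topic with `build_label_prompt(...)`. The resulting topic names replace the static tag list, so new tags can appear as the data changes.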
