BERTopic efficiently analyzes social network posts to extract meaningful topics using a multi-step process:
- Embedding: Convert posts into numerical representations using sentence-transformers models.
- Dimensionality Reduction: Reduce data dimensionality with techniques like UMAP.
- Clustering: Group similar posts using HDBSCAN, a density-based clustering method.
- Bag-of-Words: Generate bag-of-words representations for each cluster.
- Topic Representation: Apply c-TF-IDF, a class-based variant of TF-IDF, to highlight the words that distinguish each cluster; the top-scoring words form the topic descriptions.
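To make the last two steps more concrete, here is a minimal pure-Python sketch of the class-based TF-IDF weighting, following the formula given in the BERTopic documentation: the frequency of a term in a cluster, multiplied by log(1 + A / f), where A is the average number of tokens per cluster and f is the term's frequency across all clusters. The function names `c_tf_idf` and `top_words` and the toy clusters are illustrative assumptions; the real library works on sparse matrices and offers extra normalization options that are left out here.

```python
import math
from collections import Counter


def c_tf_idf(cluster_tokens):
    """Score terms per cluster.

    cluster_tokens maps a cluster id to the list of tokens from all posts
    in that cluster (hypothetical input format, chosen for illustration).
    """
    # Bag-of-words step: term counts per cluster.
    tf = {cluster: Counter(tokens) for cluster, tokens in cluster_tokens.items()}

    # Frequency of each term across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # Average number of tokens per cluster (A in the formula).
    avg_tokens = sum(total.values()) / len(tf)

    return {
        cluster: {
            term: count * math.log(1 + avg_tokens / total[term])
            for term, count in counts.items()
        }
        for cluster, counts in tf.items()
    }


def top_words(scores, k=3):
    """Return the k highest-scoring terms per cluster (ties broken alphabetically)."""
    return {
        cluster: [
            term
            for term, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        ]
        for cluster, s in scores.items()
    }


# Toy example: "the" is frequent in both clusters, so it is down-weighted.
clusters = {
    0: ["cat", "cat", "dog", "the", "the"],
    1: ["stock", "market", "stock", "the", "the"],
}
scores = c_tf_idf(clusters)
print(top_words(scores, k=2))  # {0: ['cat', 'dog'], 1: ['stock', 'market']}
```

In the toy data, "the" occurs as often as "cat" within cluster 0 but scores lower, because it is equally common in the other cluster.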
More info: https://maartengr.github.io/BERTopic/index.html
The resulting topic labels and clusters can be visualized in several ways, for example as per-topic word bar charts or as an inter-topic distance map.
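As a rough illustration of how the steps listed earlier map onto BERTopic components, and how one visualization can be exported, the sketch below wires the sub-models together explicitly. It is not necessarily what `src/main.py` does: the function names, the embedding model `all-MiniLM-L6-v2`, the UMAP and HDBSCAN settings, and the output file name are all assumptions chosen for the example.

```python
from pathlib import Path


def build_topic_model(min_cluster_size=10):
    """Assemble a BERTopic model from explicit sub-models (illustrative settings)."""
    # Imported lazily so that defining this function does not load the heavy
    # dependencies; they are only needed when the function is called.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    return BERTopic(
        # Embedding: sentence-transformers model (assumed choice).
        embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
        # Dimensionality reduction.
        umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
        # Density-based clustering.
        hdbscan_model=HDBSCAN(
            min_cluster_size=min_cluster_size,
            metric="euclidean",
            prediction_data=True,
        ),
        # Bag-of-words per cluster; c-TF-IDF is then applied by BERTopic itself.
        vectorizer_model=CountVectorizer(stop_words="english"),
    )


def fit_and_visualize(posts, output_dir="."):
    """Fit the model on a list of post strings and save a bar chart of topic words."""
    topic_model = build_topic_model()
    topics, _ = topic_model.fit_transform(posts)

    # visualize_barchart returns a Plotly figure, which can be written to HTML.
    figure = topic_model.visualize_barchart()
    figure.write_html(str(Path(output_dir) / "topic_words.html"))

    return topic_model, topics
```

Calling `fit_and_visualize(posts)` with a list of post strings returns the fitted model and one topic id per post; `topic_model.get_topic_info()` then gives a summary table. Other plotting methods such as `visualize_topics()`, `visualize_hierarchy()`, and `visualize_documents(posts)` follow the same pattern, although some of them may need more than a handful of topics to work.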
Install the dependencies:
pip install -r requirements.txt
Then download the NLTK stopword list from a Python interpreter:
import nltk
nltk.download("stopwords")
Put your raw data files in ./raw_data/, then run:
python src/main.py