An end-to-end topic modeling pipeline for CSV datasets. This tool uses MALLET LDA when available and automatically falls back to gensim's LDA implementation if MALLET is not found. It's designed for analyzing text data, such as Stack Overflow posts, to discover underlying topics.
- Robust Preprocessing: Includes HTML/code stripping, tokenization, bigram/trigram creation, and lemmatization.
- Reproducibility: Exports the dictionary and corpus to ensure consistent results.
- Hyperparameter Tuning: Sweeps across various topic counts and multiple random restarts to find the best models.
- Model Evaluation: Scores models using c_v and u_mass coherence measures and provides a leaderboard of top performers.
- Comprehensive Export: Saves top models, including topics, term weights, and per-document assignments.
- Interactive Visualization: Generates pyLDAvis HTML files and summary plots for easy interpretation.
- Trend Analysis: Creates plots to track topic trends over time (daily, weekly, monthly, etc.).
- Automated Documentation: Generates a `README.md` for each run with links to all artifacts, and a `run_meta.json` file with complete metadata.
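The u_mass coherence used in model evaluation has a compact definition (sum over ranked term pairs of the log of smoothed co-document frequency over document frequency). As a rough illustration of what the pipeline's gensim `CoherenceModel` computes internally, here is a pure-Python sketch on a toy corpus (the corpus and topic terms are made up for the example):

```python
from math import log

def umass_coherence(top_terms, documents):
    """u_mass coherence for one topic (Mimno et al. style).

    top_terms: terms ranked by weight, most probable first.
    documents: list of token lists.
    """
    doc_sets = [set(doc) for doc in documents]

    def df(*terms):
        # Number of documents containing all of the given terms.
        return sum(all(t in d for t in terms) for d in doc_sets)

    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            # +1 smoothing avoids log(0) when terms never co-occur.
            score += log((df(top_terms[m], top_terms[l]) + 1) / df(top_terms[l]))
    return score

docs = [["cat", "dog"], ["cat", "fish"], ["dog", "cat"]]
print(umass_coherence(["cat", "dog"], docs))
```

Higher (less negative) scores indicate that a topic's top terms co-occur more often; the pipeline ranks models by these scores alongside c_v.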
.
├── main.py
├── data/                         # Place your input CSV files here
├── results/                      # All output files are saved here
│   └── <dataset_name>/
│       ├── <timestamp>/
│       │   ├── README.md
│       │   ├── run_meta.json
│       │   ├── model_catalog.csv
│       │   ├── all_models_coherence.csv
│       │   ├── top_models_index.csv
│       │   ├── dictionary.id2word
│       │   ├── corpus.mm
│       │   └── top1_.../, top2_.../  # Artifacts for each top model
│       └── ...
├── src/
│   ├── config.py
│   ├── data_preprocessor.py
│   ├── topic_modeler.py
│   └── visualization.py
└── mallet-2.0.8/                 # Optional: local copy of MALLET
You have two supported setup options. If you want to use MALLET LDA, follow Option A. If you are fine with the pure gensim implementation, follow Option B.
This option is required to use the MALLET implementation of LDA.
macOS (including Apple Silicon):
# Create and activate a conda environment
# (the legacy LdaMallet wrapper requires gensim < 4.0, so pin a compatible Python)
conda create -n lda-mallet python=3.8
conda activate lda-mallet

# Install MALLET
brew install mallet

# Install Python dependencies
pip install "gensim<4.0" spacy pandas matplotlib pyldavis beautifulsoup4 lxml
python -m spacy download en_core_web_sm
Windows:
- Install MALLET and unzip it to a local directory (e.g., `C:\mallet\`).
- Set the `MALLET_HOME` environment variable to this directory.
- Ensure the MALLET executable is at `%MALLET_HOME%\bin\mallet.bat`.
Verify Installation: Run the following commands to ensure MALLET and the required Python packages are correctly installed.
# Check MALLET
mallet
# Check Python packages
python -c "import gensim; print('gensim:', gensim.__version__); from gensim.models.wrappers.ldamallet import LdaMallet; print('LdaMallet import OK'); import spacy; print('spacy:', spacy.__version__)"
This option uses the latest version of gensim and is simpler to set up, but does not support MALLET.
# Create and activate a conda environment
conda create -n lda-gensim
conda activate lda-gensim
# Install dependencies
pip install gensim spacy pandas matplotlib pyldavis beautifulsoup4 lxml
python -m spacy download en_core_web_sm
Your input CSV file must contain the following columns:
- Body/Text: The main text content (required).
- Title: The title of the document (recommended, as it's combined with the body).
- Timestamp: A date/time column, required for trend analysis (e.g., `CreatedAt`).
You can specify the column names using the `--text-col`, `--title-col`, and `--date-col` flags.
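Conceptually, the preprocessor combines the title with the body before tokenization so title words count toward the topics. A stdlib-only sketch of that step (the column names `Title`, `Body`, and `CreatedAt` below are the illustrative defaults; pass your own via the flags above):

```python
import csv
import io

# In-memory stand-in for a real CSV file under data/.
sample = io.StringIO(
    "Title,Body,CreatedAt\n"
    "Pandas merge,How do I merge two frames?,2021-03-01\n"
    "Slow loop,Why is my Python loop slow?,2021-03-02\n"
)

rows = list(csv.DictReader(sample))
# Prepend the title to the body so its words are included in the text fed to LDA.
texts = [f"{r['Title']}. {r['Body']}".strip() for r in rows]
print(texts[0])
```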
Execute `main.py` with your desired parameters. Here is an example command:
python main.py \
--data data/YourData.csv \
--name YourData \
--mallet /path/to/mallet-binary \
--runs 3 \
--k-start 5 \
--k-limit 21 \
--k-step 1 \
--topn 5 \
--date-col CreatedAt \
--trend-freq M
- Omit the `--mallet` flag to use the pure `gensim` LDA implementation.
- On macOS/Linux, the MALLET path is typically `/opt/homebrew/bin/mallet` or `/usr/local/bin/mallet`. On Windows, it's the path to `mallet.bat`.
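The fallback behaviour described above can be sketched as a small lookup: try the explicit `--mallet` path first, then `MALLET_HOME`, then the `PATH`. This is a sketch of the general idea, not the tool's exact code:

```python
import os
import shutil

def resolve_mallet(cli_path=None):
    """Best-effort MALLET lookup; returns None, meaning 'fall back to gensim'."""
    candidates = []
    if cli_path:
        candidates.append(cli_path)
    home = os.environ.get("MALLET_HOME")
    if home:
        # Unix binary and Windows batch file layouts.
        candidates.append(os.path.join(home, "bin", "mallet"))
        candidates.append(os.path.join(home, "bin", "mallet.bat"))
    for c in candidates:
        if os.path.isfile(c) and os.access(c, os.X_OK):
            return c
    # Finally, search the PATH.
    return shutil.which("mallet")

print(resolve_mallet("/nonexistent/mallet"))
```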
All results will be saved in a timestamped subdirectory within `results/YourData/`.
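The timestamped layout is easy to reproduce; a minimal sketch using `pathlib` (the timestamp format here is illustrative, `main.py`'s actual naming may differ):

```python
import tempfile
from datetime import datetime
from pathlib import Path

def run_dir(name: str, root: str = "results") -> Path:
    """Create and return results/<name>/<timestamp>/."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(root) / name / stamp
    path.mkdir(parents=True, exist_ok=True)
    return path

# Demo under a temporary root so nothing is written into the repo.
demo = run_dir("YourData", root=tempfile.mkdtemp())
print(demo)
```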
--data PATH Path to the input CSV file.
--name STR Name for the dataset (used for the results folder).
--title-col STR Name of the title column.
--text-col STR Name of the body/text column.
--date-col STR Name of the datetime column for trend plots.
--trend-freq STR Frequency for trend resampling (D, W, M, Q, Y). Default: M.
--runs INT Number of random restarts for each topic count (K).
--k-start INT Minimum number of topics (inclusive).
--k-limit INT Maximum number of topics (exclusive).
--k-step INT Step size for sweeping K.
--topn INT Number of top models to save. Default: 5.
--mallet PATH Full path to the MALLET binary. If omitted, falls back to gensim.
--no-vis Flag to skip generating visualizations (for fast, headless runs).
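The flag list above maps directly onto a standard `argparse` parser; this sketch mirrors those options, with illustrative defaults for the K sweep taken from the example command (the actual parser in `main.py` may differ):

```python
import argparse

parser = argparse.ArgumentParser(description="Topic modeling pipeline")
parser.add_argument("--data", required=True, help="Path to the input CSV file")
parser.add_argument("--name", required=True, help="Dataset name (results folder)")
parser.add_argument("--title-col", help="Name of the title column")
parser.add_argument("--text-col", help="Name of the body/text column")
parser.add_argument("--date-col", help="Datetime column for trend plots")
parser.add_argument("--trend-freq", default="M", choices=list("DWMQY"))
parser.add_argument("--runs", type=int, default=3)
parser.add_argument("--k-start", type=int, default=5)
parser.add_argument("--k-limit", type=int, default=21)
parser.add_argument("--k-step", type=int, default=1)
parser.add_argument("--topn", type=int, default=5)
parser.add_argument("--mallet", help="MALLET binary; omit to fall back to gensim")
parser.add_argument("--no-vis", action="store_true", help="Skip visualizations")

# Only the two required flags; everything else takes its default.
args = parser.parse_args(["--data", "data/YourData.csv", "--name", "YourData"])
print(args)
```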
For each run, the following artifacts are saved under `results/<name>/<timestamp>/`:
- `run_meta.json`: A complete record of the environment, parameters, package versions, and timing.
- `all_models_coherence.csv`: Coherence scores for every model trained.
- `model_catalog.csv`: Paths to all saved model files.
- `top_models_index.csv`: A leaderboard of the top N models based on coherence.
- `dictionary.id2word` & `corpus.mm`: The dictionary and corpus used, for reproducibility.
- A subfolder for each top model, containing:
  - `lda_vis.html`: Interactive pyLDAvis visualization.
  - `topic_sizes.png`: A plot showing the distribution of dominant topics.
  - `topic_trend.png`: A plot of topic share over time (if a date column was provided).
  - `topics_top_terms.csv`: The top terms for each topic.
  - `topic_term_weights.csv`: A long-format file of (topic, term, weight).
  - `doc_topics.csv`: The dominant topic for each document.
  - The saved model files (`.gensim` and `.mallet`, if applicable).
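To illustrate the shape of `doc_topics.csv`: the dominant topic is simply the highest-probability entry of each document's topic distribution. A stdlib sketch using made-up distributions in gensim's `(topic_id, probability)` format:

```python
import csv
import io

# Hypothetical per-document topic distributions, as returned by an LDA model.
doc_topic_dists = [
    [(0, 0.7), (1, 0.3)],
    [(0, 0.2), (1, 0.8)],
]

buf = io.StringIO()  # stands in for the real doc_topics.csv file
writer = csv.writer(buf)
writer.writerow(["doc_id", "dominant_topic", "probability"])
for doc_id, dist in enumerate(doc_topic_dists):
    topic, prob = max(dist, key=lambda tp: tp[1])  # highest-probability topic
    writer.writerow([doc_id, topic, prob])

print(buf.getvalue())
```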
You can modify default settings in `src/config.py`, such as default paths, column names, and topic sweep parameters.
Customization Ideas:
- Domain-Specific Stopwords: Add custom stopwords in `DataPreprocessor` to remove common but uninformative words from your dataset.
- Phrase Detection: Tweak the `min_count` and `threshold` parameters for bigram and trigram detection.
- Dictionary Pruning: Adjust the `no_below` and `no_above` parameters in `filter_extremes` to control the vocabulary size.
- Coherence Metrics: Add other coherence measures like `c_npmi` or `c_uci` for a more comprehensive evaluation.
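To make the pruning knobs concrete: `no_below` drops terms appearing in too few documents, while `no_above` drops terms appearing in too large a fraction of them. A pure-Python sketch of the effect (gensim's own `filter_extremes` also caps the vocabulary with a `keep_n` parameter):

```python
from collections import Counter

def prune_vocab(docs, no_below=2, no_above=0.5):
    """Keep terms with document frequency in [no_below, no_above * n_docs]."""
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    return {t for t, n in df.items()
            if n >= no_below and n / n_docs <= no_above}

docs = [["cat", "dog"], ["cat", "fish"], ["cat", "dog"], ["cat", "bird"]]
# "cat" is in every doc (too common), "fish"/"bird" in one doc each (too rare).
print(sorted(prune_vocab(docs)))
```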
- MALLET Not Found: If the script falls back to `gensim` unexpectedly, ensure the `--mallet` path is correct or that the `MALLET_HOME` environment variable is set.
- Missing NLTK Data: If you get an error about missing `stopwords`, run `import nltk; nltk.download('stopwords')` in a Python shell.
- Missing spaCy Model: If you see an error for `en_core_web_sm`, run: `python -m spacy download en_core_web_sm`
- Performance: To speed up the pipeline, reduce `--runs`, narrow the range of K values, or use the `--no-vis` flag.