An automated topic modeling pipeline for performing Latent Dirichlet Allocation (LDA) on text corpora. It uses the MALLET toolkit for efficient topic modeling and includes automated model selection and rich visualization of results.

mehdisn/topic-modeling


LDA Topic Modeling Pipeline (MALLET + gensim)

An end-to-end topic modeling pipeline for CSV datasets. This tool uses MALLET LDA when available and automatically falls back to gensim's LDA implementation if MALLET is not found. It's designed for analyzing text data, such as Stack Overflow posts, to discover underlying topics.
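The MALLET-or-gensim decision can be pictured with a small sketch. This is illustrative only: the real logic lives in `main.py`, and the function name here is hypothetical. It assumes the selection order described in this README (explicit `--mallet` path, then `MALLET_HOME`, then the system `PATH`):

```python
import os
import shutil

def resolve_mallet(cli_path=None):
    """Return a usable MALLET binary path, or None to fall back to gensim LDA.

    Candidates, in order: the --mallet CLI flag, then MALLET_HOME/bin/mallet,
    then whatever `mallet` is on the PATH.
    """
    candidates = []
    if cli_path:
        candidates.append(cli_path)
    home = os.environ.get("MALLET_HOME")
    if home:
        candidates.append(os.path.join(home, "bin", "mallet"))
    for cand in candidates:
        if os.path.isfile(cand) and os.access(cand, os.X_OK):
            return cand
    # None here means the pipeline uses gensim's LdaModel instead.
    return shutil.which("mallet")
```

If every candidate fails, the pipeline silently switches to gensim's implementation, which is why a wrong `--mallet` path does not abort the run.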


✨ Features

  • Robust Preprocessing: Includes HTML/code stripping, tokenization, bigram/trigram creation, and lemmatization.
  • Reproducibility: Exports the dictionary and corpus to ensure consistent results.
  • Hyperparameter Tuning: Sweeps across various topic counts and multiple random restarts to find the best models.
  • Model Evaluation: Scores models using c_v and u_mass coherence measures and provides a leaderboard of top performers.
  • Comprehensive Export: Saves top models, including topics, term weights, and per-document assignments.
  • Interactive Visualization: Generates pyLDAvis HTML files and summary plots for easy interpretation.
  • Trend Analysis: Creates plots to track topic trends over time (daily, weekly, monthly, etc.).
  • Automated Documentation: Generates a README.md for each run with links to all artifacts and a run_meta.json file with complete metadata.
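The first two preprocessing steps can be sketched with stdlib tools. The pipeline itself uses BeautifulSoup for HTML stripping and spaCy for lemmatization; this regex version is only illustrative:

```python
import re

def strip_html_and_code(text):
    """Drop <pre>/<code> blocks (their contents are code, not prose),
    then strip any remaining HTML tags. Illustrative regex version only."""
    text = re.sub(r"<pre>.*?</pre>|<code>.*?</code>", " ", text, flags=re.S)
    text = re.sub(r"<[^>]+>", " ", text)          # remaining tags
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text, min_len=3):
    """Lowercase alphabetic tokens, with short tokens dropped."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) >= min_len]

doc = "<p>How do I <code>pip install</code> a package?</p>"
print(tokenize(strip_html_and_code(doc)))  # → ['how', 'package']
```

Bigram/trigram detection and lemmatization then run on these token lists (gensim's `Phrases` and spaCy, respectively).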

📁 Repository Structure

.
├── main.py
├── data/                   # Place your input CSV files here
├── results/                # All output files are saved here
│   └── <dataset_name>/
│       ├── <timestamp>/
│       │   ├── README.md
│       │   ├── run_meta.json
│       │   ├── model_catalog.csv
│       │   ├── all_models_coherence.csv
│       │   ├── top_models_index.csv
│       │   ├── dictionary.id2word
│       │   ├── corpus.mm
│       │   └── top1_.../, top2_.../ # Artifacts for each top model
│       └── ...
├── src/
│   ├── config.py
│   ├── data_preprocessor.py
│   ├── topic_modeler.py
│   └── visualization.py
└── mallet-2.0.8/           # Optional: local copy of MALLET

🐍 Environment Setup

You have two supported setup options. If you want to use MALLET LDA, follow Option A. If you are fine with the pure gensim implementation, follow Option B.

Option A: MALLET (Recommended)

This option is required to use the MALLET implementation of LDA.

macOS (including Apple Silicon):

# Create and activate a conda environment
conda create -n lda-mallet python=3.8
conda activate lda-mallet

# Install MALLET (requires a Java runtime)
brew install mallet

# Install the Python dependencies; the LdaMallet wrapper was removed in
# gensim 4.x, so pin gensim below 4
pip install "gensim<4" spacy pandas matplotlib pyldavis beautifulsoup4 lxml
python -m spacy download en_core_web_sm

Windows:

  1. Download the MALLET release and unzip it to a local directory (e.g., C:\mallet\).
  2. Set the MALLET_HOME environment variable to this directory.
  3. Ensure the MALLET executable is at %MALLET_HOME%\bin\mallet.bat.

Verify Installation: Run the following commands to ensure MALLET and the required Python packages are correctly installed.

# Check MALLET
mallet

# Check Python packages
python -c "import gensim; print('gensim:', gensim.__version__)"
python -c "from gensim.models.wrappers.ldamallet import LdaMallet; print('LdaMallet import OK')"
python -c "import spacy; print('spacy:', spacy.__version__)"

Option B: gensim LDA

This option uses a current gensim release and is simpler to set up, but it does not support MALLET.

# Create and activate a conda environment
conda create -n lda-gensim python=3.10
conda activate lda-gensim

# Install dependencies
pip install gensim spacy pandas matplotlib pyldavis beautifulsoup4 lxml
python -m spacy download en_core_web_sm

🚀 Usage

1. Data Requirements

Your input CSV file must contain the following columns:

  • Body/Text: The main text content (required).
  • Title: The title of the document (recommended, as it's combined with the body).
  • Timestamp: A date/time column, required for trend analysis (e.g., CreatedAt).

You can specify the column names using the --text-col, --title-col, and --date-col flags.
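A minimal example of the expected shape, assuming column names like the ones above (the pipeline concatenates title and body before preprocessing; the `text` column name here is illustrative):

```python
import pandas as pd

# One-row stand-in for an input CSV with the columns described above.
df = pd.DataFrame({
    "Title": ["How to parse JSON?"],
    "Body": ["<p>I need to parse JSON in Python.</p>"],
    "CreatedAt": ["2023-05-01T12:00:00"],
})

# Title + body are combined into a single text field for modeling.
df["text"] = df["Title"].fillna("") + " " + df["Body"].fillna("")
# The date column must be parseable as datetimes for trend analysis.
df["CreatedAt"] = pd.to_datetime(df["CreatedAt"])
print(df.loc[0, "text"])
```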

2. Run the Pipeline

Execute main.py with your desired parameters. Here is an example command:

python main.py \
  --data data/YourData.csv \
  --name YourData \
  --mallet /path/to/mallet-binary \
  --runs 3 \
  --k-start 5 \
  --k-limit 21 \
  --k-step 1 \
  --topn 5 \
  --date-col CreatedAt \
  --trend-freq M
  • Omit the --mallet flag to use the pure gensim LDA implementation.
  • On macOS/Linux, the MALLET path is typically /opt/homebrew/bin/mallet or /usr/local/bin/mallet. On Windows, it's the path to mallet.bat.

All results will be saved in a timestamped subdirectory within results/YourData/.

3. CLI Arguments

--data PATH         Path to the input CSV file.
--name STR          Name for the dataset (used for the results folder).
--title-col STR     Name of the title column.
--text-col STR      Name of the body/text column.
--date-col STR      Name of the datetime column for trend plots.
--trend-freq STR    Frequency for trend resampling (D, W, M, Q, Y). Default: M.
--runs INT          Number of random restarts for each topic count (K).
--k-start INT       Minimum number of topics (inclusive).
--k-limit INT       Maximum number of topics (exclusive).
--k-step INT        Step size for sweeping K.
--topn INT          Number of top models to save. Default: 5.
--mallet PATH       Full path to the MALLET binary. If omitted, falls back to gensim.
--no-vis            Flag to skip generating visualizations (for fast, headless runs).
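The `--trend-freq` codes correspond to pandas period aliases. A rough sketch of how a monthly topic share (`--trend-freq M`) could be computed from per-document dominant-topic assignments; the column names here are assumptions, not the pipeline's exact schema:

```python
import pandas as pd

# Toy per-document output: timestamp plus dominant topic id.
docs = pd.DataFrame({
    "CreatedAt": pd.to_datetime(["2023-01-05", "2023-01-20",
                                 "2023-02-03", "2023-02-10"]),
    "dominant_topic": [0, 1, 1, 1],
})

# Count documents per (month, topic), then normalize each month's
# counts to fractions so topic shares sum to 1 per period.
period = docs["CreatedAt"].dt.to_period("M")
counts = docs.groupby([period, "dominant_topic"]).size().unstack(fill_value=0)
share = counts.div(counts.sum(axis=1), axis=0)
print(share)
```

Swapping `"M"` for `"D"`, `"W"`, `"Q"`, or `"Y"` changes the resampling granularity in the same way the flag does.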

📤 Output Artifacts

For each run, the following artifacts are saved under results/<name>/<timestamp>/:

  • run_meta.json: A complete record of the environment, parameters, package versions, and timing.
  • all_models_coherence.csv: Coherence scores for every model trained.
  • model_catalog.csv: Paths to all saved model files.
  • top_models_index.csv: A leaderboard of the top N models based on coherence.
  • dictionary.id2word & corpus.mm: The dictionary and corpus used, for reproducibility.
  • A subfolder for each top model, containing:
    • lda_vis.html: Interactive pyLDAvis visualization.
    • topic_sizes.png: A plot showing the distribution of dominant topics.
    • topic_trend.png: A plot of topic share over time (if a date column was provided).
    • topics_top_terms.csv: The top terms for each topic.
    • topic_term_weights.csv: A long-format file of (topic, term, weight).
    • doc_topics.csv: The dominant topic for each document.
    • The saved model files (.gensim and .mallet if applicable).
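Because topic_term_weights.csv is long-format, it is easy to post-process. For example, summarizing the top terms per topic (toy data; the column names match the (topic, term, weight) layout listed above):

```python
import pandas as pd

# Hypothetical rows from the long-format weights file.
weights = pd.DataFrame({
    "topic":  [0, 0, 1, 1],
    "term":   ["python", "pandas", "java", "spring"],
    "weight": [0.08, 0.05, 0.07, 0.04],
})

# Highest-weighted terms first, then take the top 2 per topic.
top_terms = (weights.sort_values("weight", ascending=False)
                    .groupby("topic")["term"]
                    .apply(lambda s: ", ".join(s.head(2))))
print(top_terms.to_dict())  # → {0: 'python, pandas', 1: 'java, spring'}
```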

⚙️ Configuration & Customization

You can modify default settings in src/config.py, such as default paths, column names, and topic sweep parameters.

Customization Ideas:

  • Domain-Specific Stopwords: Add custom stopwords in DataPreprocessor to remove common but uninformative words from your dataset.
  • Phrase Detection: Tweak the min_count and threshold parameters for bigram and trigram detection.
  • Dictionary Pruning: Adjust the no_below and no_above parameters in filter_extremes to control the vocabulary size.
  • Coherence Metrics: Add other coherence measures like c_npmi or c_uci for a more comprehensive evaluation.
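For reference, `filter_extremes` interprets `no_below` as an absolute document count and `no_above` as a fraction of the corpus. Its pruning rule can be mimicked with the stdlib (a sketch, not the pipeline's code):

```python
from collections import Counter

docs = [["python", "error", "loop"],
        ["python", "pandas", "error"],
        ["python", "java", "error"],
        ["python", "loop", "spring"]]

def prune_vocab(docs, no_below=2, no_above=0.9):
    """Keep terms that occur in at least no_below documents and in at
    most no_above (as a fraction) of all documents."""
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    return {t for t, c in df.items() if c >= no_below and c / n_docs <= no_above}

# "python" appears in every doc (> 0.9) and rare terms appear once (< 2),
# so both extremes are pruned.
print(sorted(prune_vocab(docs)))  # → ['error', 'loop']
```

Raising `no_below` or lowering `no_above` shrinks the vocabulary, which usually speeds up training at some cost in topic detail.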

🆘 Troubleshooting

  • MALLET Not Found: If the script falls back to gensim unexpectedly, ensure the --mallet path is correct or that the MALLET_HOME environment variable is set.
  • Missing NLTK Data: If you get an error about missing stopwords, run:
    import nltk; nltk.download('stopwords')
  • Missing spaCy Model: If you see an error for en_core_web_sm, run:
    python -m spacy download en_core_web_sm
  • Performance: To speed up the pipeline, reduce --runs, narrow the range of K values, or use the --no-vis flag.
