A framework for orchestrating multiple Large Language Models (LLMs) using sparse top-k routing. MoT dynamically selects and combines outputs from multiple expert models based on input characteristics, enabling efficient and specialized model deployment.
The Mixture of Thoughts framework implements a multi-expert system where different language models act as specialized experts. A learned router dynamically selects the most appropriate experts for each input, combining their outputs through attention mechanisms and stack-based layer partitioning.
- Sparse Top-K Routing: A learned router scores all experts and activates only the top-k per input; the router is modular, so alternative routing strategies can be plugged in (see the sketch after this list)
- Stack-Based Layer Partitioning: Divides model layers into stacks for fine-grained expert interaction
- Cross-Expert Attention: Enables information flow between selected experts
- Distributed Training: Supports multi-GPU training with DDP (Distributed Data Parallel)
- Multiple Benchmark Support: Compatible with MMLU, GSM8K, CMMLU, ARC-Challenge, HumanEval, ParEval, and BabelTower datasets
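As a rough illustration of the routing idea, the sketch below scores experts with a small MLP and keeps only the top-k. The `ToyRouter` class, dimensions, and renormalization are illustrative assumptions, not the repository's router implementation.

```python
# Minimal sketch of sparse top-k routing (illustrative only; names and shapes
# are assumptions, not the repository's Router implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRouter(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, num_experts: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_experts),
        )

    def forward(self, pooled_embedding: torch.Tensor, top_k: int):
        # Score every expert, then keep only the k highest-scoring ones.
        logits = self.mlp(pooled_embedding)              # (batch, num_experts)
        top_vals, top_idx = logits.topk(top_k, dim=-1)   # (batch, top_k)
        weights = F.softmax(top_vals, dim=-1)            # renormalize over selected experts
        return top_idx, weights

router = ToyRouter(input_dim=768, hidden_dim=256, num_experts=4)
idx, w = router(torch.randn(2, 768), top_k=3)
print(idx.shape, w.shape)  # torch.Size([2, 3]) torch.Size([2, 3])
```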
- Python 3.8 or higher
- CUDA-capable GPU (recommended)
- PyTorch 2.0.0 or higher
- Clone the repository:
```bash
git clone <repository-url>
cd mot
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Download expert models (optional, will download automatically during first use):

```bash
python download_experts.py --config configs/routerdc_mot_config.json
```

Project structure:

```
mot/
├── mixture_of_thoughts.py # Core MoT framework implementation
├── training.py # Training utilities and loss functions
├── train_ddp.py # Distributed training script
├── train_cuda.py # C++ to CUDA translation training script
├── evaluate.py # General evaluation script
├── evaluate_babeltower.py # BabelTower benchmark evaluation script
├── cuda_dataset.py # CUDA dataset loading utilities
├── cuda_evaluation.py # CUDA code evaluation metrics
├── dataset_loaders.py # General dataset loading utilities
├── utils.py # Helper functions and utilities
├── experiments/ # Experiment results and checkpoints
├── logs/ # Training logs
├── BabelTower/ # BabelTower dataset
│ └── dataset/
│ ├── cpp.para.test.tok
│ ├── cpp.para.valid.tok
│ ├── cuda.para.test.tok
│ └── cuda.para.valid.tok
├── pareval/ # ParEval benchmark integration
│ ├── generate_pareval.py # Code generation script
│ ├── evaluate_pareval.py # Evaluation script
│ ├── drivers/ # ParEval test drivers
│ └── README.md # Detailed ParEval documentation
└── requirements.txt # Package dependencies
```
Run a simple inference example:
```bash
python example.py --demo inference
```

Launch distributed training:

```bash
python train_ddp.py --config configs/routerdc_mot_config.json
```

Or use the experiment runner script:

```bash
./run_experiments.sh --gpus 0,1,2,3 --exp-name mot_experiment
```

Or using torchrun directly:

```bash
torchrun --nproc_per_node=4 train_ddp.py --config configs/routerdc_mot_config.json
```

The framework is configured through JSON files. Key configuration parameters include:

```json
{
"experiment_name": "mot_experiment",
"expert_models": [
"model_name_1",
"model_name_2"
],
"mot_config": {
"num_stacks": 4,
"top_k": 3,
"shared_dim": 768,
"router_hidden_dim": 256,
"interaction_heads": 8
},
"training": {
"batch_size": 8,
"learning_rate": 1e-4,
"num_epochs": 10,
"gradient_accumulation_steps": 4
}
}
```

Basic usage example:

```python
from mixture_of_thoughts import MixtureOfThoughts, MoTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load expert models
expert_models = [
AutoModelForCausalLM.from_pretrained("gpt2"),
AutoModelForCausalLM.from_pretrained("distilgpt2")
]
tokenizers = [
AutoTokenizer.from_pretrained("gpt2"),
AutoTokenizer.from_pretrained("distilgpt2")
]
# Configure MoT
config = MoTConfig(
num_stacks=4,
top_k=2,
shared_dim=768
)
# Initialize framework
mot_model = MixtureOfThoughts(
expert_models=expert_models,
tokenizers=tokenizers,
config=config
)
# Run inference
input_ids = tokenizers[0]("Hello, world!", return_tensors="pt").input_ids
outputs = mot_model(input_ids=input_ids)
```

The framework includes loaders for the following benchmark datasets:
- ParEval: Parallel code generation benchmark (see below)
- BabelTower: C-to-CUDA auto-parallelized program translation benchmark (see below)
The MoT framework includes integration with BabelTower, a benchmark for evaluating auto-parallelized program translation from sequential C to parallel CUDA code.
- Download the BabelTower dataset and place it in the `BabelTower/dataset/` directory:

```
mot/BabelTower/dataset/
├── cpp.para.test.tok # C++ test set (180 samples)
├── cpp.para.valid.tok # C++ validation set (184 samples)
├── cuda.para.test.tok # CUDA test set (180 samples)
└── cuda.para.valid.tok # CUDA validation set (184 samples)
```
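Assuming the `.tok` files are line-aligned parallel corpora (consistent with the matching sample counts above), a minimal way to read a split might look like the sketch below; the project's actual loaders live in `cuda_dataset.py` and `dataset_loaders.py`.

```python
# Illustrative reading of the line-aligned .tok pairs (assumption: line i of the
# cpp file corresponds to line i of the cuda file). The project's own loaders
# are in cuda_dataset.py / dataset_loaders.py.
from pathlib import Path

def load_parallel_split(dataset_dir: str, split: str):
    cpp_path = Path(dataset_dir) / f"cpp.para.{split}.tok"
    cuda_path = Path(dataset_dir) / f"cuda.para.{split}.tok"
    with cpp_path.open() as f_cpp, cuda_path.open() as f_cuda:
        pairs = [(c.strip(), g.strip()) for c, g in zip(f_cpp, f_cuda)]
    return pairs

pairs = load_parallel_split("./BabelTower/dataset", "test")
print(len(pairs))  # expected: 180 for the test split
```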
- Run the evaluation:

```bash
# Basic evaluation on test set
python evaluate_babeltower.py
# Evaluate with specific checkpoint
python evaluate_babeltower.py \
--checkpoint_path ./cuda_mot_output/best_model.pt \
--split test
# Quick test with limited samples
python evaluate_babeltower.py --num_samples 10
# Full MoT generation mode (slower but more accurate)
python evaluate_babeltower.py --use_mot_generate
```

Command-line arguments for `evaluate_babeltower.py`:

| Argument | Default | Description |
|---|---|---|
| `--dataset_dir` | `./BabelTower/dataset` | Path to BabelTower dataset directory |
| `--checkpoint_path` | `./cuda_mot_output/best_model.pt` | Path to trained MoT model checkpoint |
| `--output_dir` | `./babeltower_eval_output` | Directory to save evaluation results |
| `--split` | `test` | Dataset split to evaluate (`test` or `valid`) |
| `--num_samples` | `None` (all) | Number of samples to evaluate (for quick testing) |
| `--max_new_tokens` | `256` | Maximum number of tokens to generate |
| `--temperature` | `0.7` | Sampling temperature for generation |
| `--top_p` | `0.95` | Top-p (nucleus) sampling parameter |
| `--use_mot_generate` | `False` | Use full MoT generation with expert interaction |
| `--no_mot_generate` | - | Disable MoT generation (use fast mode) |
| `--use_4bit` | `True` | Enable 4-bit quantization |
| `--use_8bit` | `False` | Enable 8-bit quantization |
| `--single_gpu` | `True` | Load all models on a single GPU |
| `--check_compilation` | `True` | Check CUDA syntax validity |
| `--save_predictions` | `True` | Save detailed predictions to JSON |
| `--seed` | `42` | Random seed for reproducibility |
Match your evaluation precision with training settings:
| Training Setting | Evaluation Command |
|---|---|
| 4-bit quantization | `--use_4bit` (default) |
| 8-bit quantization | `--use_8bit` |
| fp16 (half precision) | Do not specify `--use_4bit` or `--use_8bit` |
Note: If you trained with fp16, modify `DEFAULT_CONFIG` in `evaluate_babeltower.py`:

```python
DEFAULT_CONFIG = {
...
'use_4bit': False, # Set to False for fp16
'use_8bit': False,
...
}
```

The evaluation script implements metrics from the BabelTower paper:
| Metric | Description |
|---|---|
| BLEU | Standard n-gram matching score (0-100) |
| CodeBLEU | Code-specific metric considering syntax and data flow |
| ParaBLEU | Parallel semantics metric for CUDA (considers CUDA keywords, loop structure, thread indexing patterns) |
| Compilation Accuracy | Percentage of generated code with valid CUDA syntax |
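As an independent sanity check of the BLEU column, something like the following can be run with the `sacrebleu` package (an extra dependency assumption; CodeBLEU and ParaBLEU are computed inside `evaluate_babeltower.py` and are not reproduced here).

```python
# Independent BLEU sanity check with sacrebleu (assumption: sacrebleu may not be
# among the project's dependencies). This only approximates the BLEU column.
import sacrebleu

hypotheses = ["__global__ void add ( float * a , float * b , int n ) { ... }"]
references = ["__global__ void add ( float * a , float * b , int n ) { ... }"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")  # 0-100 scale, matching the table above
```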
After evaluation, results are saved to `--output_dir`:

- `babeltower_<split>_metrics.json`: Evaluation metrics summary
- `babeltower_<split>_predictions.json`: Detailed predictions for each sample
Example metrics output:

```json
{
"bleu": 44.57,
"codebleu": 60.01,
"parableu": 17.62,
"compilation_accuracy": 90.0,
"num_samples": 180,
"expert_usage": {"0": 150, "1": 20, "2": 10},
"generation_mode": "fast"
}
```

Full evaluation on the test set:

```bash
python evaluate_babeltower.py \
--dataset_dir ./BabelTower/dataset \
--checkpoint_path ./cuda_mot_output/best_model.pt \
--split test \
--use_mot_generate \
--save_predictions
```

Quick check on the validation split:

```bash
python evaluate_babeltower.py \
--split valid \
--num_samples 20 \
--max_new_tokens 128
```

Compare different checkpoints:

```bash
# Evaluate best model
python evaluate_babeltower.py \
--checkpoint_path ./cuda_mot_output/best_model.pt \
--output_dir ./eval_best
# Evaluate last epoch model
python evaluate_babeltower.py \
--checkpoint_path ./cuda_mot_output/checkpoint-2000.pt \
--output_dir ./eval_last
```

Higher-quality generation with full MoT mode:

```bash
python evaluate_babeltower.py \
--use_mot_generate \
--max_new_tokens 512 \
--temperature 0.5
```

The MoT framework includes integration with ParEval, a benchmark for evaluating parallel code generation capabilities across multiple parallelism models (CUDA, OpenMP, MPI, etc.). Quick start:

```bash
cd mot/pareval
# 1. Generate code samples (C++ to CUDA translation)
python generate_pareval.py \
--prompts ~/ParEval/prompts/translation-prompts.json \
--task translation \
--output ./outputs/mot_translation.json \
--num_samples 50
# 2. Evaluate generated code (compile, run, compute pass@k)
python evaluate_pareval.py --input ./outputs/mot_translation.json
```

The ParEval integration provides:

- Code Generation: Generate parallel code samples using trained MoT models
- Automated Evaluation: Compile and test generated code against ParEval test cases
- Metrics Computation: Calculate pass@k, build@k, and other standard metrics (see the pass@k sketch after this list)
- Multiple Tasks: Supports both translation (C++ → CUDA) and direct generation tasks
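For reference, pass@k is typically computed with the standard unbiased estimator shown below; this is a sketch of the general formula, and `evaluate_pareval.py` may implement it differently.

```python
# Unbiased pass@k estimator (the standard formulation used by code benchmarks).
# A sketch only; evaluate_pareval.py's exact implementation may differ.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that pass, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=50, c=10, k=1))   # fraction of single draws expected to pass
print(pass_at_k(n=50, c=10, k=10))  # probability at least one of 10 draws passes
```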
For detailed usage, configuration options, and troubleshooting, see `pareval/README.md`.
To launch experiments with the helper script:

```bash
./run_experiments.sh [OPTIONS]
Options:
-g, --gpus GPU IDs to use (e.g., '0,1,2,3')
-n, --num-gpus Number of GPUs to use
-c, --config Path to configuration file
-e, --exp-name Experiment name for logging
-w, --wandb-mode WandB mode: online, offline, disabled
-r, --resume Path to checkpoint to resume from
```

Expert models can also be pre-downloaded in parallel:

```bash
python download_experts.py --config configs/routerdc_mot_config.json --max-workers 4
```

The MoT framework consists of several key components (a sketch of the stack partitioning and cross-expert attention follows the list):
- Router Network: A learnable MLP that assigns scores to each expert based on input embeddings
- Expert Models: Pre-trained language models that serve as specialized experts
- Stack Partitioning: Divides each expert into Q stacks of layers
- Interaction Layers: Cross-attention mechanisms between selected experts
- Output Aggregation: Combines expert outputs using learned weights
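A rough sketch of two of these pieces, stack partitioning and cross-expert attention, is given below; the class names, shapes, and residual scheme are assumptions for exposition, not the `MixtureOfThoughts` internals.

```python
# Illustrative sketch of stack partitioning and cross-expert attention.
# Shapes, names, and the interaction scheme are assumptions, not MoT internals.
import torch
import torch.nn as nn

def partition_into_stacks(layers: nn.ModuleList, num_stacks: int):
    """Split an expert's transformer layers into num_stacks contiguous stacks."""
    per_stack = max(1, len(layers) // num_stacks)
    return [layers[i:i + per_stack] for i in range(0, len(layers), per_stack)]

class CrossExpertInteraction(nn.Module):
    """Let one expert's hidden states attend to another expert's hidden states."""
    def __init__(self, shared_dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(shared_dim, num_heads, batch_first=True)

    def forward(self, own_states: torch.Tensor, other_states: torch.Tensor):
        mixed, _ = self.attn(query=own_states, key=other_states, value=other_states)
        return own_states + mixed  # residual connection

stacks = partition_into_stacks(nn.ModuleList([nn.Linear(768, 768) for _ in range(12)]),
                               num_stacks=4)
print([len(s) for s in stacks])  # [3, 3, 3, 3]

interact = CrossExpertInteraction(shared_dim=768, num_heads=8)
a = torch.randn(2, 16, 768)  # expert A hidden states (batch, seq, dim)
b = torch.randn(2, 16, 768)  # expert B hidden states
print(interact(a, b).shape)  # torch.Size([2, 16, 768])
```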
The framework uses multiple loss components (a sketch of how they can be combined follows the list):
- Primary task loss (e.g., language modeling)
- Router entropy regularization
- Load balancing loss for expert utilization
- Auxiliary expert-specific losses
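A minimal sketch of how such terms might be combined is shown below; the coefficients and exact formulations are assumptions, and the project's actual losses are defined in `training.py`.

```python
# Sketch of combining the loss terms (coefficients and exact formulations are
# assumptions; see training.py for the project's actual losses).
import torch
import torch.nn.functional as F

def mot_loss(lm_loss, router_probs, expert_counts, entropy_coef=0.01, balance_coef=0.01):
    # Entropy of the routing distribution (whether it is rewarded or penalized
    # is a training choice; here it is simply added with a small coefficient).
    entropy = -(router_probs * torch.log(router_probs + 1e-9)).sum(dim=-1).mean()

    # Load balancing: penalize deviation from uniform expert utilization.
    usage = expert_counts / expert_counts.sum()
    uniform = torch.full_like(usage, 1.0 / usage.numel())
    balance = F.mse_loss(usage, uniform)

    return lm_loss + entropy_coef * entropy + balance_coef * balance

probs = torch.softmax(torch.randn(8, 4), dim=-1)    # router distribution over 4 experts
counts = torch.tensor([150.0, 20.0, 10.0, 6.0])     # how often each expert was selected
print(mot_loss(torch.tensor(2.3), probs, counts))
```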
- Models are cached in `~/.cache/huggingface/hub` by default
- Supports 8-bit quantization for memory efficiency (see the loading sketch after this list)
- Implements gradient checkpointing for large models
- Uses mixed precision training (fp16/bf16) when available
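For example, an expert can be loaded with 8-bit weights using the standard `transformers` + `bitsandbytes` pattern below; this is a generic Hugging Face recipe, not necessarily how the framework loads experts internally.

```python
# Loading an expert with 8-bit weights via transformers + bitsandbytes
# (a standard Hugging Face pattern; not necessarily how MoT loads experts internally).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)
expert = AutoModelForCausalLM.from_pretrained(
    "gpt2",                      # any expert checkpoint name from the config
    quantization_config=quant_config,
    device_map="auto",           # place layers across available devices
)
expert.gradient_checkpointing_enable()  # trade compute for memory on large models
```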
- Training logs are saved to the `logs/` directory
- Experiment results and checkpoints are stored in `experiments/`
- Supports Weights & Biases (wandb) integration for experiment tracking
- Real-time training metrics displayed during training
Run the test suite:

```bash
pytest tests/
```

Format and lint the code:

```bash
black .
isort .
flake8 .
```

This project is licensed under the MIT License. See the LICENSE file for details.
- Out of Memory: Reduce batch size or enable gradient checkpointing
- Model Download Failures: Check network connection and HuggingFace hub access
- DDP Training Issues: Ensure all GPUs are visible and NCCL is properly installed (see the diagnostic sketch after this list)
- Dataset Not Found: Ensure the BabelTower dataset is in `./BabelTower/dataset/`
- Checkpoint Not Found: Verify that the checkpoint path exists
- Low ParaBLEU Score: This metric heavily penalizes incorrect parallel semantics; check if generated code uses proper CUDA thread indexing
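Before launching DDP training, a quick diagnostic like the following (standard PyTorch calls) can confirm GPU visibility and NCCL availability.

```python
# Quick diagnostic before launching DDP training: confirm GPU visibility and NCCL.
import torch
import torch.distributed as dist

print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
print("NCCL available:", dist.is_nccl_available())
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```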
Enable debug logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

References:

- BabelTower: Wen et al., "BabelTower: Learning to Auto-parallelized Program Translation", ICML 2022
- ParEval: Parallel Code Foundry benchmark for parallel code generation
For questions or issues, please open an issue on the repository.