This repository implements a structured experimentation framework for evaluating Large Language Model (LLM) response quality using simulated users, behavioral metrics, and causal inference methods.
The system simulates user conversations under different LLM configurations and estimates treatment effects on response quality with experimentation techniques common in large-scale product analytics.
- Simulated user agents generating conversational interaction data
- Structured A/B testing framework for LLM responses
- Behavioral metrics derived from conversation outcomes
- Causal inference via Average Treatment Effect (ATE) estimation with CUPED variance reduction (a minimal sketch follows this list)
- Reproducible experiment suites for prompt strategies, temperature tuning, and model scaling
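
For the causal-inference step, here is a minimal, self-contained sketch of an ATE estimate with a CUPED adjustment on synthetic data. The per-conversation quality score and the pre-experiment covariate are invented for illustration; the repository's own metric definitions live in its metrics engine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: per-conversation quality scores (a stand-in for RQI)
# plus a pre-experiment covariate (e.g., a persona's historical average score).
n = 2_000
pre = rng.normal(0.6, 0.1, size=n)                    # pre-experiment covariate
treat = rng.integers(0, 2, size=n)                    # 0 = control arm, 1 = treatment arm
y = pre + 0.03 * treat + rng.normal(0, 0.05, size=n)  # observed quality, true effect = 0.03

# Naive ATE: difference in mean outcome between the two arms.
ate_naive = y[treat == 1].mean() - y[treat == 0].mean()

# CUPED: subtract theta * (covariate - covariate mean) to remove pre-existing variance.
theta = np.cov(y, pre)[0, 1] / np.var(pre, ddof=1)
y_cuped = y - theta * (pre - pre.mean())
ate_cuped = y_cuped[treat == 1].mean() - y_cuped[treat == 0].mean()

print(f"naive ATE: {ate_naive:.4f} (outcome variance {y.var():.5f})")
print(f"CUPED ATE: {ate_cuped:.4f} (adjusted variance {y_cuped.var():.5f})")
```

Because the covariate is fixed before treatment assignment, the CUPED adjustment does not bias the effect estimate; it only removes variance the covariate already explains, which tightens the resulting confidence intervals.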
Simulated Users → Conversation Engine → LLM Models → Conversation Logs → Metrics Engine → Experiment Analysis → Leaderboards
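
To make the pipeline concrete, the sketch below shows one way a conversation log record could look as it moves from the Conversation Engine to the Metrics Engine. The field names and the toy metric are assumptions made for illustration, not the repository's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ConversationLog:
    """One simulated conversation as it might leave the Conversation Engine (illustrative schema)."""
    persona_id: str       # which simulated user produced the conversation
    arm: str              # experiment arm, e.g. "control" or "high_temperature"
    turns: int            # number of user/assistant exchanges
    task_completed: bool  # did the simulated user reach its goal?
    user_followups: int   # clarification requests issued by the simulated user

def response_quality_index(log: ConversationLog) -> float:
    """Toy behavioral metric: reward task completion, penalize long, clarification-heavy conversations."""
    score = 1.0 if log.task_completed else 0.0
    score -= 0.05 * log.user_followups
    score -= 0.02 * max(log.turns - 4, 0)
    return max(score, 0.0)

print(response_quality_index(ConversationLog("persona_busy_analyst", "control", 6, True, 1)))  # 0.91
```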
Run an experiment:

    python -m scripts.run_experiment --experiment-config config/experiment.yaml --personas-config config/personas.yaml --tasks-config config/tasks.yaml

Compute metrics:

    python -m scripts.compute_metrics --experiment-config config/experiment.yaml

Analyze experiment results:

    python -m scripts.analyze_experiment --experiment-config config/experiment.yaml --sample-per-arm 10
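
An experiment is described by the YAML files passed above. Their exact schema is defined by this repository; the snippet below is only an illustrative guess at what a two-arm experiment config might contain (every key, model name, and value is invented), loaded with PyYAML the way a script might read config/experiment.yaml.

```python
import yaml

# Hypothetical shape of config/experiment.yaml -- the real schema is defined by the repo;
# these keys and values are invented purely to show how arms and sample sizes might be declared.
EXAMPLE_EXPERIMENT_YAML = """
experiment_name: prompt_strategy_v1
metric: rqi
arms:
  - name: control
    model: example-model-small
    temperature: 0.7
    system_prompt: baseline
  - name: treatment
    model: example-model-small
    temperature: 0.7
    system_prompt: chain_of_thought
conversations_per_arm: 200
"""

config = yaml.safe_load(EXAMPLE_EXPERIMENT_YAML)
for arm in config["arms"]:
    print(f"{arm['name']}: prompt={arm['system_prompt']}, temperature={arm['temperature']}")
```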

Repository layout:

    config/
    scripts/
    src/
    logs/
    results/
    docs/
Experiments estimate treatment effects on a Response Quality Index (RQI) and report confidence intervals. Multiple experiment suites evaluate prompt strategies, temperature settings, and model-scaling effects.
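
As a rough illustration of how a confidence interval on an RQI lift can be reported, the sketch below compares synthetic per-conversation scores for two arms using a standard two-sample normal approximation; the analysis scripts may use a different interval construction.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic per-conversation RQI scores for two arms (values are made up).
rqi_control = rng.normal(0.62, 0.12, size=200)
rqi_treatment = rng.normal(0.66, 0.12, size=200)

# Difference in means with a two-sample normal-approximation 95% CI.
diff = rqi_treatment.mean() - rqi_control.mean()
se = np.sqrt(rqi_treatment.var(ddof=1) / rqi_treatment.size +
             rqi_control.var(ddof=1) / rqi_control.size)
low, high = diff - 1.96 * se, diff + 1.96 * se
print(f"estimated RQI lift: {diff:.3f}  95% CI: [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the treatment arm's RQI lift is statistically distinguishable from no effect at the 95% level.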
