LLM Response Quality Experimentation Platform

Overview

This repository implements a structured experimentation framework for evaluating Large Language Model (LLM) response quality using simulated users, behavioral metrics, and causal inference methods.

The system simulates user conversations with different LLM configurations and estimates treatment effects on response quality with experimentation techniques common in large-scale product analytics.

Key Contributions

  • Simulated user agents generating conversational interaction data
  • Structured A/B testing framework for LLM responses
  • Behavioral metrics derived from conversation outcomes
  • Causal inference using Average Treatment Effect (ATE) and CUPED variance reduction (see the sketch after this list)
  • Reproducible experiment suites for prompt strategies, temperature tuning, and model scaling
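
The CUPED step can be summarized in a few lines. The sketch below is illustrative only and assumes a pre-experiment covariate (for example, a baseline quality score per simulated user) is available; names such as cuped_adjust are hypothetical and not taken from this codebase.

import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """CUPED-adjusted metric values (illustrative sketch, not the repo's implementation).

    y: post-treatment metric per conversation (e.g., RQI)
    x: pre-experiment covariate correlated with y (e.g., baseline RQI)
    """
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)  # regression coefficient on the covariate
    return y - theta * (x - x.mean())               # same mean as y, lower variance

The adjusted values keep the same expectation as the raw metric, so the ATE estimate stays unbiased while its variance shrinks roughly in proportion to the squared correlation between y and x.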

High-Level System Flow

Simulated Users → Conversation Engine → LLM Models → Conversation Logs → Metrics Engine → Experiment Analysis → Leaderboards
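
A hedged sketch of what one arm of this flow could look like in Python; the names below (SimulatedUser, run_experiment_arm, llm_call) are illustrative placeholders rather than the repository's actual modules.

from dataclasses import dataclass

@dataclass
class SimulatedUser:
    persona: str   # e.g., an entry from config/personas.yaml
    task: str      # e.g., an entry from config/tasks.yaml

def run_experiment_arm(users, llm_config, llm_call):
    """Run each simulated user through one conversation and collect logs."""
    logs = []
    for user in users:
        prompt = f"You are talking to a {user.persona}. Task: {user.task}"
        response = llm_call(prompt, **llm_config)  # llm_config: e.g., model, temperature, prompt strategy
        logs.append({"persona": user.persona, "task": user.task, "response": response})
    return logs  # downstream: metrics engine -> experiment analysis -> leaderboards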

Quickstart

Run an experiment:

python -m scripts.run_experiment \
  --experiment-config config/experiment.yaml \
  --personas-config config/personas.yaml \
  --tasks-config config/tasks.yaml

Compute metrics:

python -m scripts.compute_metrics --experiment-config config/experiment.yaml

Analyze experiment results:

python -m scripts.analyze_experiment \
  --experiment-config config/experiment.yaml \
  --sample-per-arm 10

Repository Structure

config/
scripts/
src/
logs/
results/
docs/

Results Preview

Experiments estimate treatment effects on the Response Quality Index (RQI) and report confidence intervals. Multiple experiment suites evaluate prompt strategies, temperature settings, and model-scaling effects.
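
As a rough illustration only (not the repository's analysis code), a difference-in-means ATE on RQI with a normal-approximation confidence interval could be computed as follows; applying it to the CUPED-adjusted metric sketched earlier would tighten the interval.

import numpy as np

def ate_with_ci(rqi_treatment: np.ndarray, rqi_control: np.ndarray, z: float = 1.96):
    """Difference-in-means ATE on RQI with an approximate 95% confidence interval."""
    ate = rqi_treatment.mean() - rqi_control.mean()
    se = np.sqrt(rqi_treatment.var(ddof=1) / len(rqi_treatment)
                 + rqi_control.var(ddof=1) / len(rqi_control))
    return ate, (ate - z * se, ate + z * se)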
