LLM Evaluation Framework
ELO ratings • Arena battles • Benchmark testing • Cost tracking
Documentation • Live Leaderboard • Arena • Frame.dev
promptmachine-eval is a Python toolkit for evaluating and comparing Large Language Models. Built by Frame.dev as part of PromptMachine.
- ELO Rating System – Chess-style ratings for fair LLM comparisons
- Arena Battles – Head-to-head comparisons with LLM-as-judge
- Benchmarks – Run standard evals (MMLU, GSM8K, HumanEval)
- Smart Matchmaking – Monte Carlo sampling for informative pairings
- Cost Tracking – Real-time token counting and spend estimation
- Reports – Generate Markdown evaluation reports
Install from PyPI:

```bash
pip install promptmachine-eval
```

For development:
```bash
pip install "promptmachine-eval[dev]"
```

Quick start:

```bash
# Set your API keys
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
# Test a prompt across models
pm-eval test "Explain quantum computing simply" \
  --models gpt-4o-mini,claude-3-5-haiku
# Run a head-to-head battle
pm-eval battle "Write a haiku about coding" \
  -a gpt-4o -b claude-3-5-sonnet
# Estimate costs before running
pm-eval cost "Your long prompt..." \
  --models gpt-4o,gpt-4o-mini,claude-3-5-sonnet
# List all supported models and pricing
pm-eval models
```

You can also drive everything from the Python API:

```python
import asyncio
from promptmachine_eval import EloCalculator, BattleRunner, PromptTester
# --- ELO Calculations ---
elo = EloCalculator()
# Calculate rating changes after a battle
new_a, new_b = elo.update_ratings(
    rating_a=1200,
    rating_b=1000,
    score_a=1.0,  # A wins
)
print(f"New ratings: A={new_a:.0f}, B={new_b:.0f}")
# --- Run Arena Battle ---
runner = BattleRunner(
    openai_api_key="sk-...",
    anthropic_api_key="sk-ant-...",
)
result = asyncio.run(runner.battle(
    prompt="Write a function to reverse a linked list",
    model_a="gpt-4o",
    model_b="claude-3-5-sonnet",
    judge_model="gpt-4o-mini",
))
print(f"Winner: {result.winner}")
print(f"Reasoning: {result.judgement.reasoning}")
print(f"Cost: ${result.total_cost:.4f}")
# --- Test Multiple Models ---
tester = PromptTester(openai_api_key="sk-...")
results = asyncio.run(tester.test(
    prompt="Explain recursion to a beginner",
    models=["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"],
))
for r in results:
    print(f"{r.model}: {r.latency_ms}ms, ${r.cost:.4f}")
```
Select optimal battle pairings using Monte Carlo simulation:

```python
from promptmachine_eval import MatchmakingService, ModelInfo

service = MatchmakingService()
models = [
    ModelInfo(id="gpt4o", rating=1200, sd=100, battles_count=50),
    ModelInfo(id="claude", rating=1180, sd=120, battles_count=40),
    ModelInfo(id="gemini", rating=1100, sd=200, battles_count=10),
]

# Get optimal pairing (balances competitiveness + uncertainty)
model_a, model_b = service.select_pair_for_battle(models)
print(f"Recommended battle: {model_a.id} vs {model_b.id}")Create promptmachine.yaml in your project:
Create `promptmachine.yaml` in your project:

```yaml
version: 1

default_models:
  - gpt-4o-mini
  - claude-3-5-haiku

battle:
  judge_model: gpt-4o-mini
  temperature: 0.7

elo:
  k_factor: 32
  initial_rating: 1000

limits:
  max_cost_per_test: 0.10
  daily_budget: 5.00
```
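If you want to sanity-check a config like this from your own scripts, here is a minimal sketch using plain PyYAML. How promptmachine-eval itself loads the file is not shown here; the key names simply mirror the example above:

```python
import yaml  # pip install pyyaml

# Plain PyYAML read of the keys shown above -- not promptmachine-eval's
# own config loader, just a budget sanity check you could run yourself.
with open("promptmachine.yaml") as f:
    cfg = yaml.safe_load(f)

per_test_cap = cfg["limits"]["max_cost_per_test"]  # 0.10
daily_budget = cfg["limits"]["daily_budget"]       # 5.00

estimated = 0.25  # e.g. taken from `pm-eval cost ...` output
if estimated > per_test_cap:
    raise SystemExit(f"Estimated ${estimated:.2f} exceeds the ${per_test_cap:.2f} per-test cap")
```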
Or use environment variables:

```bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export OPENROUTER_API_KEY=sk-or-...
```
| Provider | Models |
|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo, o1-preview, o1-mini |
| Anthropic | claude-3-5-sonnet, claude-3-5-haiku, claude-3-opus |
| OpenRouter | gemini-pro-1.5, llama-3.1-70b, mistral-large, deepseek-coder, qwen-max, + more |
Full pricing table:

| Model | Input ($/1K tokens) | Output ($/1K tokens) |
|---|---|---|
| gpt-4o | $0.0025 | $0.01 |
| gpt-4o-mini | $0.00015 | $0.0006 |
| claude-3-5-sonnet | $0.003 | $0.015 |
| claude-3-5-haiku | $0.001 | $0.005 |
| gemini-pro-1.5 | $0.00125 | $0.005 |
| llama-3.1-70b | $0.00052 | $0.00075 |
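As a worked example of how these rates turn into a per-call cost (the token counts are invented for illustration; the prices are the gpt-4o row above):

```python
# Per-call cost from the pricing table above. Token counts are made up
# for illustration; the rates are the gpt-4o row ($/1K tokens).
input_rate, output_rate = 0.0025, 0.01
input_tokens, output_tokens = 1200, 350

cost = (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate
print(f"${cost:.4f}")  # $0.0065
```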
We use a modified ELO system inspired by Chatbot Arena:
```python
from promptmachine_eval import EloCalculator, EloConfig

# Custom configuration
config = EloConfig(
    k_factor=32,          # Rating volatility (higher = more change)
    initial_rating=1000,
    initial_sd=350,       # Uncertainty (decreases with more battles)
)
elo = EloCalculator(config)
# Expected win probability
prob = elo.expected_score(1200, 1000)
print(f"1200-rated has {prob:.1%} chance vs 1000-rated")
# Output: 1200-rated has 76.0% chance vs 1000-rated
# With uncertainty (Monte Carlo)
prob = elo.win_probability(1200, 1000, sd_a=100, sd_b=200)
```
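Here is one plausible reading of what `win_probability` does with those `sd` values, sketched as an assumption rather than the library's actual code: sample "true" ratings from each model's uncertainty and average the Elo expectation over the draws.

```python
import random

def expected(ra: float, rb: float) -> float:
    # Standard Elo expectation (same formula as expected_score above).
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def win_probability_mc(ra, rb, sd_a, sd_b, n: int = 10_000) -> float:
    # Assumed Monte Carlo scheme, NOT the library's implementation:
    # draw plausible true ratings and average the expectation.
    draws = (expected(random.gauss(ra, sd_a), random.gauss(rb, sd_b)) for _ in range(n))
    return sum(draws) / n

# Uncertainty pulls the 76.0% point estimate back toward 50%:
print(f"{win_probability_mc(1200, 1000, 100, 200):.1%}")  # ~71%
```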
- Full Documentation
- API Reference
- Examples
- Live Leaderboard
We welcome contributions! See CONTRIBUTING.md for guidelines.
```bash
# Clone the repo
git clone https://github.com/framersai/promptmachine-eval.git
cd promptmachine-eval
# Install for development
pip install -e ".[dev]"
# Run tests
pytest
# Run linting
ruff check .
black --check .
mypy src/
```

MIT License – see LICENSE for details.
PromptMachine • Frame.dev • GitHub • Twitter

Built with ❤️ by Frame.dev
Questions? team@frame.dev