A benchmark harness for testing Claude Code plugins against the OOLONG long-context reasoning benchmark. Designed specifically for A/B testing the rlm-rs memory plugin against baseline truncation strategies.
OOLONG-Pairs evaluates long-context reasoning capabilities by presenting tasks that require understanding large documents (100k+ characters). This harness compares two strategies:
- Truncation Strategy: Naive context truncation to fit within window limits
- RLM-RS Strategy: Recursive Language Model chunking via the rlm-rs plugin
```bash
# Clone the repository
git clone https://github.com/zircote/oolong-pairs.git
cd oolong-pairs

# Install with uv
uv sync

# Or with pip
pip install -e .
```

Requirements:

- Python 3.11+
- Claude CLI installed and configured (`claude --version`)
- For the RLM-RS strategy: rlm-rs installed (`cargo install rlm-rs`)
Run benchmarks programmatically using the Claude CLI:

```bash
# Run with truncation strategy
oolong-pairs run --strategy truncation --limit 10

# Run with RLM-RS chunking strategy
oolong-pairs run --strategy rlm_rs --limit 10

# Filter by dataset and context length
oolong-pairs run --strategy rlm_rs --dataset trec_coarse --min-context 100000
```

For integration with Claude Code sessions using hooks:
```bash
# Set environment variables
export OOLONG_STATE_DIR=/tmp/oolong-pairs
export OOLONG_DB_PATH=data/benchmark.db

# Run via Python orchestrator
python -c "
from oolong_pairs.orchestrator import HooksOrchestrator
from oolong_pairs.models import Strategy
from pathlib import Path

orch = HooksOrchestrator(
    strategy=Strategy.RLM_RS,
    db_path=Path('data/benchmark.db'),
)
run_id = orch.run_benchmark(limit=5)
print(f'Completed run: {run_id}')
"
```

```bash
# Show results for a specific run
oolong-pairs show <run_id>

# List recent runs
oolong-pairs list-runs

# Compare two runs
oolong-pairs compare <run_id_1> <run_id_2>

# Export results
oolong-pairs export <run_id> results.json --format json

# View dataset statistics
oolong-pairs stats --dataset trec_coarse
```

| Command | Description |
|---|---|
| `run` | Execute benchmark with specified strategy |
| `show` | Display results for a benchmark run |
| `compare` | Compare two benchmark runs side-by-side |
| `list-runs` | List recent benchmark runs |
| `export` | Export results to JSON, JSONL, or CSV |
| `stats` | Show dataset statistics |
| Option | Default | Description |
|---|---|---|
| `--strategy` | Required | `truncation` or `rlm_rs` |
| `--mode` | `sdk` | Execution mode: `sdk` or `hooks` |
| `--limit` | None | Maximum tasks to run |
| `--min-context` | 100,000 | Minimum context length in characters |
| `--dataset` | `trec_coarse` | Dataset filter |
| `--db` | `data/benchmark.db` | Database path |
Scoring follows the OOLONG paper methodology:

- Numeric answers: `score = 0.75^|error|`, where `error` is the absolute difference between the predicted and expected values (see the sketch below)
- Label answers: exact match (case-insensitive)
- Comparison answers: semantic match for more/less/same variants
- Date answers: exact match
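A minimal sketch of the numeric and label rules (function names here are illustrative; the project's `scoring.py` may organize this differently):

```python
def score_numeric(predicted: float, expected: float) -> float:
    """OOLONG-style numeric scoring: 0.75 ** |predicted - expected|."""
    return 0.75 ** abs(predicted - expected)


def score_label(predicted: str, expected: str) -> float:
    """Case-insensitive exact match for label answers."""
    return 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0


# An answer off by 2 scores 0.75^2 = 0.5625; an exact numeric answer scores 1.0.
assert abs(score_numeric(7, 5) - 0.5625) < 1e-9
assert score_numeric(5, 5) == 1.0
```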
```
oolong-pairs/
├── src/oolong_pairs/
│   ├── cli.py           # Click CLI interface
│   ├── dataset.py       # HuggingFace dataset loading
│   ├── models.py        # Pydantic data models
│   ├── orchestrator.py  # Hooks mode orchestration
│   ├── scoring.py       # Answer scoring logic
│   ├── storage.py       # SQLite persistence
│   └── strategies.py    # Execution strategies
├── hooks/
│   ├── hooks.json       # Claude Code hook configuration
│   ├── session_start.py # Injects benchmark context
│   └── stop.py          # Captures and scores answers
└── tests/
    └── test_scoring.py  # Scoring logic tests
```
Truncates the context to fit within the context window (default 180k characters):

- Keeps the first 60% and the last 40% of the character budget (document head and tail)
- Serves as a simple baseline for comparison
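A minimal sketch of that head/tail split, assuming the 60/40 ratio applies to the character budget (the actual `strategies.py` implementation may differ):

```python
def truncate_context(text: str, budget: int = 180_000) -> str:
    """Keep the document's head and tail, splitting the character budget 60/40."""
    if len(text) <= budget:
        return text
    head = int(budget * 0.6)   # first 60% of the budget from the start
    tail = budget - head       # remaining 40% from the end
    return text[:head] + "\n...[truncated]...\n" + text[-tail:]
```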
Uses the Recursive Language Model pattern (sketched below):

- Chunks the document with the rlm-rs semantic chunker
- Processes each chunk with Haiku (the subcall model)
- Synthesizes the per-chunk findings with Sonnet (the main model)
- Returns the final answer
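An illustrative outline of that flow; the helper callables stand in for the real chunker and model calls, whose actual interfaces are not shown here:

```python
from typing import Callable


def recursive_answer(
    question: str,
    document: str,
    chunk: Callable[[str], list[str]],            # e.g. the rlm-rs semantic chunker (assumed interface)
    subcall: Callable[[str], str],                # e.g. one Haiku call per chunk (assumed interface)
    synthesize: Callable[[str, list[str]], str],  # e.g. one Sonnet call over the findings
) -> str:
    """Chunk the document, query each chunk, then synthesize a final answer."""
    chunks = chunk(document)
    findings = [subcall(f"Question: {question}\n\nChunk:\n{c}") for c in chunks]
    return synthesize(question, findings)
```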
```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=oolong_pairs
```

Results are stored in SQLite with two tables:

- `runs`: Benchmark run metadata (id, timestamp, strategy, mode, stats)
- `results`: Individual task results (task_id, score, latency, answer, error)
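The database can also be inspected directly; a minimal sketch, assuming `results` carries a `run_id` foreign key back to `runs` (not listed among the columns above):

```python
import sqlite3

# Mean score per run, computed straight from the benchmark database.
# Assumes results.run_id references runs.id.
QUERY = """
SELECT runs.id, runs.strategy, AVG(results.score) AS mean_score
FROM runs
JOIN results ON results.run_id = runs.id
GROUP BY runs.id, runs.strategy
"""

conn = sqlite3.connect("data/benchmark.db")
for run_id, strategy, mean_score in conn.execute(QUERY):
    print(f"{run_id} ({strategy}): {mean_score:.3f}")
conn.close()
```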
MIT