The open benchmark for AI coding agents — compare resolution rates, cost, and speed on real-world GitHub issues from SWE-bench Verified.
Benchmark any coding agent (Claude Code, Codex, Cursor, Augment, Windsurf, OpenHands, and more) on a curated 100-task subset of SWE-bench Verified. Captures pass@1 resolution rates, cost per task, duration, and token usage.
Default configuration: Claude Code + vexp — context-aware code intelligence that delivers the highest resolution rate at the lowest cost per task.
Evaluated on a 100-task subset of SWE-bench Verified. All agents use Claude Opus 4.5 for a fair, apples-to-apples comparison.
| Agent | Pass@1 | $/task | Unique Wins |
|---|---|---|---|
| vexp + Claude Code | 73.0% | $0.67 | 7–10 |
| Live-SWE-Agent | 72.0% | $0.86 | — |
| OpenHands | 70.0% | $1.77 | — |
| Sonar Foundation | 70.0% | $1.98 | — |
vexp resolves more issues at the lowest cost per task — 22% cheaper than the next best agent.
Generate comparison charts: `node dist/cli.js compare results/swebench-2026-03-22.jsonl`
External resolution data sourced from swe-bench/experiments. Cost data sourced from each agent's published benchmarks (see data sources below).
```
git clone https://github.com/Vexp-ai/vexp-swe-bench.git
cd vexp-swe-bench

# One-command setup (Python >= 3.10, Node >= 18, Git required)
./setup.sh

# Run the benchmark
source .venv/bin/activate
node dist/cli.js run
```

The setup script handles Node dependencies, the Python venv, pip packages, the SWE-bench Verified dataset download, 100-task subset generation, and the TypeScript build.
Note: a vexp Pro or Team plan is required to run with vexp enabled. The CLI will prompt you to activate a license on first run. Use code BENCHMARK at vexp.dev/#pricing for 14 days of Pro, free.
- Node.js >= 18
- Python >= 3.10 (required by `swebench` evaluation)
- Git
- Docker (for accurate test evaluation)
- A coding agent CLI (default: Claude Code)
- vexp Pro or Team plan (auto-detected; free 14-day trial with code `BENCHMARK`)
To run without vexp:

```
node dist/cli.js run --no-vexp
```

```
node dist/cli.js run [options]

Options:
  --model <model>      Model to use (default: "claude-opus-4-5-20251101")
  --agent <name>       Agent adapter (default: "claude-code")
  --instances <ids>    Comma-separated instance IDs, or "*" for all
  --data <jsonl>       Custom JSONL path (default: bundled 100-task subset)
  --max-turns <n>      Max agentic turns per instance (default: 250)
  --cost-limit <usd>   Max cost per instance in USD, 0 = unlimited (default: 3)
  --timeout <s>        Per-command timeout in seconds, 0 = none (default: 0)
  --no-vexp            Run without vexp enhancement
  --output <dir>       Output directory (default: results/)
  --resume <jsonl>     Resume from an interrupted run (skips completed instances)
  --dry-run            Preview without executing
```

The defaults are aligned with mini-SWE-agent v2: 250 turns, a $3/task cost limit, and no global timeout.
```
node dist/cli.js evaluate results/swebench-2026-03-22.jsonl \
  --dataset swebench-verified-full.jsonl

Options:
  --mode <mode>       docker or lightweight (default: docker)
  --dataset <jsonl>   Full SWE-bench Verified JSONL (required for Docker eval)
  --timeout <s>       Per-instance eval timeout (default: 300)
```

Docker mode runs the actual Python test suite for each instance; lightweight mode only checks whether the patch is non-empty.
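As a rough illustration (a hypothetical sketch, not the harness's actual evaluator code), lightweight mode's pass criterion can be reduced to a non-empty-patch check:

```typescript
// Hypothetical sketch of the lightweight evaluation check: an instance
// counts as "resolved" only if the agent produced a non-trivial git diff.
// The ResultEntry shape here is an assumption for illustration.
interface ResultEntry {
  instanceId: string;
  modelPatch: string | null;
}

function lightweightResolved(entry: ResultEntry): boolean {
  // Treat missing or whitespace-only patches as unresolved.
  return entry.modelPatch !== null && entry.modelPatch.trim().length > 0;
}
```

This is why lightweight mode is only a cheap smoke test: it confirms the agent edited something, not that the edit fixes the issue.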
```
node dist/cli.js compare results/swebench-2026-03-22.jsonl

Options:
  --baseline <jsonl>   Baseline results JSONL (no-vexp run)
  --output <dir>       Output directory (default: plots/)
  --dpi <n>            Chart resolution (default: 150)
```

This automatically loads all external agent data from `data/external/` and generates comparison charts. Only agents using Claude Opus 4.5 are included, for a fair comparison.
| Agent | Date | Resolution data | Cost data |
|---|---|---|---|
| OpenHands | Nov 2025 | swe-bench/experiments | OpenHands Index |
| Live-SWE-Agent | Dec 2025 | swe-bench/experiments | Live-SWE-Agent Leaderboard |
| Sonar Foundation | Dec 2025 | swe-bench/experiments | GitHub README |
```
node dist/cli.js list
```

If the benchmark is interrupted (network error, crash, etc.), resume without losing progress:

```
node dist/cli.js run --resume results/swebench-2026-03-22.jsonl
```

This reads the existing JSONL, skips completed instances, and appends new results to the same file.
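Conceptually, the skip step boils down to collecting the instance IDs already present in the results file. A minimal sketch (a hypothetical helper, not the harness source):

```typescript
// Illustrative sketch: gather instance IDs already recorded in a results
// JSONL so a resumed run can skip them. Assumes each line is one JSON
// object with an "instanceId" field, as in the results format.
function completedInstanceIds(jsonlText: string): Set<string> {
  const done = new Set<string>();
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue;      // ignore blank lines
    const entry = JSON.parse(line);  // one JSON object per line
    if (typeof entry.instanceId === "string") done.add(entry.instanceId);
  }
  return done;
}
```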
This harness is agent-agnostic — bring your own coding agent and benchmark it on the same tasks.
- Cursor — AI-first code editor
- Augment Code — AI coding assistant
- Codex CLI — OpenAI's coding agent
- Gemini CLI — Google's coding agent
- Your own agent — any CLI that can edit code
- Create `src/agents/your-agent.ts` implementing the `AgentAdapter` interface
- Register it in `src/agents/registry.ts`
- Run: `node dist/cli.js run --agent your-agent --no-vexp`
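The real `AgentAdapter` contract lives in `src/agents/adapter.ts`; the shape below is a guess for illustration only, so check the source for the actual field and method names:

```typescript
// Hypothetical adapter skeleton. The real AgentAdapter interface in
// src/agents/adapter.ts may differ; every name here is an assumption.
interface TaskInstance {
  instanceId: string;
  repoPath: string;         // local checkout the agent should edit
  problemStatement: string; // issue text from SWE-bench
}

interface AgentResult {
  modelPatch: string; // git diff produced by the agent
  costUsd: number;
  numTurns: number;
}

interface AgentAdapter {
  name: string;
  run(task: TaskInstance): Promise<AgentResult>;
}

// Minimal stub adapter that produces an empty patch.
const myAgent: AgentAdapter = {
  name: "my-agent",
  async run(_task) {
    return { modelPatch: "", costUsd: 0, numTurns: 0 };
  },
};
```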
See docs/CONTRIBUTING.md for a step-by-step guide with code templates.
Run the benchmark, then open a PR with:
- Your adapter code
- Results JSONL
- Comparison: `node dist/cli.js compare your-results.jsonl`
We'll add your agent to the leaderboard.
```
src/
├── cli.ts                  # CLI entry point
├── types.ts                # Shared type definitions
├── agents/
│   ├── adapter.ts          # AgentAdapter interface
│   ├── claude-code.ts      # Claude Code adapter (with real-time cost limit)
│   └── registry.ts         # Agent lookup
├── harness/
│   ├── orchestrator.ts     # Main benchmark loop (indexes once per repo)
│   ├── loader.ts           # Load instances from JSONL
│   └── repo.ts             # Git operations (clone, reset, patch capture, cleanup)
├── vexp/
│   ├── ensure.ts           # Auto-detect/install vexp, license check, npm version check
│   ├── enhancer.ts         # vexp setup per repo (CLAUDE.md, hooks, MCP config, index)
│   ├── daemon.ts           # vexp daemon lifecycle (process group cleanup)
│   └── metrics.ts          # vexp metrics collection from SQLite
├── evaluate/
│   └── evaluator.ts        # Docker + lightweight evaluation
├── metrics/
│   ├── stream-parser.ts    # Parse Claude stream-json output
│   └── pricing.ts          # Model pricing table
├── compare/
│   └── compare.ts          # Multi-agent comparison + terminal table
└── ui/
    ├── banner.ts           # CLI branding + promo box
    └── progress.ts         # Progress bar (X patched, ETA)
```
The 100-task subset is selected via stratified sampling from the full 500-task SWE-bench Verified dataset:
- 100% repository coverage — all 12 repositories represented proportionally
- Complexity-aligned — the subset's median complexity (22) closely tracks the full dataset's median (23)
- Outlier filtering — complexity ceiling ≤ 250 removes extreme outliers (~1% of instances)
- Statistical power — 100 instances provide a ±8.7% margin of error at 95% confidence
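The ±8.7% figure is consistent with the standard margin of error for a proportion at 95% confidence, including the finite-population correction for sampling 100 of 500 tasks without replacement (worst-case p = 0.5). A small sketch of that arithmetic:

```typescript
// Margin of error for a proportion at 95% confidence (z = 1.96), with the
// finite-population correction for sampling n of N tasks without replacement.
function marginOfError(n: number, N: number, p = 0.5, z = 1.96): number {
  const se = Math.sqrt((p * (1 - p)) / n);  // standard error of the proportion
  const fpc = Math.sqrt((N - n) / (N - 1)); // finite-population correction
  return z * se * fpc;
}
```

With n = 100 and N = 500 this gives roughly 0.0877, i.e. about ±8.7 percentage points.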
Comparison with external agents uses their per-instance resolution data on the exact same 100 tasks — no extrapolation needed.
See docs/TASK_SELECTION.md for the full methodology and repository distribution table.
Results are written as JSONL (one JSON object per line). Each entry includes:
| Field | Description |
|---|---|
| `instanceId` | SWE-bench instance identifier |
| `repo` | GitHub repository |
| `model` | Model used |
| `agent` | Agent adapter name |
| `inputTokens` / `outputTokens` | Token usage |
| `costUsd` | Estimated cost for this instance |
| `numTurns` | Agentic turns taken |
| `durationMs` | Wall-clock time |
| `modelPatch` | Git diff produced by the agent |
| `resolved` | `true` / `false` / `null` (unevaluated) |
| `vexpMetrics` | vexp-specific metrics (`null` if `--no-vexp`) |
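For consumers of the results file, the fields above can be mirrored as a TypeScript type; the exact optionality below is an assumption for illustration, not the harness's own `types.ts`:

```typescript
// Hypothetical type mirroring the documented result fields; the real
// definitions live in src/types.ts and may differ.
interface BenchmarkResult {
  instanceId: string;
  repo: string;
  model: string;
  agent: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  numTurns: number;
  durationMs: number;
  modelPatch: string;
  resolved: boolean | null; // null = unevaluated
  vexpMetrics: Record<string, unknown> | null;
}

// Parse one JSONL line into the typed shape.
function parseResultLine(line: string): BenchmarkResult {
  return JSON.parse(line) as BenchmarkResult;
}
```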
vexp is a context-aware code intelligence layer for AI coding agents. It pre-indexes your codebase into a semantic graph, then delivers precisely ranked context in a single MCP call.
Why it matters for benchmarks:
- Fewer agentic turns → lower cost per task
- Better context → higher resolution rate
- Works with any agent that supports MCP
In this benchmark, vexp-augmented Claude Code achieves 73% pass@1 at $0.67/task — the best cost-efficiency among all tested agents.
Run the benchmark on your coding agent in under 10 minutes:
```
git clone https://github.com/nicobailon/vexp-swe-bench.git
cd vexp-swe-bench && ./setup.sh
source .venv/bin/activate
node dist/cli.js run
```

Use code BENCHMARK at vexp.dev/#pricing for 14 days of vexp Pro, free.
MIT — see LICENSE.
If this benchmark is useful, please star the repo — it helps others find it.
Built by vexp.dev