diff --git a/.claude/skills/acp-harness/assets/docker-compose.acp.yml b/.claude/skills/acp-harness/assets/docker-compose.acp.yml deleted file mode 100644 index bc867dc..0000000 --- a/.claude/skills/acp-harness/assets/docker-compose.acp.yml +++ /dev/null @@ -1,19 +0,0 @@ -# ACP Harness Docker Compose Configuration -# -# Example docker-compose for running ACP evaluations. -# Copy this to your project and customize as needed. -# -# Usage: -# ANTHROPIC_API_KEY=sk-... docker compose -f docker-compose.acp.yml run --rm acp-harness - -services: - acp-harness: - build: - context: . - dockerfile: Dockerfile.acp - environment: - - ANTHROPIC_API_KEY - volumes: - # Mount output directory to persist results - - ./results:/app/results - command: ["bunx", "@plaited/acp-harness", "prompts.jsonl", "-o", "results/output.jsonl"] diff --git a/.claude/skills/acp-harness/SKILL.md b/.claude/skills/agent-eval-harness/SKILL.md similarity index 62% rename from .claude/skills/acp-harness/SKILL.md rename to .claude/skills/agent-eval-harness/SKILL.md index 053d7d2..c9074cf 100644 --- a/.claude/skills/acp-harness/SKILL.md +++ b/.claude/skills/agent-eval-harness/SKILL.md @@ -1,20 +1,20 @@ --- -name: acp-harness -description: CLI tool for capturing agent trajectories. Execute prompts against ACP-compatible agents, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. +name: agent-eval-harness +description: CLI tool for capturing agent trajectories. Execute prompts against headless CLI agents via schema-driven adapters, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. compatibility: Bun >= 1.2.9 --- -# ACP Harness +# Agent Eval Harness ## Purpose -CLI tool for capturing trajectories from ACP-compatible agents, optimized for TypeScript/JavaScript projects using Bun. +CLI tool for capturing trajectories from headless CLI agents, optimized for TypeScript/JavaScript projects using Bun. 
**The harness captures. You score.** | Harness Provides | You Provide | |------------------|-------------| -| Prompt execution against ACP agents | Scoring logic (Braintrust, custom scripts) | +| Prompt execution via headless adapters | Scoring logic (Braintrust, custom scripts) | | Full trajectory capture (thoughts, tools, plans) | Pass/fail determination via graders | | Structured JSONL output | LLM-as-judge prompts | | Reproducible execution environment | CI integration, golden file comparison | @@ -29,10 +29,10 @@ CLI tool for capturing trajectories from ACP-compatible agents, optimized for Ty ```bash # Run without installing (recommended) -bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl +bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl # Or install as project dependency -bun add @plaited/acp-harness +bun add @plaited/agent-eval-harness ``` ## Core Principle: Capture Once, Derive Many Views @@ -40,7 +40,7 @@ bun add @plaited/acp-harness ```mermaid flowchart LR Prompts["prompts.jsonl"] --> Capture["capture/trials"] - Agent["ACP Agent"] --> Capture + Schema["headless schema"] --> Capture Capture --> Results["results.jsonl (full trajectory)"] Results --> Summarize["summarize"] Results --> Calibrate["calibrate"] @@ -56,16 +56,28 @@ flowchart LR ## Commands +### Core Commands + | Command | Input | Output | Purpose | |---------|-------|--------|---------| -| `capture` | prompts.jsonl + agent | results.jsonl | Trajectory capture (full) | -| `trials` | prompts.jsonl + agent | trials.jsonl | Multi-run + optional metrics | +| `capture` | prompts.jsonl + schema | results.jsonl | Trajectory capture (full) | +| `trials` | prompts.jsonl + schema | trials.jsonl | Multi-run + optional metrics | | `summarize` | results.jsonl | summary.jsonl or .md | Derive compact views | | `calibrate` | results.jsonl | calibration.md | Sample failures for review | | `validate-refs` | prompts.jsonl | 
validation.jsonl | Check reference solutions | | `balance` | prompts.jsonl | balance.json | Analyze test set coverage | | `schemas` | (none) | JSON Schema | Export schemas for non-TS users | +### Pipeline Commands (Unix-style) + +| Command | Input | Output | Purpose | +|---------|-------|--------|---------| +| `run` | prompts.jsonl + schema | raw.jsonl | Execute prompts, raw output | +| `extract` | raw.jsonl + schema | extracted.jsonl | Parse trajectories | +| `grade` | extracted.jsonl + grader | graded.jsonl | Apply grader scoring | +| `format` | results.jsonl | jsonl/markdown/csv | Convert output format | +| `compare` | multiple results.jsonl | comparison.jsonl | Compare multiple runs | + All commands support optional `--grader ./grader.ts` for scoring. ## Capture Command @@ -73,7 +85,7 @@ All commands support optional `--grader ./grader.ts` for scoring. ### Basic Usage ```bash -bunx @plaited/acp-harness capture <prompts.jsonl> <command> [args...] [options] +bunx @plaited/agent-eval-harness capture <prompts.jsonl> --schema <schema.json> [options] ``` ### Arguments @@ -81,25 +93,26 @@ bunx @plaited/acp-harness capture [args...] 
[options] | Argument/Flag | Description | Default | |------|-------------|---------| | `prompts.jsonl` | Input file with prompts to execute | Required | -| `command [args]` | ACP agent command (e.g., `bunx claude-code-acp`) | Required | +| `-s, --schema` | Path to headless adapter schema | Required | | `-o, --output` | Output file/path | stdout | -| `-c, --cwd` | Working directory for agent (agents auto-discover MCP configs from here) | current | +| `-c, --cwd` | Working directory for agent | current | | `-t, --timeout` | Request timeout in ms | `60000` | | `--progress` | Show progress to stderr | false | | `--append` | Append to output file | false | | `-g, --grader` | Path to grader module | none | +| `--debug` | Show detailed CLI output for debugging | false | ### Examples ```bash # Basic capture -bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl +bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl # Using a local adapter script -bunx @plaited/acp-harness capture prompts.jsonl bun ./my-adapter.ts -o results.jsonl +bunx @plaited/agent-eval-harness capture prompts.jsonl bun ./my-adapter.ts -o results.jsonl # With grader (adds score to each result) -bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.ts -o results.jsonl +bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.ts -o results.jsonl ``` ## Trials Command @@ -108,10 +121,10 @@ Run each prompt multiple times for pass@k/pass^k analysis. 
```bash # Capture only (no grader) -bunx @plaited/acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 -o trials.jsonl +bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -o trials.jsonl # With grader (computes pass@k, pass^k) -bunx @plaited/acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.ts -o trials.jsonl +bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 --grader ./grader.ts -o trials.jsonl ``` ### Output @@ -132,10 +145,10 @@ Derive compact views from full trajectory results. ```bash # Summary JSONL (for jq analysis) -bunx @plaited/acp-harness summarize results.jsonl -o summary.jsonl +bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl # Markdown (for LLM-as-judge) -bunx @plaited/acp-harness summarize results.jsonl --markdown -o results.md +bunx @plaited/agent-eval-harness summarize results.jsonl --markdown -o results.md ``` ## Calibrate Command @@ -144,10 +157,10 @@ Sample failures for grader review. Calibration helps you distinguish between **a ```bash # Sample failures for human review -bunx @plaited/acp-harness calibrate results.jsonl --sample 10 -o calibration.md +bunx @plaited/agent-eval-harness calibrate results.jsonl --sample 10 -o calibration.md # Re-score with different grader to compare -bunx @plaited/acp-harness calibrate results.jsonl --grader ./loose-grader.ts --sample 10 -o comparison.md +bunx @plaited/agent-eval-harness calibrate results.jsonl --grader ./loose-grader.ts --sample 10 -o comparison.md ``` See [eval-concepts.md](references/eval-concepts.md#grader-calibration) for why calibration matters. @@ -158,7 +171,7 @@ Check that reference solutions pass your grader before evaluating agents. 
```bash # Validate reference solutions -bunx @plaited/acp-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl +bunx @plaited/agent-eval-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl # Check for failures cat validation.jsonl | jq 'select(.pass == false)' @@ -193,10 +206,10 @@ Analyze test set coverage to ensure balanced evaluation. ```bash # Analyze prompt distribution -bunx @plaited/acp-harness balance prompts.jsonl -o balance.json +bunx @plaited/agent-eval-harness balance prompts.jsonl -o balance.json # Pretty print -bunx @plaited/acp-harness balance prompts.jsonl | jq . +bunx @plaited/agent-eval-harness balance prompts.jsonl | jq . ``` ### Why Use This? @@ -235,21 +248,151 @@ Include both positive and negative cases: See [eval-concepts.md](references/eval-concepts.md#test-set-balance) for more on balanced test sets. +## Pipeline Workflow + +The pipeline commands enable Unix-style composition for flexible evaluation workflows. + +### Full Pipeline Example + +```bash +# Execute → Extract → Grade → Format in one pipeline +cat prompts.jsonl | \ + bunx @plaited/agent-eval-harness run -s claude.json | \ + bunx @plaited/agent-eval-harness extract -s claude.json | \ + bunx @plaited/agent-eval-harness grade -g ./grader.ts | \ + bunx @plaited/agent-eval-harness format -f markdown > report.md +``` + +### Run Command + +Execute prompts and output raw results. Three modes available: + +```bash +# Schema mode (recommended) +bunx @plaited/agent-eval-harness run prompts.jsonl --schema claude.json + +# Simple mode: {} placeholder substitution +bunx @plaited/agent-eval-harness run prompts.jsonl --simple "claude -p {} --output-format stream-json" + +# Shell mode: $PROMPT environment variable +bunx @plaited/agent-eval-harness run prompts.jsonl --shell 'claude -p "$PROMPT" --output-format stream-json' +``` + +> **⚠️ Security Warning:** The `--simple` and `--shell` modes execute prompts via shell commands. 
Prompts are escaped but **do not use untrusted prompt content** with these modes. Malicious prompt text could potentially escape the quoting and execute arbitrary commands. Use `--schema` mode (headless adapter) for untrusted inputs. + +### Extract Command + +Parse raw output into structured trajectories: + +```bash +# From file +bunx @plaited/agent-eval-harness extract raw.jsonl --schema claude.json -o extracted.jsonl + +# Piped from run +bunx @plaited/agent-eval-harness run prompts.jsonl -s claude.json | \ + bunx @plaited/agent-eval-harness extract -s claude.json +``` + +### Grade Command + +Apply grader to extracted results: + +```bash +bunx @plaited/agent-eval-harness grade extracted.jsonl --grader ./grader.ts -o graded.jsonl +``` + +### Format Command + +Convert results to different output formats: + +```bash +# Markdown report +bunx @plaited/agent-eval-harness format results.jsonl --style markdown -o report.md + +# CSV for spreadsheets +bunx @plaited/agent-eval-harness format results.jsonl --style csv -o results.csv + +# JSONL (pass-through, default) +bunx @plaited/agent-eval-harness format results.jsonl --style jsonl +``` + +### Compare Command + +Compare multiple runs of the same prompts: + +```bash +# Compare multiple result files +bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl run3.jsonl \ + --grader ./compare-grader.ts -o comparison.jsonl + +# With explicit labels +bunx @plaited/agent-eval-harness compare \ + --run "with-mcp:results-mcp.jsonl" \ + --run "vanilla:results-vanilla.jsonl" \ + --grader ./compare-grader.ts +``` + +**Use cases for compare:** +- Same agent, different MCP servers +- Same agent, different skills enabled +- Same agent, different model versions +- Different agents entirely + +### Comparison Grader Interface + +```typescript +import type { ComparisonGrader } from '@plaited/agent-eval-harness/pipeline' + +export const grade: ComparisonGrader = async ({ id, input, hint, runs }) => { + // runs is Record + // Return 
rankings from best to worst + return { + rankings: [ + { run: 'with-mcp', rank: 1, score: 0.9 }, + { run: 'vanilla', rank: 2, score: 0.7 }, + ], + reasoning: 'MCP run produced more accurate output' + } +} +``` + +### Pipeline Workflow Diagram + +```mermaid +flowchart LR + Prompts["prompts.jsonl"] --> Run["run"] + Schema["headless schema"] --> Run + Run --> Raw["raw.jsonl"] + Raw --> Extract["extract"] + Schema --> Extract + Extract --> Extracted["extracted.jsonl"] + Extracted --> Grade["grade"] + Grader["grader.ts"] --> Grade + Grade --> Graded["graded.jsonl"] + Graded --> Format["format"] + Format --> Output["report.md / .csv / .jsonl"] + + Graded --> Compare["compare"] + Results2["other runs..."] --> Compare + CompareGrader["compare-grader.ts"] --> Compare + Compare --> Comparison["comparison.jsonl"] +``` + ## Schemas Command Export JSON schemas for non-TypeScript tools. ```bash # List available schemas -bunx @plaited/acp-harness schemas +bunx @plaited/agent-eval-harness schemas # Export all schemas as JSON -bunx @plaited/acp-harness schemas --json -o schemas.json +bunx @plaited/agent-eval-harness schemas --json -o schemas.json # Export specific schema -bunx @plaited/acp-harness schemas CaptureResult --json -bunx @plaited/acp-harness schemas TrialResult --json -bunx @plaited/acp-harness schemas GraderResult --json +bunx @plaited/agent-eval-harness schemas CaptureResult --json +bunx @plaited/agent-eval-harness schemas TrialResult --json +bunx @plaited/agent-eval-harness schemas GraderResult --json ``` ### Available Schemas @@ -269,7 +412,7 @@ Export schemas for validation in Python, Go, etc.: ```bash # Export all schemas -bunx @plaited/acp-harness schemas --json -o schemas.json +bunx @plaited/agent-eval-harness schemas --json -o schemas.json # Use in Python with jsonschema python -c " @@ -295,7 +438,7 @@ Graders provide semantic pass/fail scoring for captured trajectories. 
The harnes ```typescript // my-grader.ts -import type { Grader } from '@plaited/acp-harness/schemas' +import type { Grader } from '@plaited/agent-eval-harness/schemas' export const grade: Grader = async ({ input, output, hint, trajectory }) => { const pass = output.toLowerCase().includes(hint?.toLowerCase() ?? '') @@ -331,7 +474,7 @@ print(json.dumps({ ```bash chmod +x ./grader.py -bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py -o results.jsonl +bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.py -o results.jsonl ``` See [graders.md](references/graders.md) for complete polyglot grader documentation including shell scripts and LLM-as-judge patterns. @@ -375,7 +518,7 @@ Full trajectory JSONL (always): ], "metadata": { "category": "search", - "agent": "bunx claude-code-acp", + "agent": "--schema ./claude.json", "trajectoryRichness": "full", "turnCount": 1 }, @@ -413,7 +556,7 @@ Full trajectory JSONL (always): Consumers can import Zod schemas directly: ```typescript -import { CaptureResultSchema, TrialResultSchema } from '@plaited/acp-harness/schemas' +import { CaptureResultSchema, TrialResultSchema } from '@plaited/agent-eval-harness/schemas' // Validate external data const result = CaptureResultSchema.parse(jsonData) @@ -426,8 +569,8 @@ const jsonSchema = z.toJSONSchema(CaptureResultSchema) Or export JSON schemas for non-TypeScript tools: ```bash -bunx @plaited/acp-harness schemas --json -o schemas.json -bunx @plaited/acp-harness schemas CaptureResult --json +bunx @plaited/agent-eval-harness schemas --json -o schemas.json +bunx @plaited/agent-eval-harness schemas CaptureResult --json ``` ## Execution Environment @@ -436,10 +579,10 @@ bunx @plaited/acp-harness schemas CaptureResult --json ```bash # Run integration tests via Docker -docker compose -f docker-compose.test.yml run --rm acp-test +docker compose -f docker-compose.test.yml run --rm test # Or with explicit API keys 
-ANTHROPIC_API_KEY=sk-... GEMINI_API_KEY=... docker compose -f docker-compose.test.yml run --rm acp-test +ANTHROPIC_API_KEY=sk-... GEMINI_API_KEY=... docker compose -f docker-compose.test.yml run --rm test ``` ### Docker Requirements @@ -465,13 +608,13 @@ Run with the headless adapter: ```bash # Using Claude Code via headless adapter -bunx @plaited/acp-harness capture multi-turn.jsonl \ - bunx @plaited/acp-harness headless --schema ./claude-headless.json \ +bunx @plaited/agent-eval-harness capture multi-turn.jsonl \ + bunx @plaited/agent-eval-harness headless --schema ./claude-headless.json \ -o results.jsonl # Using Gemini CLI via headless adapter -GEMINI_API_KEY=... bunx @plaited/acp-harness capture multi-turn.jsonl \ - bunx @plaited/acp-harness headless --schema ./gemini-headless.json \ +GEMINI_API_KEY=... bunx @plaited/agent-eval-harness capture multi-turn.jsonl \ + bunx @plaited/agent-eval-harness headless --schema ./gemini-headless.json \ -o results.jsonl ``` @@ -493,7 +636,7 @@ cat results.jsonl | jq 'select(.metadata.category == "ui")' cat results.jsonl | jq -s 'map(.trajectory | map(select(.type == "tool_call")) | length) | add' # Summarize for quick analysis -bunx @plaited/acp-harness summarize results.jsonl -o summary.jsonl +bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl ``` See [downstream.md](references/downstream.md) for integration patterns with Braintrust, Gemini, and custom scorers. 
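As one example of downstream scoring outside the harness, the pass@k / pass^k metrics reported by the `trials` command can be recomputed from raw per-trial pass/fail verdicts. A minimal TypeScript sketch — the `Trial` shape and field names here are illustrative assumptions, not the harness's exact schema:

```typescript
// Sketch: deriving trial metrics from per-trial grader verdicts.
// `pass` is illustrative shorthand for a grader result's pass flag.

type Trial = { pass: boolean }

const trialStats = (trials: Trial[], k: number) => {
  // Empirical per-trial pass probability
  const p = trials.filter((t) => t.pass).length / trials.length
  return {
    passRate: p,               // raw fraction of passing trials
    passAtK: 1 - (1 - p) ** k, // P(at least one of k runs passes)
    passExpK: p ** k,          // P(all k runs pass)
  }
}

// 3 of 5 trials passed:
const s = trialStats(
  [{ pass: true }, { pass: true }, { pass: true }, { pass: false }, { pass: false }],
  5,
)
console.log(s) // passRate 0.6, passAtK ≈ 0.98976, passExpK ≈ 0.07776
```

A high `passAtK` with a low `passExpK` (as above) signals a capable-but-unreliable task: fine for capability evals, not yet fit for a regression gate.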
@@ -502,7 +645,7 @@ See [downstream.md](references/downstream.md) for integration patterns with Brai | Resource | Description | |----------|-------------| -| `bunx @plaited/acp-harness` | CLI help | +| `bunx @plaited/agent-eval-harness` | CLI help | | [output-formats.md](references/output-formats.md) | JSONL schemas, command details | | [downstream.md](references/downstream.md) | Integration patterns (Braintrust, jq, custom scorers) | | [graders.md](references/graders.md) | Polyglot grader documentation (TypeScript, Python, shell) | @@ -511,5 +654,4 @@ See [downstream.md](references/downstream.md) for integration patterns with Brai ## Related -- **[@agentclientprotocol/sdk](https://www.npmjs.com/package/@agentclientprotocol/sdk)** - ACP SDK for programmatic access -- **[@zed-industries/claude-code-acp](https://www.npmjs.com/package/@zed-industries/claude-code-acp)** - Claude Code ACP adapter +- **[headless-adapters skill](../headless-adapters/SKILL.md)** - Schema-driven adapters for headless CLI agents diff --git a/.claude/skills/acp-harness/assets/Dockerfile.acp b/.claude/skills/agent-eval-harness/assets/Dockerfile.eval similarity index 59% rename from .claude/skills/acp-harness/assets/Dockerfile.acp rename to .claude/skills/agent-eval-harness/assets/Dockerfile.eval index 14e6a8a..f1241e5 100644 --- a/.claude/skills/acp-harness/assets/Dockerfile.acp +++ b/.claude/skills/agent-eval-harness/assets/Dockerfile.eval @@ -1,11 +1,11 @@ -# ACP Harness Docker Configuration +# Agent Eval Harness Docker Configuration # -# Example Dockerfile for running ACP evaluations in an isolated container. +# Example Dockerfile for running agent evaluations in an isolated container. # Copy this to your project and customize as needed. # # Usage: -# docker build -f Dockerfile.acp -t acp-harness . -# docker run --rm -e ANTHROPIC_API_KEY acp-harness bunx @plaited/acp-harness prompts.jsonl +# docker build -f Dockerfile.eval -t agent-eval-harness .
+# docker run --rm -e ANTHROPIC_API_KEY agent-eval-harness bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json FROM oven/bun:1.2.9 diff --git a/.claude/skills/agent-eval-harness/assets/docker-compose.eval.yml b/.claude/skills/agent-eval-harness/assets/docker-compose.eval.yml new file mode 100644 index 0000000..9b8e4b0 --- /dev/null +++ b/.claude/skills/agent-eval-harness/assets/docker-compose.eval.yml @@ -0,0 +1,19 @@ +# Agent Eval Harness Docker Compose Configuration +# +# Example docker-compose for running agent evaluations. +# Copy this to your project and customize as needed. +# +# Usage: +# ANTHROPIC_API_KEY=sk-... docker compose -f docker-compose.eval.yml run --rm agent-eval-harness + +services: + agent-eval-harness: + build: + context: . + dockerfile: Dockerfile.eval + environment: + - ANTHROPIC_API_KEY + volumes: + # Mount output directory to persist results + - ./results:/app/results + command: ["bunx", "@plaited/agent-eval-harness", "capture", "prompts.jsonl", "--schema", "./claude.json", "-o", "results/output.jsonl"] diff --git a/.claude/skills/acp-harness/references/docker-evals.md b/.claude/skills/agent-eval-harness/references/docker-evals.md similarity index 89% rename from .claude/skills/acp-harness/references/docker-evals.md rename to .claude/skills/agent-eval-harness/references/docker-evals.md index 283473b..89a2769 100644 --- a/.claude/skills/acp-harness/references/docker-evals.md +++ b/.claude/skills/agent-eval-harness/references/docker-evals.md @@ -1,6 +1,6 @@ # Running Evals in Docker -Docker provides a consistent, isolated environment for running ACP evaluations. This guide covers lessons learned from real debugging sessions. +Docker provides a consistent, isolated environment for running agent evaluations. This guide covers lessons learned from real debugging sessions. ## Why Docker? 
@@ -98,7 +98,7 @@ which gemini # fails **Solution:** Verify symlinks point to accessible locations: ```bash # Debug inside container -docker compose run --rm acp-test bash -c 'which gemini && ls -la $(which gemini)' +docker compose run --rm test bash -c 'which gemini && ls -la $(which gemini)' ``` ### 5. Environment Variables Not Passed @@ -118,7 +118,7 @@ When tests fail in Docker, run these checks: ```bash # 1. Verify CLI installation and access -docker compose run --rm acp-test bash -c ' +docker compose run --rm test bash -c ' echo "=== Node.js ===" && node --version && echo "=== Bun ===" && bun --version && echo "=== Claude ===" && which claude && claude --version && @@ -126,18 +126,18 @@ docker compose run --rm acp-test bash -c ' ' # 2. Verify environment variables -docker compose run --rm acp-test bash -c ' +docker compose run --rm test bash -c ' echo "ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:+set}" echo "GEMINI_API_KEY: ${GEMINI_API_KEY:+set}" ' # 3. Test CLI directly -docker compose run --rm acp-test bash -c ' +docker compose run --rm test bash -c ' gemini -p "Say hello" --output-format stream-json 2>&1 | head -5 ' # 4. Run as root to isolate permission issues -docker compose run --rm --user root acp-test bash -c 'whoami && which claude' +docker compose run --rm --user root test bash -c 'whoami && which claude' ``` ## CI Integration (GitHub Actions) @@ -147,11 +147,11 @@ test-integration: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - - name: Run ACP integration tests + - name: Run integration tests env: ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }} - run: docker compose -f docker-compose.test.yml run --rm acp-test + run: docker compose -f docker-compose.test.yml run --rm test ``` ## Version Matrix @@ -169,7 +169,7 @@ Tested configurations: ```yaml services: - acp-test: + test: build: context: . 
dockerfile: Dockerfile.test diff --git a/.claude/skills/acp-harness/references/downstream.md b/.claude/skills/agent-eval-harness/references/downstream.md similarity index 90% rename from .claude/skills/acp-harness/references/downstream.md rename to .claude/skills/agent-eval-harness/references/downstream.md index 691337c..d3edace 100644 --- a/.claude/skills/acp-harness/references/downstream.md +++ b/.claude/skills/agent-eval-harness/references/downstream.md @@ -21,7 +21,7 @@ Use `summarize` command for quick jq analysis: ```bash # First derive summary -acp-harness summarize results.jsonl -o summary.jsonl +agent-eval-harness summarize results.jsonl -o summary.jsonl # Calculate average duration cat summary.jsonl | jq -s 'map(.duration) | add / length' @@ -172,7 +172,7 @@ Use the `--grader` flag to add scoring to capture results. The harness supports ```typescript // my-grader.ts -import type { Grader } from '@plaited/acp-harness/schemas' +import type { Grader } from '@plaited/agent-eval-harness/schemas' export const grade: Grader = async ({ input, output, hint, trajectory }) => { const pass = output.toLowerCase().includes(hint?.toLowerCase() ?? 
'') @@ -185,7 +185,7 @@ export const grade: Grader = async ({ input, output, hint, trajectory }) => { ``` ```bash -acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./my-grader.ts -o results.jsonl +agent-eval-harness capture prompts.jsonl bunx claude-code-acp --grader ./my-grader.ts -o results.jsonl ``` ### Python Grader @@ -228,7 +228,7 @@ print(json.dumps({ ```bash chmod +x ./grader.py -acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py -o results.jsonl +agent-eval-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py -o results.jsonl ``` ### Detection Logic @@ -301,7 +301,7 @@ Use markdown summary for smaller context: import Anthropic from '@anthropic-ai/sdk' // Generate markdown summary first -await Bun.$`acp-harness summarize results.jsonl --markdown -o results.md` +await Bun.$`agent-eval-harness summarize results.jsonl --markdown -o results.md` const client = new Anthropic() const markdown = await Bun.file('results.md').text() @@ -372,25 +372,25 @@ jobs: - uses: actions/checkout@v4 - uses: oven-sh/setup-bun@v2 - - name: Install ACP adapter + - name: Install agent adapter run: npm install -g @zed-industries/claude-code-acp - name: Install dependencies - run: bun add @plaited/acp-harness + run: bun add @plaited/agent-eval-harness - name: Run harness env: ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} run: | - bunx @plaited/acp-harness capture prompts.jsonl \ + bunx @plaited/agent-eval-harness capture prompts.jsonl \ bunx claude-code-acp \ --progress \ -o results.jsonl - name: Generate summary run: | - bunx @plaited/acp-harness summarize results.jsonl -o summary.jsonl - bunx @plaited/acp-harness summarize results.jsonl --markdown -o results.md + bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl + bunx @plaited/agent-eval-harness summarize results.jsonl --markdown -o results.md - name: Upload results uses: actions/upload-artifact@v4 @@ -408,8 +408,8 @@ Combine multiple runs: ```bash # 
Append mode during runs -acp-harness capture prompts-1.jsonl bunx claude-code-acp --append -o combined.jsonl -acp-harness capture prompts-2.jsonl bunx claude-code-acp --append -o combined.jsonl +agent-eval-harness capture prompts-1.jsonl bunx claude-code-acp --append -o combined.jsonl +agent-eval-harness capture prompts-2.jsonl bunx claude-code-acp --append -o combined.jsonl # Merge separate files cat run1.jsonl run2.jsonl run3.jsonl > combined.jsonl @@ -422,7 +422,7 @@ cat run1.jsonl run2.jsonl run3.jsonl > combined.jsonl Use the `trials` command to measure pass@k/pass^k: ```bash -acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.ts -o trials.jsonl +agent-eval-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.ts -o trials.jsonl ``` ```typescript diff --git a/.claude/skills/acp-harness/references/eval-concepts.md b/.claude/skills/agent-eval-harness/references/eval-concepts.md similarity index 93% rename from .claude/skills/acp-harness/references/eval-concepts.md rename to .claude/skills/agent-eval-harness/references/eval-concepts.md index 5060f78..e5d6f70 100644 --- a/.claude/skills/acp-harness/references/eval-concepts.md +++ b/.claude/skills/agent-eval-harness/references/eval-concepts.md @@ -55,7 +55,7 @@ Run a prompt 5 times, 3 pass (60% raw pass rate): ```bash # Run many trials to assess capability -acp-harness trials new-prompts.jsonl bunx agent -k 10 --grader ./grader.ts -o capability.jsonl +agent-eval-harness trials new-prompts.jsonl bunx agent -k 10 --grader ./grader.ts -o capability.jsonl # Analyze results cat capability.jsonl | jq 'select(.passAtK > 0.9) | {id, passAtK}' @@ -70,7 +70,7 @@ Questions answered: ```bash # Run fewer trials for known-good tasks -acp-harness trials regression-suite.jsonl bunx agent -k 3 --grader ./grader.ts -o regression.jsonl +agent-eval-harness trials regression-suite.jsonl bunx agent -k 3 --grader ./grader.ts -o regression.jsonl # Fail CI if reliability drops cat 
regression.jsonl | jq -e 'all(.passExpK > 0.8)' @@ -118,7 +118,7 @@ Week 3: 60% grader pass → 80% actually correct (grader rejected 20% valid) ```bash # Sample 10 failures for human review -acp-harness calibrate results.jsonl --sample 10 -o calibration.md +agent-eval-harness calibrate results.jsonl --sample 10 -o calibration.md ``` Review the markdown output and label each sample: @@ -171,7 +171,7 @@ Reference solutions prove a task is solvable before blaming the agent. ```bash # Check that reference solutions pass your grader -acp-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl +agent-eval-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl # If references fail, your grader or task is broken cat validation.jsonl | jq 'select(.pass == false)' @@ -199,7 +199,7 @@ An eval with only "make X work" misses "don't break Y". ### Using the Balance Command ```bash -acp-harness balance prompts.jsonl -o balance.json +agent-eval-harness balance prompts.jsonl -o balance.json ``` Analyzes: diff --git a/.claude/skills/acp-harness/references/graders.md b/.claude/skills/agent-eval-harness/references/graders.md similarity index 93% rename from .claude/skills/acp-harness/references/graders.md rename to .claude/skills/agent-eval-harness/references/graders.md index a2457e0..fc187af 100644 --- a/.claude/skills/acp-harness/references/graders.md +++ b/.claude/skills/agent-eval-harness/references/graders.md @@ -17,7 +17,7 @@ Export a `grade` function matching the `Grader` type: ```typescript // my-grader.ts -import type { Grader } from '@plaited/acp-harness/schemas' +import type { Grader } from '@plaited/agent-eval-harness/schemas' export const grade: Grader = async ({ input, output, hint, trajectory }) => { // Your scoring logic @@ -32,7 +32,7 @@ export const grade: Grader = async ({ input, output, hint, trajectory }) => { **Usage:** ```bash -acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./my-grader.ts -o results.jsonl 
+agent-eval-harness capture prompts.jsonl bunx claude-code-acp --grader ./my-grader.ts -o results.jsonl ``` ## Python Grader @@ -69,7 +69,7 @@ print(json.dumps({ **Usage:** ```bash chmod +x ./grader.py -acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py -o results.jsonl +agent-eval-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py -o results.jsonl ``` ## Executable Protocol @@ -141,7 +141,7 @@ Wrap an LLM call in your grader for semantic evaluation: ```typescript // llm-judge.ts import Anthropic from '@anthropic-ai/sdk' -import type { Grader } from '@plaited/acp-harness/schemas' +import type { Grader } from '@plaited/agent-eval-harness/schemas' const client = new Anthropic() diff --git a/.claude/skills/acp-harness/references/output-formats.md b/.claude/skills/agent-eval-harness/references/output-formats.md similarity index 82% rename from .claude/skills/acp-harness/references/output-formats.md rename to .claude/skills/agent-eval-harness/references/output-formats.md index 8cab355..c70d6ce 100644 --- a/.claude/skills/acp-harness/references/output-formats.md +++ b/.claude/skills/agent-eval-harness/references/output-formats.md @@ -7,27 +7,31 @@ The harness uses a "capture once, derive many views" approach. 
The `capture` com The `capture` command always outputs full trajectory JSONL: ```bash -acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl +agent-eval-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl ``` ### Schema ```typescript type CaptureResult = { - id: string // Prompt identifier - input: string // Original prompt text - output: string // Final agent response - hint?: string // Grader context (if provided in prompt) - trajectory: TrajectoryStep[] // Full execution trajectory + id: string // Prompt identifier + input: string | string[] // Single prompt or multi-turn conversation + output: string // Final agent response + hint?: string // Grader context (if provided in prompt) + trajectory: TrajectoryStep[] // Full execution trajectory metadata: Record // Prompt metadata timing: { - start: number // Unix timestamp (ms) - end: number // Unix timestamp (ms) - firstResponse?: number // Time to first response (ms) + start: number // Unix timestamp (ms) + end: number // Unix timestamp (ms) + firstResponse?: number // Time to first response (ms) + sessionCreation: number // Time to create session (ms) + total: number // Total duration (end - start, ms) + inputTokens?: number // Input tokens consumed (if available) + outputTokens?: number // Output tokens generated (if available) } - toolErrors: boolean // Whether any tool calls failed - errors?: string[] // Error messages (if any) - score?: GraderResult // Grader score (if grader was provided) + toolErrors: boolean // Whether any tool calls failed + errors?: string[] // Error messages (if any) + score?: GraderResult // Grader score (if grader was provided) } type TrajectoryStep = @@ -35,7 +39,7 @@ type TrajectoryStep = | { type: 'message'; content: string; timestamp: number; stepId?: string } | { type: 'tool_call' - name: string // Tool title from ACP SDK + name: string // Tool title status: string // pending, in_progress, completed, failed input?: unknown // Raw input parameters 
output?: unknown // Raw output @@ -63,7 +67,7 @@ type GraderResult = { The `summarize` command derives compact JSONL from full trajectory: ```bash -acp-harness summarize results.jsonl -o summary.jsonl +agent-eval-harness summarize results.jsonl -o summary.jsonl ``` ### Schema @@ -103,7 +107,7 @@ cat summary.jsonl | jq 'select(.output | contains("error"))' The `summarize` command can also produce markdown for LLM-as-judge workflows: ```bash -acp-harness summarize results.jsonl --markdown -o results.md +agent-eval-harness summarize results.jsonl --markdown -o results.md ``` ### Structure @@ -147,7 +151,7 @@ acp-harness summarize results.jsonl --markdown -o results.md The `trials` command produces per-prompt trial results: ```bash -acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.ts -o trials.jsonl +agent-eval-harness trials prompts.jsonl --schema ./claude-headless.json -k 5 --grader ./grader.ts -o trials.jsonl ``` ### Schema @@ -224,7 +228,7 @@ The `toolErrors` field indicates whether any tool calls failed during execution: **Note:** `toolErrors` only indicates tool-level failures.
For semantic pass/fail (did the agent accomplish the task?), use a grader: ```bash -acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.ts -o results.jsonl +agent-eval-harness capture prompts.jsonl --schema ./claude-headless.json --grader ./grader.ts -o results.jsonl ``` ## Input Format @@ -250,7 +254,7 @@ All commands stream output line-by-line as results complete: ```bash # Watch results in real-time -acp-harness capture prompts.jsonl bunx claude-code-acp --progress -o results.jsonl & +agent-eval-harness capture prompts.jsonl --schema ./claude-headless.json --progress -o results.jsonl & tail -f results.jsonl ``` diff --git a/.claude/skills/acp-adapters/SKILL.md b/.claude/skills/headless-adapters/SKILL.md similarity index 63% rename from .claude/skills/acp-adapters/SKILL.md rename to .claude/skills/headless-adapters/SKILL.md index 0db186a..7242d8e 100644 --- a/.claude/skills/acp-adapters/SKILL.md +++ b/.claude/skills/headless-adapters/SKILL.md @@ -1,10 +1,10 @@ --- -name: acp-adapters -description: Discover, create, and validate ACP adapters for agent integration. Includes scaffolding tools and compliance testing for the Agent Client Protocol. +name: headless-adapters +description: Discover, create, and validate headless adapters for agent integration. Includes schema-driven adapter tooling and pre-built schemas for headless CLI agents. compatibility: Bun >= 1.2.9 --- -# ACP Adapters +# Headless Adapters ## Purpose @@ -13,7 +13,6 @@ Schema-driven adapter for headless CLI agents. **No code required** - just defin | Use Case | Tool | |----------|------| | Wrap headless CLI agent | `headless` command | -| Verify implementation | `adapter:check` command | | Create new schemas | [Schema Creation Guide](references/schema-creation-guide.md) | ## Quick Start @@ -21,21 +20,17 @@ Schema-driven adapter for headless CLI agents. **No code required** - just defin 1. **Check if a schema exists** in [schemas/](schemas/) 2. **Run the adapter:** ```bash - ANTHROPIC_API_KEY=...
bunx @plaited/acp-harness headless --schema .claude/skills/acp-adapters/schemas/claude-headless.json - ``` -3. **Validate compliance:** - ```bash - bunx @plaited/acp-harness adapter:check bunx @plaited/acp-harness headless --schema ./my-schema.json + ANTHROPIC_API_KEY=... bunx @plaited/agent-eval-harness headless --schema .claude/skills/headless-adapters/schemas/claude-headless.json ``` ## CLI Commands ### headless -Schema-driven ACP adapter for ANY headless CLI agent. +Schema-driven adapter for ANY headless CLI agent. ```bash -bunx @plaited/acp-harness headless --schema +bunx @plaited/agent-eval-harness headless --schema ``` **Options:** @@ -82,46 +77,19 @@ Both modes support multi-turn conversations. Send multiple prompts to the same s ```typescript // Create one session, send multiple prompts -const session = await client.createSession({ cwd: PROJECT_ROOT }) +const session = await manager.createSession({ cwd: PROJECT_ROOT }) // Turn 1 -const { updates: turn1 } = await client.promptSync(session.id, createPrompt('Remember: 42')) +const turn1 = await manager.prompt(session.id, 'Remember: 42') // Turn 2 - context is maintained -const { updates: turn2 } = await client.promptSync(session.id, createPrompt('What number?')) +const turn2 = await manager.prompt(session.id, 'What number?') ``` How context is preserved: - **stream mode:** Process stays alive, CLI maintains internal state - **iterative mode:** Adapter builds history using `historyTemplate` from schema ---- - -### adapter:check - -Validate that an adapter implements the ACP protocol correctly. - -```bash -bunx @plaited/acp-harness adapter:check [args...] 
-``` - -**Options:** -| Flag | Description | Default | -|------|-------------|---------| -| `--timeout` | Timeout for each check in ms | `5000` | -| `--verbose` | Show detailed protocol messages | false | - -**Checks Performed:** - -| Check | Description | -|-------|-------------| -| `spawn` | Adapter can be launched as subprocess | -| `initialize` | Responds to initialize with valid `agentCapabilities` | -| `session/new` | Creates session and returns `sessionId` | -| `session/prompt` | Accepts prompt and emits `session/update` notifications | -| `session/cancel` | Accepts cancel notification gracefully | -| `framing` | All messages are newline-delimited JSON-RPC 2.0 | - ## Pre-built Schemas Tested schemas are available in [schemas/](schemas/): @@ -134,10 +102,10 @@ Tested schemas are available in [schemas/](schemas/): **Usage:** ```bash # Claude Code -ANTHROPIC_API_KEY=... bunx @plaited/acp-harness headless --schema .claude/skills/acp-adapters/schemas/claude-headless.json +ANTHROPIC_API_KEY=... bunx @plaited/agent-eval-harness headless --schema .claude/skills/headless-adapters/schemas/claude-headless.json # Gemini CLI -GEMINI_API_KEY=... bunx @plaited/acp-harness headless --schema .claude/skills/acp-adapters/schemas/gemini-headless.json +GEMINI_API_KEY=... bunx @plaited/agent-eval-harness headless --schema .claude/skills/headless-adapters/schemas/gemini-headless.json ``` ## Agents with Headless CLI Support @@ -160,10 +128,9 @@ GEMINI_API_KEY=... bunx @plaited/acp-harness headless --schema .claude/skills/ac 1. Explore the CLI's `--help` to identify prompt, output, and auto-approve flags 2. Capture sample JSON output from the CLI -3. Map JSONPath patterns to ACP events +3. Map JSONPath patterns to output events 4. Create schema file based on an existing template 5. Test with `headless` command -6. Validate with `adapter:check` See [Schema Creation Guide](references/schema-creation-guide.md) for the complete workflow. 
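The schema-creation steps above boil down to a single JSON file. A minimal sketch for a hypothetical `my-agent` CLI — field names mirror the gemini-headless.json schema shipped in this repo, but the exact values for your agent are assumptions until tested with the `headless` command:

```json
{
  "command": ["my-agent", "exec"],
  "output": { "flag": "--output-format", "value": "json" },
  "result": {
    "matchPath": "$.type",
    "matchValue": "result",
    "contentPath": "$.content"
  },
  "historyTemplate": "User: {{input}}\nAssistant: {{output}}"
}
```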
@@ -179,7 +146,7 @@ See [Schema Creation Guide](references/schema-creation-guide.md) for the complet | Timeout on prompt | JSONPath not matching | Capture raw CLI output, verify paths - [see guide](references/troubleshooting-guide.md#jsonpath-debugging) | | Empty responses | Content extraction failing | Check extract paths - [see guide](references/troubleshooting-guide.md#output-event-matching) | -**📖 Complete troubleshooting documentation:** [Troubleshooting Guide](references/troubleshooting-guide.md) +**Complete troubleshooting documentation:** [Troubleshooting Guide](references/troubleshooting-guide.md) This guide includes: - Detailed debugging steps for each issue @@ -197,33 +164,13 @@ This guide includes: 2. **Test headless adapter directly:** ```bash - printf '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":1}}\n' | \ - bunx @plaited/acp-harness headless --schema ./my-schema.json - ``` - -3. **Run adapter:check for diagnostics:** - ```bash - bunx @plaited/acp-harness adapter:check \ - bunx @plaited/acp-harness headless --schema ./my-schema.json --verbose + bunx @plaited/agent-eval-harness headless --schema ./my-schema.json -p "Hello" ``` ## External Resources -- **ACP-Compatible Agents**: [agentclientprotocol.com/overview/agents](https://agentclientprotocol.com/overview/agents) - **AgentSkills Spec**: [agentskills.io](https://agentskills.io) -- **ACP Protocol Docs**: Use the MCP server for protocol questions: - ```json - { - "mcpServers": { - "agent-client-protocol-docs": { - "type": "http", - "url": "https://agentclientprotocol.com/mcp" - } - } - } - ``` ## Related -- **[acp-harness skill](../acp-harness/SKILL.md)** - Running evaluations against adapters -- **[@agentclientprotocol/sdk](https://www.npmjs.com/package/@agentclientprotocol/sdk)** - ACP SDK with TypeScript types +- **[agent-eval-harness skill](../agent-eval-harness/SKILL.md)** - Running evaluations against adapters diff --git 
a/.claude/skills/acp-adapters/references/schema-creation-guide.md b/.claude/skills/headless-adapters/references/schema-creation-guide.md similarity index 87% rename from .claude/skills/acp-adapters/references/schema-creation-guide.md rename to .claude/skills/headless-adapters/references/schema-creation-guide.md index 40fdc12..51eca04 100644 --- a/.claude/skills/acp-adapters/references/schema-creation-guide.md +++ b/.claude/skills/headless-adapters/references/schema-creation-guide.md @@ -4,7 +4,7 @@ Step-by-step workflow for creating headless adapter schemas for CLI coding agent ## Overview -The headless adapter transforms any CLI agent with JSON output into an ACP-compatible adapter. You just need a schema file describing how to interact with the CLI. +The headless adapter transforms any CLI agent with JSON output into a protocol-compatible adapter. You just need a schema file describing how to interact with the CLI. ## Workflow @@ -14,8 +14,7 @@ flowchart TD B --> C["3. Capture Sample Output"] C --> D["4. Map JSONPath Patterns"] D --> E["5. Create Schema File"] - E --> F["6. Test with Headless"] - F --> G["7. Validate with adapter:check"] + E --> F["6. Test with Debug Mode"] ``` ### Step 1: Explore CLI Help @@ -82,7 +81,7 @@ AGENT_API_KEY=... exec -o stream-json "Say hello" | jq -c '.' 
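A capture like the one above typically yields one JSON event per line; the shapes below are illustrative only, not the output of any specific CLI:

```json
{"type": "message", "content": "Hello!"}
{"type": "tool_use", "name": "read_file", "input": {"path": "README.md"}}
{"type": "result", "content": "Done."}
```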
Analyze the output to create event mappings: -| JSON Event | ACP Event Type | Extract Fields | +| JSON Event | Event Type | Extract Fields | |------------|---------------|----------------| | `{"type": "message", ...}` | `message` | `$.content` | | `{"type": "tool_use", ...}` | `tool_call` | `$.name` (title), `"pending"` (status) | @@ -104,7 +103,7 @@ Use an existing schema as a template: ```bash # Copy from tested schema -cp .claude/skills/acp-adapters/schemas/claude-headless.json ./my-agent-headless.json +cp .claude/skills/headless-adapters/schemas/claude-headless.json ./my-agent-headless.json ``` Modify for your agent: @@ -151,25 +150,21 @@ Run the headless adapter with your schema: ```bash # Test the adapter -AGENT_API_KEY=... bunx @plaited/acp-harness headless --schema ./my-agent-headless.json +AGENT_API_KEY=... bunx @plaited/agent-eval-harness headless --schema ./my-agent-headless.json ``` -### Step 7: Validate with adapter:check +### Step 6: Test with Debug Mode -Verify ACP compliance: +Use debug mode to verify JSONPath extraction: ```bash -bunx @plaited/acp-harness adapter:check \ - bunx @plaited/acp-harness headless --schema ./my-agent-headless.json +AGENT_API_KEY=... bunx @plaited/agent-eval-harness headless --schema ./my-agent-headless.json --debug ``` -All 6 checks should pass: -- `spawn` - Adapter launches -- `initialize` - Protocol handshake works -- `session/new` - Session creation works -- `session/prompt` - Prompt handling works -- `session/cancel` - Cancel is acknowledged -- `framing` - Valid JSON-RPC framing +Debug mode shows: +- Raw CLI output lines +- JSONPath match attempts +- Extracted values for each event ## Schema Field Reference @@ -218,7 +213,7 @@ All 6 checks should pass: **Not yet compatible:** [Copilot CLI](https://docs.github.com/en/copilot/concepts/agents/about-copilot-cli) (no JSON output) -> **Note:** For detailed ACP protocol questions during schema creation, use the `agent-client-protocol-docs` MCP server. 
See SKILL.md for configuration. +> **Note:** For help debugging schema issues during schema creation, see the Troubleshooting section below. ## Troubleshooting @@ -246,7 +241,7 @@ cat raw-output.jsonl | jq '.' ```bash # Test initialize and session creation printf '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":1}}\n{"jsonrpc":"2.0","id":2,"method":"session/new","params":{}}\n' | \ - bun src/headless-cli.ts --schema ./my-schema.json 2>&1 + bunx @plaited/agent-eval-harness headless --schema ./my-schema.json 2>&1 ``` **3. Common JSONPath issues:** @@ -271,6 +266,6 @@ Once your schema is working: 1. Run the integration test suite with your schema 2. Submit a PR to add it to the `schemas/` directory -3. Include the integration test file as `integration_tests/acp-.spec.ts` +3. Include the integration test file as `integration_tests/.spec.ts` Only schemas with passing integration tests are included in the official distribution. diff --git a/.claude/skills/acp-adapters/references/troubleshooting-guide.md b/.claude/skills/headless-adapters/references/troubleshooting-guide.md similarity index 95% rename from .claude/skills/acp-adapters/references/troubleshooting-guide.md rename to .claude/skills/headless-adapters/references/troubleshooting-guide.md index 0e00d1f..4b48baf 100644 --- a/.claude/skills/acp-adapters/references/troubleshooting-guide.md +++ b/.claude/skills/headless-adapters/references/troubleshooting-guide.md @@ -83,7 +83,7 @@ Use wildcard `[*]` syntax in JSONPath expressions to iterate over array items: cat raw-output.jsonl | jq '.message.content[] | select(.type == "tool_use")' ``` -5. **Update schema with correct paths** and test with `adapter:check` +5. **Update schema with correct paths** and test with `headless --debug` --- @@ -162,7 +162,7 @@ Use `stdin: true` when: ``` 3.
**Check process spawn in adapter:** - - Enable verbose mode: `adapter:check --verbose` + - Enable verbose mode: `headless --debug --verbose` - Look for command construction in output 4. **Update schema:** @@ -184,7 +184,7 @@ Use `stdin: true` when: Some CLIs have the output format embedded in the base command (e.g., `codex exec --json`), so the schema doesn't need separate `output.flag` and `output.value` fields. However, the `output` field is required by the adapter schema. -Before acp-harness 0.4.3, specifying empty output values would add two empty strings as command arguments: +Before agent-eval-harness 0.4.3, specifying empty output values would add two empty strings as command arguments: ```bash # Schema with empty output @@ -199,9 +199,9 @@ codex exec --json - "" "" ### Solution -**acp-harness 0.4.3+:** Empty output flags are automatically skipped - no changes needed. +**agent-eval-harness 0.4.3+:** Empty output flags are automatically skipped - no changes needed. -**acp-harness 0.4.2 and earlier:** Use a workaround by putting the output format in the command array: +**agent-eval-harness 0.4.2 and earlier:** Use a workaround by putting the output format in the command array: ```json { @@ -217,7 +217,7 @@ Even though Codex doesn't use these flags, this prevents empty strings from bein Use empty `output.flag` and `output.value` when: - CLI has output format embedded in command - No additional flags needed for JSON output -- acp-harness version is 0.4.3 or later +- agent-eval-harness version is 0.4.3 or later **Examples:** - Codex: `codex exec --json` (format is in command) @@ -463,7 +463,7 @@ Output events use a two-step process: This means: - Check if `$.type` equals `"message"` -- If yes, emit an ACP `message` update +- If yes, emit a session `message` update - Extract content from `$.text` ### Wildcard Matching @@ -565,8 +565,8 @@ The `result` configuration marks when the agent is done: 1. 
**Check if events are being matched at all:** ```bash - # Run adapter:check with verbose mode - adapter:check --verbose -- bunx @plaited/acp-harness headless --schema schema.json + # Run headless in debug mode + bunx @plaited/agent-eval-harness headless --schema schema.json --debug ``` 2. **Verify JSON structure matches your paths:** @@ -678,7 +678,7 @@ If you've tried all the debugging steps and still can't get your schema working, ``` 4. **Error messages:** - - From `headless --debug` + - From `headless --debug` - From `capture` command - From CLI stderr diff --git a/.claude/skills/acp-adapters/schemas/claude-headless.json b/.claude/skills/headless-adapters/schemas/claude-headless.json similarity index 100% rename from .claude/skills/acp-adapters/schemas/claude-headless.json rename to .claude/skills/headless-adapters/schemas/claude-headless.json diff --git a/.claude/skills/acp-adapters/schemas/gemini-headless.json b/.claude/skills/headless-adapters/schemas/gemini-headless.json similarity index 94% rename from .claude/skills/acp-adapters/schemas/gemini-headless.json rename to .claude/skills/headless-adapters/schemas/gemini-headless.json index 3c351da..acf2f00 100644 --- a/.claude/skills/acp-adapters/schemas/gemini-headless.json +++ b/.claude/skills/headless-adapters/schemas/gemini-headless.json @@ -21,7 +21,7 @@ "result": { "matchPath": "$.type", "matchValue": "result", - "contentPath": "$.stats" + "contentPath": "$.content" }, "historyTemplate": "User: {{input}}\nAssistant: {{output}}" } diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index c7012ff..81b3564 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -1,4 +1,4 @@ -# Code owners for acp-harness repository +# Code owners for agent-eval-harness repository # These users will be automatically requested for review when someone opens a pull request.
# See https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 2aeea54..ea55cde 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -16,14 +16,14 @@ jobs: permissions: pull-requests: read outputs: - acp: ${{ steps.filter.outputs.acp }} + src: ${{ steps.filter.outputs.src }} steps: - uses: actions/checkout@v4 - uses: dorny/paths-filter@v3 id: filter with: filters: | - acp: + src: - 'src/**' test-pr: @@ -46,14 +46,14 @@ jobs: # - GEMINI_API_KEY: API key for Gemini CLI integration tests test-integration: needs: changes - if: ${{ needs.changes.outputs.acp == 'true' }} + if: ${{ needs.changes.outputs.src == 'true' }} runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - - name: Run ACP integration tests + - name: Run integration tests env: ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }} - run: docker compose -f docker-compose.test.yml run --rm acp-test + run: docker compose -f docker-compose.test.yml run --rm test diff --git a/.mcp.json b/.mcp.json index e4338db..b0520ac 100644 --- a/.mcp.json +++ b/.mcp.json @@ -1,16 +1,8 @@ { "mcpServers": { - "agent-skills-spec": { - "type": "http", - "url": "https://agentskills.io/mcp" - }, "bun-docs" : { "type": "http", "url": "https://bun.com/docs/mcp" - }, - "agent-client-protocol": { - "type": "http", - "url": "https://agentclientprotocol.com/mcp" } } } diff --git a/.plaited/rules/code-review.md b/.plaited/rules/code-review.md index db316cc..f0d9c5e 100644 --- a/.plaited/rules/code-review.md +++ b/.plaited/rules/code-review.md @@ -92,7 +92,7 @@ For functions with more than two parameters, use a single object parameter: ```typescript // ✅ Good: Object parameter pattern -const createClient = ({ +const createSessionManager = ({ command, timeout, cwd, @@ -100,14 +100,14 @@ const createClient = ({ command: 
string[] timeout: number cwd?: string -}): ACPClient => { /* ... */ } +}): SessionManager => { /* ... */ } // ❌ Avoid: Multiple positional parameters -const createClient = ( +const createSessionManager = ( command: string[], timeout: number, cwd?: string -): ACPClient => { /* ... */ } +): SessionManager => { /* ... */ } ``` **Exception - CLI Entry Points:** CLI functions take `args: string[]` because that's what the shell provides—parsing happens inside the function. This rule applies to internal APIs where callers pass typed values directly. diff --git a/.plaited/rules/module-organization.md b/.plaited/rules/module-organization.md index cf3dbb8..46b6582 100644 --- a/.plaited/rules/module-organization.md +++ b/.plaited/rules/module-organization.md @@ -10,11 +10,11 @@ Use named re-export files at the parent level, matching the folder name: ``` src/ -├── acp/ # Feature module -│ ├── acp.types.ts -│ ├── acp.schemas.ts -│ └── acp.ts # Main implementation -├── acp.ts # Re-exports public API from acp/ +├── capture/ # Feature module +│ ├── capture.types.ts +│ ├── capture.schemas.ts +│ └── capture.ts # Main implementation +├── capture.ts # Re-exports public API from capture/ ├── utils/ │ └── format.ts └── utils.ts # Re-exports public API from utils/ @@ -26,9 +26,9 @@ When a package has one primary feature, expose that re-export file directly as m ```json { - "main": "src/acp.ts", + "main": "src/harness.ts", "exports": { - ".": "./src/acp.ts", + ".": "./src/harness.ts", "./utils": "./src/utils.ts" } } @@ -43,7 +43,7 @@ Always include `.ts` extensions in imports. 
Bun runs TypeScript natively—no co ```typescript // ✅ Good import { Config } from './module.types.ts' -import { createClient } from '../acp/acp.ts' +import { createClient } from '../capture/capture.ts' // ❌ Avoid import { Config } from './module.types' diff --git a/.plaited/rules/testing.md b/.plaited/rules/testing.md index 3d4eead..a39fe3e 100644 --- a/.plaited/rules/testing.md +++ b/.plaited/rules/testing.md @@ -31,12 +31,12 @@ Use `test` instead of `it` in test files for consistency: ```typescript // ✅ Good -test('should create ACP client correctly', () => { +test('should create session manager correctly', () => { // ... }) // ❌ Avoid -it('should create ACP client correctly', () => { +it('should create session manager correctly', () => { // ... }) ``` diff --git a/AGENTS.md b/AGENTS.md index 14acc37..2d32284 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -19,21 +19,21 @@ bun run check:write bun test # Run Docker integration tests (requires API keys) -ANTHROPIC_API_KEY=sk-... GEMINI_API_KEY=... docker compose -f docker-compose.test.yml run --rm acp-test +ANTHROPIC_API_KEY=sk-... GEMINI_API_KEY=... docker compose -f docker-compose.test.yml run --rm test ``` ## Quick Reference ### Package Overview -`@plaited/acp-harness` is a CLI tool for capturing agent trajectories from ACP-compatible agents. It executes prompts, captures full trajectories (tools, thoughts, plans), and outputs structured JSONL for downstream scoring. +`@plaited/agent-eval-harness` is a CLI tool for capturing agent trajectories from headless CLI agents. It executes prompts, captures full trajectories (tools, thoughts, plans), and outputs structured JSONL for downstream scoring. **CLI usage (with built-in headless adapter):** ```bash # Set API key and run capture with headless adapter (recommended) export ANTHROPIC_API_KEY=sk-... 
-bunx @plaited/acp-harness capture prompts.jsonl \ - bunx @plaited/acp-harness headless --schema .claude/skills/acp-adapters/schemas/claude-headless.json \ +bunx @plaited/agent-eval-harness capture prompts.jsonl \ + --schema .claude/skills/headless-adapters/schemas/claude-headless.json \ -o results.jsonl ``` @@ -53,39 +53,41 @@ bunx @plaited/acp-harness capture prompts.jsonl \ This project provides two AI agent skills in `.claude/skills/`: -### ACP Harness (`acp-harness`) +### Agent Eval Harness (`agent-eval-harness`) -CLI tool for capturing agent trajectories from ACP-compatible agents. +CLI tool for capturing agent trajectories from headless CLI agents. -**Commands:** `capture`, `trials`, `summarize`, `calibrate`, `validate-refs`, `balance`, `schemas` +**Core Commands:** `capture`, `trials`, `summarize`, `calibrate`, `validate-refs`, `balance`, `schemas` + +**Pipeline Commands (Unix-style):** `run`, `extract`, `grade`, `format`, `compare` **Use cases:** - Capturing trajectories for downstream evaluation - Generating training data (SFT/DPO) with full context - Building regression test fixtures for agent behavior +- Comparing agent responses across different configurations -See `.claude/skills/acp-harness/SKILL.md` for complete documentation. +See `.claude/skills/agent-eval-harness/SKILL.md` for complete documentation. -### ACP Adapters (`acp-adapters`) +### Headless Adapters (`headless-adapters`) -Discover, create, and validate ACP adapters for agent integration. +Discover, create, and validate headless adapters for agent integration. -**Commands:** `headless`, `adapter:scaffold`, `adapter:check` +**Commands:** `headless` **Use cases:** - Finding existing adapters for your agent - Wrapping headless CLI agents with schema-driven adapter -- Building custom ACP adapters from scratch -- Validating adapter ACP compliance +- Creating new schemas for CLI agents -See `.claude/skills/acp-adapters/SKILL.md` for complete documentation. 
+See `.claude/skills/headless-adapters/SKILL.md` for complete documentation. ### Installing Skills Install skills for AI coding agents: ```bash -curl -fsSL https://raw.githubusercontent.com/plaited/skills-installer/main/install.sh | bash -s -- --agent --project acp-harness +curl -fsSL https://raw.githubusercontent.com/plaited/skills-installer/main/install.sh | bash -s -- --agent --project agent-eval-harness ``` Replace `` with: `claude`, `cursor`, `copilot`, `opencode`, `amp`, `goose`, `factory` diff --git a/README.md b/README.md index c03da9d..1623206 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,10 @@ -# @plaited/acp-harness +# @plaited/agent-eval-harness -[![npm version](https://img.shields.io/npm/v/@plaited/acp-harness.svg)](https://www.npmjs.com/package/@plaited/acp-harness) -[![CI](https://github.com/plaited/acp-harness/actions/workflows/ci.yml/badge.svg)](https://github.com/plaited/acp-harness/actions/workflows/ci.yml) +[![npm version](https://img.shields.io/npm/v/@plaited/agent-eval-harness.svg)](https://www.npmjs.com/package/@plaited/agent-eval-harness) +[![CI](https://github.com/plaited/agent-eval-harness/actions/workflows/ci.yml/badge.svg)](https://github.com/plaited/agent-eval-harness/actions/workflows/ci.yml) [![License: ISC](https://img.shields.io/badge/License-ISC-blue.svg)](https://opensource.org/licenses/ISC) -CLI tool for capturing agent trajectories from ACP-compatible agents. Execute prompts, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. Available as both a CLI tool and as installable skills for AI coding agents. +CLI tool for capturing agent trajectories from headless CLI agents. Execute prompts, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. Available as both a CLI tool and as installable skills for AI coding agents. 
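Both input and output are JSONL, one record per line. A minimal prompts file might look like this — `id` and `input` follow the capture schema documented in the skills; `hint` is optional grader context:

```json
{"id": "greet-1", "input": "Say hello", "hint": "Response should contain a greeting"}
{"id": "math-1", "input": "What is 2 + 2?", "hint": "Answer should be 4"}
```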
## CLI Tool @@ -13,59 +13,72 @@ Use these tools directly via the CLI without installation: ```bash # Using built-in headless adapter (recommended - no extra install needed) export ANTHROPIC_API_KEY=sk-... -bunx @plaited/acp-harness capture prompts.jsonl \ - bunx @plaited/acp-harness headless --schema ./schemas/claude-headless.json \ +bunx @plaited/agent-eval-harness capture prompts.jsonl \ + --schema ./schemas/claude-headless.json \ -o results.jsonl - -# Or with an external ACP adapter -bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl ``` -**Prerequisite:** Set your API key. The `headless` command works with any CLI agent that supports JSON output - no adapter installation required: +**Prerequisite:** Set your API key. The harness works with any CLI agent that supports JSON output - just provide a schema describing how to interact with it: ```bash export ANTHROPIC_API_KEY=sk-... # For Claude export GEMINI_API_KEY=... # For Gemini ``` -Pre-built schemas are available in `.claude/skills/acp-adapters/schemas/` for Claude and Gemini. +Pre-built schemas are available in `.claude/skills/headless-adapters/schemas/` for Claude and Gemini. 
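The `capture` and `trials` commands accept a `--grader ./grader.ts` module for semantic pass/fail. A minimal sketch — the export name and result fields here are assumptions for illustration, not the harness's documented contract:

```typescript
// Hypothetical grader module. Assumed: the harness calls an exported function
// with the captured result and expects a GraderResult-like object back.
// Field names (pass, score, reason) are illustrative assumptions.
type CaptureLike = { input: string | string[]; output: string; hint?: string }

export const grade = ({ output, hint }: CaptureLike) => {
  // Trivial heuristic: pass when the hint text appears in the output.
  const pass = hint ? output.toLowerCase().includes(hint.toLowerCase()) : output.length > 0
  return { pass, score: pass ? 1 : 0, reason: pass ? 'hint matched' : 'hint not found' }
}
```

A real grader would usually call an LLM judge or run project-specific checks; the point is only the input/output shape.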
-### Commands +### Core Commands | Command | Description | |---------|-------------| -| `capture ` | Trajectory capture (full JSONL) | -| `trials ` | Multi-run with pass@k metrics | +| `capture --schema ` | Trajectory capture (full JSONL) | +| `trials --schema ` | Multi-run with pass@k metrics | | `summarize ` | Derive compact views from results | | `calibrate ` | Sample failures for review | | `validate-refs ` | Check reference solutions | | `balance ` | Analyze test set coverage | | `schemas [name]` | Export JSON schemas | | `headless --schema ` | Schema-driven adapter for any CLI agent | -| `adapter:check ` | Validate adapter ACP compliance | + +### Pipeline Commands (Unix-style) + +| Command | Description | +|---------|-------------| +| `run --schema ` | Execute prompts, output raw results | +| `extract --schema ` | Parse raw output into trajectories | +| `grade --grader ` | Apply grader to extracted results | +| `format --style