From 0011714ac23ad9f59975697dde57ae964aa2251f Mon Sep 17 00:00:00 2001 From: Edward Irby Date: Wed, 21 Jan 2026 13:12:55 -0800 Subject: [PATCH 01/13] refactor!: remove ACP layer, rename to agent-eval-harness MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit BREAKING CHANGE: Package renamed from @plaited/acp-harness to @plaited/agent-eval-harness Major changes: - Remove ACP SDK dependency and all ACP protocol handling - Capture/trials now use headless session manager directly - Add debug mode (--debug) for verbose JSONPath matching output - Add exit code/signal tracking with ProcessExitInfo type - Add schema v2 support with timeout field Skill renames: - acp-harness → agent-eval-harness - acp-adapters → headless-adapters CLI changes: - capture/trials now require --schema flag (no positional agent command) - Remove adapter:check and adapter:scaffold commands Co-Authored-By: Claude Opus 4.5 --- .../SKILL.md | 90 +- .../assets/Dockerfile.acp | 8 +- .../assets/docker-compose.acp.yml | 10 +- .../references/docker-evals.md | 0 .../references/downstream.md | 24 +- .../references/eval-concepts.md | 10 +- .../references/graders.md | 8 +- .../references/output-formats.md | 12 +- .../SKILL.md | 83 +- .../references/schema-creation-guide.md | 25 +- .../references/troubleshooting-guide.md | 10 +- .../schemas/claude-headless.json | 0 .../schemas/gemini-headless.json | 0 AGENTS.md | 25 +- README.md | 6 - bin/cli.ts | 49 +- bin/tests/cli.spec.ts | 30 +- package.json | 15 +- src/acp-client.ts | 507 ---------- src/acp-helpers.ts | 121 --- src/acp-transport.ts | 462 --------- src/acp-utils.ts | 341 ------- src/acp.ts | 27 - src/adapter-check.ts | 541 ---------- src/adapter-scaffold.ts | 935 ------------------ src/capture.ts | 401 ++++---- src/headless-cli.ts | 24 +- src/headless-session-manager.ts | 143 ++- src/headless.schemas.ts | 95 +- src/headless.ts | 7 +- src/integration_tests/acp-claude.spec.ts | 170 ---- 
src/integration_tests/acp-gemini.spec.ts | 174 ---- src/schemas-cli.ts | 12 +- src/schemas.ts | 71 +- src/tests/acp-client.spec.ts | 205 ---- src/tests/acp-helpers.spec.ts | 105 -- src/tests/acp-transport.spec.ts | 153 --- src/tests/acp-utils.spec.ts | 394 -------- src/tests/adapter-check.spec.ts | 70 -- src/tests/adapter-scaffold.spec.ts | 112 --- src/tests/capture-cli.spec.ts | 14 +- src/tests/capture-helpers.spec.ts | 295 +----- src/tests/headless.spec.ts | 6 +- src/tests/schemas.spec.ts | 51 - src/tests/trials-cli.spec.ts | 16 +- src/trials.ts | 258 ++--- 46 files changed, 812 insertions(+), 5303 deletions(-) rename .claude/skills/{acp-harness => agent-eval-harness}/SKILL.md (79%) rename .claude/skills/{acp-harness => agent-eval-harness}/assets/Dockerfile.acp (59%) rename .claude/skills/{acp-harness => agent-eval-harness}/assets/docker-compose.acp.yml (52%) rename .claude/skills/{acp-harness => agent-eval-harness}/references/docker-evals.md (100%) rename .claude/skills/{acp-harness => agent-eval-harness}/references/downstream.md (91%) rename .claude/skills/{acp-harness => agent-eval-harness}/references/eval-concepts.md (93%) rename .claude/skills/{acp-harness => agent-eval-harness}/references/graders.md (93%) rename .claude/skills/{acp-harness => agent-eval-harness}/references/output-formats.md (94%) rename .claude/skills/{acp-adapters => headless-adapters}/SKILL.md (63%) rename .claude/skills/{acp-adapters => headless-adapters}/references/schema-creation-guide.md (92%) rename .claude/skills/{acp-adapters => headless-adapters}/references/troubleshooting-guide.md (97%) rename .claude/skills/{acp-adapters => headless-adapters}/schemas/claude-headless.json (100%) rename .claude/skills/{acp-adapters => headless-adapters}/schemas/gemini-headless.json (100%) delete mode 100644 src/acp-client.ts delete mode 100644 src/acp-helpers.ts delete mode 100644 src/acp-transport.ts delete mode 100644 src/acp-utils.ts delete mode 100644 src/acp.ts delete mode 100644 
src/adapter-check.ts delete mode 100644 src/adapter-scaffold.ts delete mode 100644 src/integration_tests/acp-claude.spec.ts delete mode 100644 src/integration_tests/acp-gemini.spec.ts delete mode 100644 src/tests/acp-client.spec.ts delete mode 100644 src/tests/acp-helpers.spec.ts delete mode 100644 src/tests/acp-transport.spec.ts delete mode 100644 src/tests/acp-utils.spec.ts delete mode 100644 src/tests/adapter-check.spec.ts delete mode 100644 src/tests/adapter-scaffold.spec.ts diff --git a/.claude/skills/acp-harness/SKILL.md b/.claude/skills/agent-eval-harness/SKILL.md similarity index 79% rename from .claude/skills/acp-harness/SKILL.md rename to .claude/skills/agent-eval-harness/SKILL.md index 053d7d2..55ef314 100644 --- a/.claude/skills/acp-harness/SKILL.md +++ b/.claude/skills/agent-eval-harness/SKILL.md @@ -1,20 +1,20 @@ --- -name: acp-harness -description: CLI tool for capturing agent trajectories. Execute prompts against ACP-compatible agents, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. +name: agent-eval-harness +description: CLI tool for capturing agent trajectories. Execute prompts against headless CLI agents via schema-driven adapters, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. compatibility: Bun >= 1.2.9 --- -# ACP Harness +# Agent Eval Harness ## Purpose -CLI tool for capturing trajectories from ACP-compatible agents, optimized for TypeScript/JavaScript projects using Bun. +CLI tool for capturing trajectories from headless CLI agents, optimized for TypeScript/JavaScript projects using Bun. **The harness captures. 
You score.** | Harness Provides | You Provide | |------------------|-------------| -| Prompt execution against ACP agents | Scoring logic (Braintrust, custom scripts) | +| Prompt execution via headless adapters | Scoring logic (Braintrust, custom scripts) | | Full trajectory capture (thoughts, tools, plans) | Pass/fail determination via graders | | Structured JSONL output | LLM-as-judge prompts | | Reproducible execution environment | CI integration, golden file comparison | @@ -29,10 +29,10 @@ CLI tool for capturing trajectories from ACP-compatible agents, optimized for Ty ```bash # Run without installing (recommended) -bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl +bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl # Or install as project dependency -bun add @plaited/acp-harness +bun add @plaited/agent-eval-harness ``` ## Core Principle: Capture Once, Derive Many Views @@ -40,7 +40,7 @@ bun add @plaited/acp-harness ```mermaid flowchart LR Prompts["prompts.jsonl"] --> Capture["capture/trials"] - Agent["ACP Agent"] --> Capture + Schema["headless schema"] --> Capture Capture --> Results["results.jsonl (full trajectory)"] Results --> Summarize["summarize"] Results --> Calibrate["calibrate"] @@ -58,8 +58,8 @@ flowchart LR | Command | Input | Output | Purpose | |---------|-------|--------|---------| -| `capture` | prompts.jsonl + agent | results.jsonl | Trajectory capture (full) | -| `trials` | prompts.jsonl + agent | trials.jsonl | Multi-run + optional metrics | +| `capture` | prompts.jsonl + schema | results.jsonl | Trajectory capture (full) | +| `trials` | prompts.jsonl + schema | trials.jsonl | Multi-run + optional metrics | | `summarize` | results.jsonl | summary.jsonl or .md | Derive compact views | | `calibrate` | results.jsonl | calibration.md | Sample failures for review | | `validate-refs` | prompts.jsonl | validation.jsonl | Check reference solutions | @@ -73,7 +73,7 @@ All 
commands support optional `--grader ./grader.ts` for scoring. ### Basic Usage ```bash -bunx @plaited/acp-harness capture [args...] [options] +bunx @plaited/agent-eval-harness capture --schema [options] ``` ### Arguments @@ -81,25 +81,26 @@ bunx @plaited/acp-harness capture [args...] [options] | Argument/Flag | Description | Default | |------|-------------|---------| | `prompts.jsonl` | Input file with prompts to execute | Required | -| `command [args]` | ACP agent command (e.g., `bunx claude-code-acp`) | Required | +| `-s, --schema` | Path to headless adapter schema | Required | | `-o, --output` | Output file/path | stdout | -| `-c, --cwd` | Working directory for agent (agents auto-discover MCP configs from here) | current | +| `-c, --cwd` | Working directory for agent | current | | `-t, --timeout` | Request timeout in ms | `60000` | | `--progress` | Show progress to stderr | false | | `--append` | Append to output file | false | | `-g, --grader` | Path to grader module | none | +| `--debug` | Show detailed CLI output for debugging | false | ### Examples ```bash # Basic capture -bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl +bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl # Using a local adapter script -bunx @plaited/acp-harness capture prompts.jsonl bun ./my-adapter.ts -o results.jsonl +bunx @plaited/agent-eval-harness capture prompts.jsonl bun ./my-adapter.ts -o results.jsonl # With grader (adds score to each result) -bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.ts -o results.jsonl +bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.ts -o results.jsonl ``` ## Trials Command @@ -108,10 +109,10 @@ Run each prompt multiple times for pass@k/pass^k analysis. 
```bash # Capture only (no grader) -bunx @plaited/acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 -o trials.jsonl +bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -o trials.jsonl # With grader (computes pass@k, pass^k) -bunx @plaited/acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.ts -o trials.jsonl +bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 --grader ./grader.ts -o trials.jsonl ``` ### Output @@ -132,10 +133,10 @@ Derive compact views from full trajectory results. ```bash # Summary JSONL (for jq analysis) -bunx @plaited/acp-harness summarize results.jsonl -o summary.jsonl +bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl # Markdown (for LLM-as-judge) -bunx @plaited/acp-harness summarize results.jsonl --markdown -o results.md +bunx @plaited/agent-eval-harness summarize results.jsonl --markdown -o results.md ``` ## Calibrate Command @@ -144,10 +145,10 @@ Sample failures for grader review. Calibration helps you distinguish between **a ```bash # Sample failures for human review -bunx @plaited/acp-harness calibrate results.jsonl --sample 10 -o calibration.md +bunx @plaited/agent-eval-harness calibrate results.jsonl --sample 10 -o calibration.md # Re-score with different grader to compare -bunx @plaited/acp-harness calibrate results.jsonl --grader ./loose-grader.ts --sample 10 -o comparison.md +bunx @plaited/agent-eval-harness calibrate results.jsonl --grader ./loose-grader.ts --sample 10 -o comparison.md ``` See [eval-concepts.md](references/eval-concepts.md#grader-calibration) for why calibration matters. @@ -158,7 +159,7 @@ Check that reference solutions pass your grader before evaluating agents. 
```bash # Validate reference solutions -bunx @plaited/acp-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl +bunx @plaited/agent-eval-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl # Check for failures cat validation.jsonl | jq 'select(.pass == false)' @@ -193,10 +194,10 @@ Analyze test set coverage to ensure balanced evaluation. ```bash # Analyze prompt distribution -bunx @plaited/acp-harness balance prompts.jsonl -o balance.json +bunx @plaited/agent-eval-harness balance prompts.jsonl -o balance.json # Pretty print -bunx @plaited/acp-harness balance prompts.jsonl | jq . +bunx @plaited/agent-eval-harness balance prompts.jsonl | jq . ``` ### Why Use This? @@ -241,15 +242,15 @@ Export JSON schemas for non-TypeScript tools. ```bash # List available schemas -bunx @plaited/acp-harness schemas +bunx @plaited/agent-eval-harness schemas # Export all schemas as JSON -bunx @plaited/acp-harness schemas --json -o schemas.json +bunx @plaited/agent-eval-harness schemas --json -o schemas.json # Export specific schema -bunx @plaited/acp-harness schemas CaptureResult --json -bunx @plaited/acp-harness schemas TrialResult --json -bunx @plaited/acp-harness schemas GraderResult --json +bunx @plaited/agent-eval-harness schemas CaptureResult --json +bunx @plaited/agent-eval-harness schemas TrialResult --json +bunx @plaited/agent-eval-harness schemas GraderResult --json ``` ### Available Schemas @@ -269,7 +270,7 @@ Export schemas for validation in Python, Go, etc.: ```bash # Export all schemas -bunx @plaited/acp-harness schemas --json -o schemas.json +bunx @plaited/agent-eval-harness schemas --json -o schemas.json # Use in Python with jsonschema python -c " @@ -295,7 +296,7 @@ Graders provide semantic pass/fail scoring for captured trajectories. 
The harnes ```typescript // my-grader.ts -import type { Grader } from '@plaited/acp-harness/schemas' +import type { Grader } from '@plaited/agent-eval-harness/schemas' export const grade: Grader = async ({ input, output, hint, trajectory }) => { const pass = output.toLowerCase().includes(hint?.toLowerCase() ?? '') @@ -331,7 +332,7 @@ print(json.dumps({ ```bash chmod +x ./grader.py -bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py -o results.jsonl +bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.py -o results.jsonl ``` See [graders.md](references/graders.md) for complete polyglot grader documentation including shell scripts and LLM-as-judge patterns. @@ -375,7 +376,7 @@ Full trajectory JSONL (always): ], "metadata": { "category": "search", - "agent": "bunx claude-code-acp", + "agent": "--schema ./claude.json", "trajectoryRichness": "full", "turnCount": 1 }, @@ -413,7 +414,7 @@ Full trajectory JSONL (always): Consumers can import Zod schemas directly: ```typescript -import { CaptureResultSchema, TrialResultSchema } from '@plaited/acp-harness/schemas' +import { CaptureResultSchema, TrialResultSchema } from '@plaited/agent-eval-harness/schemas' // Validate external data const result = CaptureResultSchema.parse(jsonData) @@ -426,8 +427,8 @@ const jsonSchema = z.toJSONSchema(CaptureResultSchema) Or export JSON schemas for non-TypeScript tools: ```bash -bunx @plaited/acp-harness schemas --json -o schemas.json -bunx @plaited/acp-harness schemas CaptureResult --json +bunx @plaited/agent-eval-harness schemas --json -o schemas.json +bunx @plaited/agent-eval-harness schemas CaptureResult --json ``` ## Execution Environment @@ -465,13 +466,13 @@ Run with the headless adapter: ```bash # Using Claude Code via headless adapter -bunx @plaited/acp-harness capture multi-turn.jsonl \ - bunx @plaited/acp-harness headless --schema ./claude-headless.json \ +bunx @plaited/agent-eval-harness capture 
multi-turn.jsonl \ + bunx @plaited/agent-eval-harness headless --schema ./claude-headless.json \ -o results.jsonl # Using Gemini CLI via headless adapter -GEMINI_API_KEY=... bunx @plaited/acp-harness capture multi-turn.jsonl \ - bunx @plaited/acp-harness headless --schema ./gemini-headless.json \ +GEMINI_API_KEY=... bunx @plaited/agent-eval-harness capture multi-turn.jsonl \ + bunx @plaited/agent-eval-harness headless --schema ./gemini-headless.json \ -o results.jsonl ``` @@ -493,7 +494,7 @@ cat results.jsonl | jq 'select(.metadata.category == "ui")' cat results.jsonl | jq -s 'map(.trajectory | map(select(.type == "tool_call")) | length) | add' # Summarize for quick analysis -bunx @plaited/acp-harness summarize results.jsonl -o summary.jsonl +bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl ``` See [downstream.md](references/downstream.md) for integration patterns with Braintrust, Gemini, and custom scorers. @@ -502,7 +503,7 @@ See [downstream.md](references/downstream.md) for integration patterns with Brai | Resource | Description | |----------|-------------| -| `bunx @plaited/acp-harness` | CLI help | +| `bunx @plaited/agent-eval-harness` | CLI help | | [output-formats.md](references/output-formats.md) | JSONL schemas, command details | | [downstream.md](references/downstream.md) | Integration patterns (Braintrust, jq, custom scorers) | | [graders.md](references/graders.md) | Polyglot grader documentation (TypeScript, Python, shell) | @@ -511,5 +512,4 @@ See [downstream.md](references/downstream.md) for integration patterns with Brai ## Related -- **[@agentclientprotocol/sdk](https://www.npmjs.com/package/@agentclientprotocol/sdk)** - ACP SDK for programmatic access -- **[@zed-industries/claude-code-acp](https://www.npmjs.com/package/@zed-industries/claude-code-acp)** - Claude Code ACP adapter +- **[headless-adapters skill](../headless-adapters/SKILL.md)** - Schema-driven adapters for headless CLI agents diff --git 
a/.claude/skills/acp-harness/assets/Dockerfile.acp b/.claude/skills/agent-eval-harness/assets/Dockerfile.acp similarity index 59% rename from .claude/skills/acp-harness/assets/Dockerfile.acp rename to .claude/skills/agent-eval-harness/assets/Dockerfile.acp index 14e6a8a..f1241e5 100644 --- a/.claude/skills/acp-harness/assets/Dockerfile.acp +++ b/.claude/skills/agent-eval-harness/assets/Dockerfile.acp @@ -1,11 +1,11 @@ -# ACP Harness Docker Configuration +# Agent Eval Harness Docker Configuration # -# Example Dockerfile for running ACP evaluations in an isolated container. +# Example Dockerfile for running agent evaluations in an isolated container. # Copy this to your project and customize as needed. # # Usage: -# docker build -f Dockerfile.acp -t acp-harness . -# docker run --rm -e ANTHROPIC_API_KEY acp-harness bunx @plaited/acp-harness prompts.jsonl +# docker build -f Dockerfile.acp -t agent-eval-harness . +# docker run --rm -e ANTHROPIC_API_KEY agent-eval-harness bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json FROM oven/bun:1.2.9 diff --git a/.claude/skills/acp-harness/assets/docker-compose.acp.yml b/.claude/skills/agent-eval-harness/assets/docker-compose.acp.yml similarity index 52% rename from .claude/skills/acp-harness/assets/docker-compose.acp.yml rename to .claude/skills/agent-eval-harness/assets/docker-compose.acp.yml index bc867dc..f41cd6a 100644 --- a/.claude/skills/acp-harness/assets/docker-compose.acp.yml +++ b/.claude/skills/agent-eval-harness/assets/docker-compose.acp.yml @@ -1,13 +1,13 @@ -# ACP Harness Docker Compose Configuration +# Agent Eval Harness Docker Compose Configuration # -# Example docker-compose for running ACP evaluations. +# Example docker-compose for running agent evaluations. # Copy this to your project and customize as needed. # # Usage: -# ANTHROPIC_API_KEY=sk-... docker compose -f docker-compose.acp.yml run --rm acp-harness +# ANTHROPIC_API_KEY=sk-... 
docker compose -f docker-compose.acp.yml run --rm agent-eval-harness services: - acp-harness: + agent-eval-harness: build: context: . dockerfile: Dockerfile.acp @@ -16,4 +16,4 @@ services: volumes: # Mount output directory to persist results - ./results:/app/results - command: ["bunx", "@plaited/acp-harness", "prompts.jsonl", "-o", "results/output.jsonl"] + command: ["bunx", "@plaited/agent-eval-harness", "capture", "prompts.jsonl", "--schema", "./claude.json", "-o", "results/output.jsonl"] diff --git a/.claude/skills/acp-harness/references/docker-evals.md b/.claude/skills/agent-eval-harness/references/docker-evals.md similarity index 100% rename from .claude/skills/acp-harness/references/docker-evals.md rename to .claude/skills/agent-eval-harness/references/docker-evals.md diff --git a/.claude/skills/acp-harness/references/downstream.md b/.claude/skills/agent-eval-harness/references/downstream.md similarity index 91% rename from .claude/skills/acp-harness/references/downstream.md rename to .claude/skills/agent-eval-harness/references/downstream.md index 691337c..b9160dd 100644 --- a/.claude/skills/acp-harness/references/downstream.md +++ b/.claude/skills/agent-eval-harness/references/downstream.md @@ -21,7 +21,7 @@ Use `summarize` command for quick jq analysis: ```bash # First derive summary -acp-harness summarize results.jsonl -o summary.jsonl +agent-eval-harness summarize results.jsonl -o summary.jsonl # Calculate average duration cat summary.jsonl | jq -s 'map(.duration) | add / length' @@ -172,7 +172,7 @@ Use the `--grader` flag to add scoring to capture results. The harness supports ```typescript // my-grader.ts -import type { Grader } from '@plaited/acp-harness/schemas' +import type { Grader } from '@plaited/agent-eval-harness/schemas' export const grade: Grader = async ({ input, output, hint, trajectory }) => { const pass = output.toLowerCase().includes(hint?.toLowerCase() ?? 
'') @@ -185,7 +185,7 @@ export const grade: Grader = async ({ input, output, hint, trajectory }) => { ``` ```bash -acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./my-grader.ts -o results.jsonl +agent-eval-harness capture prompts.jsonl bunx claude-code-acp --grader ./my-grader.ts -o results.jsonl ``` ### Python Grader @@ -228,7 +228,7 @@ print(json.dumps({ ```bash chmod +x ./grader.py -acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py -o results.jsonl +agent-eval-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py -o results.jsonl ``` ### Detection Logic @@ -301,7 +301,7 @@ Use markdown summary for smaller context: import Anthropic from '@anthropic-ai/sdk' // Generate markdown summary first -await Bun.$`acp-harness summarize results.jsonl --markdown -o results.md` +await Bun.$`agent-eval-harness summarize results.jsonl --markdown -o results.md` const client = new Anthropic() const markdown = await Bun.file('results.md').text() @@ -376,21 +376,21 @@ jobs: run: npm install -g @zed-industries/claude-code-acp - name: Install dependencies - run: bun add @plaited/acp-harness + run: bun add @plaited/agent-eval-harness - name: Run harness env: ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} run: | - bunx @plaited/acp-harness capture prompts.jsonl \ + bunx @plaited/agent-eval-harness capture prompts.jsonl \ bunx claude-code-acp \ --progress \ -o results.jsonl - name: Generate summary run: | - bunx @plaited/acp-harness summarize results.jsonl -o summary.jsonl - bunx @plaited/acp-harness summarize results.jsonl --markdown -o results.md + bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl + bunx @plaited/agent-eval-harness summarize results.jsonl --markdown -o results.md - name: Upload results uses: actions/upload-artifact@v4 @@ -408,8 +408,8 @@ Combine multiple runs: ```bash # Append mode during runs -acp-harness capture prompts-1.jsonl bunx claude-code-acp --append -o combined.jsonl 
-acp-harness capture prompts-2.jsonl bunx claude-code-acp --append -o combined.jsonl +agent-eval-harness capture prompts-1.jsonl bunx claude-code-acp --append -o combined.jsonl +agent-eval-harness capture prompts-2.jsonl bunx claude-code-acp --append -o combined.jsonl # Merge separate files cat run1.jsonl run2.jsonl run3.jsonl > combined.jsonl @@ -422,7 +422,7 @@ cat run1.jsonl run2.jsonl run3.jsonl > combined.jsonl Use the `trials` command to measure pass@k/pass^k: ```bash -acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.ts -o trials.jsonl +agent-eval-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.ts -o trials.jsonl ``` ```typescript diff --git a/.claude/skills/acp-harness/references/eval-concepts.md b/.claude/skills/agent-eval-harness/references/eval-concepts.md similarity index 93% rename from .claude/skills/acp-harness/references/eval-concepts.md rename to .claude/skills/agent-eval-harness/references/eval-concepts.md index 5060f78..e5d6f70 100644 --- a/.claude/skills/acp-harness/references/eval-concepts.md +++ b/.claude/skills/agent-eval-harness/references/eval-concepts.md @@ -55,7 +55,7 @@ Run a prompt 5 times, 3 pass (60% raw pass rate): ```bash # Run many trials to assess capability -acp-harness trials new-prompts.jsonl bunx agent -k 10 --grader ./grader.ts -o capability.jsonl +agent-eval-harness trials new-prompts.jsonl bunx agent -k 10 --grader ./grader.ts -o capability.jsonl # Analyze results cat capability.jsonl | jq 'select(.passAtK > 0.9) | {id, passAtK}' @@ -70,7 +70,7 @@ Questions answered: ```bash # Run fewer trials for known-good tasks -acp-harness trials regression-suite.jsonl bunx agent -k 3 --grader ./grader.ts -o regression.jsonl +agent-eval-harness trials regression-suite.jsonl bunx agent -k 3 --grader ./grader.ts -o regression.jsonl # Fail CI if reliability drops cat regression.jsonl | jq -e 'all(.passExpK > 0.8)' @@ -118,7 +118,7 @@ Week 3: 60% grader pass → 80% actually correct (grader 
rejected 20% valid) ```bash # Sample 10 failures for human review -acp-harness calibrate results.jsonl --sample 10 -o calibration.md +agent-eval-harness calibrate results.jsonl --sample 10 -o calibration.md ``` Review the markdown output and label each sample: @@ -171,7 +171,7 @@ Reference solutions prove a task is solvable before blaming the agent. ```bash # Check that reference solutions pass your grader -acp-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl +agent-eval-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl # If references fail, your grader or task is broken cat validation.jsonl | jq 'select(.pass == false)' @@ -199,7 +199,7 @@ An eval with only "make X work" misses "don't break Y". ### Using the Balance Command ```bash -acp-harness balance prompts.jsonl -o balance.json +agent-eval-harness balance prompts.jsonl -o balance.json ``` Analyzes: diff --git a/.claude/skills/acp-harness/references/graders.md b/.claude/skills/agent-eval-harness/references/graders.md similarity index 93% rename from .claude/skills/acp-harness/references/graders.md rename to .claude/skills/agent-eval-harness/references/graders.md index a2457e0..fc187af 100644 --- a/.claude/skills/acp-harness/references/graders.md +++ b/.claude/skills/agent-eval-harness/references/graders.md @@ -17,7 +17,7 @@ Export a `grade` function matching the `Grader` type: ```typescript // my-grader.ts -import type { Grader } from '@plaited/acp-harness/schemas' +import type { Grader } from '@plaited/agent-eval-harness/schemas' export const grade: Grader = async ({ input, output, hint, trajectory }) => { // Your scoring logic @@ -32,7 +32,7 @@ export const grade: Grader = async ({ input, output, hint, trajectory }) => { **Usage:** ```bash -acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./my-grader.ts -o results.jsonl +agent-eval-harness capture prompts.jsonl bunx claude-code-acp --grader ./my-grader.ts -o results.jsonl ``` ## Python 
Grader @@ -69,7 +69,7 @@ print(json.dumps({ **Usage:** ```bash chmod +x ./grader.py -acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py -o results.jsonl +agent-eval-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py -o results.jsonl ``` ## Executable Protocol @@ -141,7 +141,7 @@ Wrap an LLM call in your grader for semantic evaluation: ```typescript // llm-judge.ts import Anthropic from '@anthropic-ai/sdk' -import type { Grader } from '@plaited/acp-harness/schemas' +import type { Grader } from '@plaited/agent-eval-harness/schemas' const client = new Anthropic() diff --git a/.claude/skills/acp-harness/references/output-formats.md b/.claude/skills/agent-eval-harness/references/output-formats.md similarity index 94% rename from .claude/skills/acp-harness/references/output-formats.md rename to .claude/skills/agent-eval-harness/references/output-formats.md index 8cab355..a73b65e 100644 --- a/.claude/skills/acp-harness/references/output-formats.md +++ b/.claude/skills/agent-eval-harness/references/output-formats.md @@ -7,7 +7,7 @@ The harness uses a "capture once, derive many views" approach. 
The `capture` com The `capture` command always outputs full trajectory JSONL: ```bash -acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl +agent-eval-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl ``` ### Schema @@ -63,7 +63,7 @@ type GraderResult = { The `summarize` command derives compact JSONL from full trajectory: ```bash -acp-harness summarize results.jsonl -o summary.jsonl +agent-eval-harness summarize results.jsonl -o summary.jsonl ``` ### Schema @@ -103,7 +103,7 @@ cat summary.jsonl | jq 'select(.output | contains("error"))' The `summarize` command can also produce markdown for LLM-as-judge workflows: ```bash -acp-harness summarize results.jsonl --markdown -o results.md +agent-eval-harness summarize results.jsonl --markdown -o results.md ``` ### Structure @@ -147,7 +147,7 @@ acp-harness summarize results.jsonl --markdown -o results.md The `trials` command produces per-prompt trial results: ```bash -acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.ts -o trials.jsonl +agent-eval-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.ts -o trials.jsonl ``` ### Schema @@ -224,7 +224,7 @@ The `toolErrors` field indicates whether any tool calls failed during execution: **Note:** `toolErrors` only indicates tool-level failures. 
For semantic pass/fail (did the agent accomplish the task?), use a grader: ```bash -acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.ts -o results.jsonl +agent-eval-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.ts -o results.jsonl ``` ## Input Format @@ -250,7 +250,7 @@ All commands stream output line-by-line as results complete: ```bash # Watch results in real-time -acp-harness capture prompts.jsonl bunx claude-code-acp --progress -o results.jsonl & +agent-eval-harness capture prompts.jsonl bunx claude-code-acp --progress -o results.jsonl & tail -f results.jsonl ``` diff --git a/.claude/skills/acp-adapters/SKILL.md b/.claude/skills/headless-adapters/SKILL.md similarity index 63% rename from .claude/skills/acp-adapters/SKILL.md rename to .claude/skills/headless-adapters/SKILL.md index 0db186a..7242d8e 100644 --- a/.claude/skills/acp-adapters/SKILL.md +++ b/.claude/skills/headless-adapters/SKILL.md @@ -1,10 +1,10 @@ --- -name: acp-adapters -description: Discover, create, and validate ACP adapters for agent integration. Includes scaffolding tools and compliance testing for the Agent Client Protocol. +name: headless-adapters +description: Discover, create, and validate headless adapters for agent integration. Includes scaffolding tools and compliance testing for the Agent Client Protocol. compatibility: Bun >= 1.2.9 --- -# ACP Adapters +# Headless Adapters ## Purpose @@ -13,7 +13,6 @@ Schema-driven adapter for headless CLI agents. **No code required** - just defin | Use Case | Tool | |----------|------| | Wrap headless CLI agent | `headless` command | -| Verify implementation | `adapter:check` command | | Create new schemas | [Schema Creation Guide](references/schema-creation-guide.md) | ## Quick Start @@ -21,21 +20,17 @@ Schema-driven adapter for headless CLI agents. **No code required** - just defin 1. **Check if a schema exists** in [schemas/](schemas/) 2. **Run the adapter:** ```bash - ANTHROPIC_API_KEY=... 
bunx @plaited/acp-harness headless --schema .claude/skills/acp-adapters/schemas/claude-headless.json - ``` -3. **Validate compliance:** - ```bash - bunx @plaited/acp-harness adapter:check bunx @plaited/acp-harness headless --schema ./my-schema.json + ANTHROPIC_API_KEY=... bunx @plaited/agent-eval-harness headless --schema .claude/skills/headless-adapters/schemas/claude-headless.json ``` ## CLI Commands ### headless -Schema-driven ACP adapter for ANY headless CLI agent. +Schema-driven adapter for ANY headless CLI agent. ```bash -bunx @plaited/acp-harness headless --schema +bunx @plaited/agent-eval-harness headless --schema ``` **Options:** @@ -82,46 +77,19 @@ Both modes support multi-turn conversations. Send multiple prompts to the same s ```typescript // Create one session, send multiple prompts -const session = await client.createSession({ cwd: PROJECT_ROOT }) +const session = await manager.createSession({ cwd: PROJECT_ROOT }) // Turn 1 -const { updates: turn1 } = await client.promptSync(session.id, createPrompt('Remember: 42')) +const turn1 = await manager.prompt(session.id, 'Remember: 42') // Turn 2 - context is maintained -const { updates: turn2 } = await client.promptSync(session.id, createPrompt('What number?')) +const turn2 = await manager.prompt(session.id, 'What number?') ``` How context is preserved: - **stream mode:** Process stays alive, CLI maintains internal state - **iterative mode:** Adapter builds history using `historyTemplate` from schema ---- - -### adapter:check - -Validate that an adapter implements the ACP protocol correctly. - -```bash -bunx @plaited/acp-harness adapter:check [args...] 
-``` - -**Options:** -| Flag | Description | Default | -|------|-------------|---------| -| `--timeout` | Timeout for each check in ms | `5000` | -| `--verbose` | Show detailed protocol messages | false | - -**Checks Performed:** - -| Check | Description | -|-------|-------------| -| `spawn` | Adapter can be launched as subprocess | -| `initialize` | Responds to initialize with valid `agentCapabilities` | -| `session/new` | Creates session and returns `sessionId` | -| `session/prompt` | Accepts prompt and emits `session/update` notifications | -| `session/cancel` | Accepts cancel notification gracefully | -| `framing` | All messages are newline-delimited JSON-RPC 2.0 | - ## Pre-built Schemas Tested schemas are available in [schemas/](schemas/): @@ -134,10 +102,10 @@ Tested schemas are available in [schemas/](schemas/): **Usage:** ```bash # Claude Code -ANTHROPIC_API_KEY=... bunx @plaited/acp-harness headless --schema .claude/skills/acp-adapters/schemas/claude-headless.json +ANTHROPIC_API_KEY=... bunx @plaited/agent-eval-harness headless --schema .claude/skills/headless-adapters/schemas/claude-headless.json # Gemini CLI -GEMINI_API_KEY=... bunx @plaited/acp-harness headless --schema .claude/skills/acp-adapters/schemas/gemini-headless.json +GEMINI_API_KEY=... bunx @plaited/agent-eval-harness headless --schema .claude/skills/headless-adapters/schemas/gemini-headless.json ``` ## Agents with Headless CLI Support @@ -160,10 +128,9 @@ GEMINI_API_KEY=... bunx @plaited/acp-harness headless --schema .claude/skills/ac 1. Explore the CLI's `--help` to identify prompt, output, and auto-approve flags 2. Capture sample JSON output from the CLI -3. Map JSONPath patterns to ACP events +3. Map JSONPath patterns to output events 4. Create schema file based on an existing template 5. Test with `headless` command -6. Validate with `adapter:check` See [Schema Creation Guide](references/schema-creation-guide.md) for the complete workflow. 
@@ -179,7 +146,7 @@ See [Schema Creation Guide](references/schema-creation-guide.md) for the complet | Timeout on prompt | JSONPath not matching | Capture raw CLI output, verify paths - [see guide](references/troubleshooting-guide.md#jsonpath-debugging) | | Empty responses | Content extraction failing | Check extract paths - [see guide](references/troubleshooting-guide.md#output-event-matching) | -**📖 Complete troubleshooting documentation:** [Troubleshooting Guide](references/troubleshooting-guide.md) +**Complete troubleshooting documentation:** [Troubleshooting Guide](references/troubleshooting-guide.md) This guide includes: - Detailed debugging steps for each issue @@ -197,33 +164,13 @@ This guide includes: 2. **Test headless adapter directly:** ```bash - printf '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":1}}\n' | \ - bunx @plaited/acp-harness headless --schema ./my-schema.json - ``` - -3. **Run adapter:check for diagnostics:** - ```bash - bunx @plaited/acp-harness adapter:check \ - bunx @plaited/acp-harness headless --schema ./my-schema.json --verbose + bunx @plaited/agent-eval-harness headless --schema ./my-schema.json -p "Hello" ``` ## External Resources -- **ACP-Compatible Agents**: [agentclientprotocol.com/overview/agents](https://agentclientprotocol.com/overview/agents) - **AgentSkills Spec**: [agentskills.io](https://agentskills.io) -- **ACP Protocol Docs**: Use the MCP server for protocol questions: - ```json - { - "mcpServers": { - "agent-client-protocol-docs": { - "type": "http", - "url": "https://agentclientprotocol.com/mcp" - } - } - } - ``` ## Related -- **[acp-harness skill](../acp-harness/SKILL.md)** - Running evaluations against adapters -- **[@agentclientprotocol/sdk](https://www.npmjs.com/package/@agentclientprotocol/sdk)** - ACP SDK with TypeScript types +- **[agent-eval-harness skill](../agent-eval-harness/SKILL.md)** - Running evaluations against adapters diff --git 
a/.claude/skills/acp-adapters/references/schema-creation-guide.md b/.claude/skills/headless-adapters/references/schema-creation-guide.md similarity index 92% rename from .claude/skills/acp-adapters/references/schema-creation-guide.md rename to .claude/skills/headless-adapters/references/schema-creation-guide.md index 40fdc12..d6004fb 100644 --- a/.claude/skills/acp-adapters/references/schema-creation-guide.md +++ b/.claude/skills/headless-adapters/references/schema-creation-guide.md @@ -14,8 +14,7 @@ flowchart TD B --> C["3. Capture Sample Output"] C --> D["4. Map JSONPath Patterns"] D --> E["5. Create Schema File"] - E --> F["6. Test with Headless"] - F --> G["7. Validate with adapter:check"] + E --> F["6. Test with Debug Mode"] ``` ### Step 1: Explore CLI Help @@ -104,7 +103,7 @@ Use an existing schema as a template: ```bash # Copy from tested schema -cp .claude/skills/acp-adapters/schemas/claude-headless.json ./my-agent-headless.json +cp .claude/skills/headless-adapters/schemas/claude-headless.json ./my-agent-headless.json ``` Modify for your agent: @@ -151,25 +150,21 @@ Run the headless adapter with your schema: ```bash # Test the adapter -AGENT_API_KEY=... bunx @plaited/acp-harness headless --schema ./my-agent-headless.json +AGENT_API_KEY=... bunx @plaited/agent-eval-harness headless --schema ./my-agent-headless.json ``` -### Step 7: Validate with adapter:check +### Step 6: Test with Debug Mode -Verify ACP compliance: +Use debug mode to verify JSONPath extraction: ```bash -bunx @plaited/acp-harness adapter:check \ - bunx @plaited/acp-harness headless --schema ./my-agent-headless.json +AGENT_API_KEY=... 
bunx @plaited/agent-eval-harness headless --schema ./my-agent-headless.json --debug ``` -All 6 checks should pass: -- `spawn` - Adapter launches -- `initialize` - Protocol handshake works -- `session/new` - Session creation works -- `session/prompt` - Prompt handling works -- `session/cancel` - Cancel is acknowledged -- `framing` - Valid JSON-RPC framing +Debug mode shows: +- Raw CLI output lines +- JSONPath match attempts +- Extracted values for each event ## Schema Field Reference diff --git a/.claude/skills/acp-adapters/references/troubleshooting-guide.md b/.claude/skills/headless-adapters/references/troubleshooting-guide.md similarity index 97% rename from .claude/skills/acp-adapters/references/troubleshooting-guide.md rename to .claude/skills/headless-adapters/references/troubleshooting-guide.md index 0e00d1f..03d1c43 100644 --- a/.claude/skills/acp-adapters/references/troubleshooting-guide.md +++ b/.claude/skills/headless-adapters/references/troubleshooting-guide.md @@ -184,7 +184,7 @@ Use `stdin: true` when: Some CLIs have the output format embedded in the base command (e.g., `codex exec --json`), so the schema doesn't need separate `output.flag` and `output.value` fields. However, the `output` field is required by the adapter schema. -Before acp-harness 0.4.3, specifying empty output values would add two empty strings as command arguments: +Before agent-eval-harness 0.4.3, specifying empty output values would add two empty strings as command arguments: ```bash # Schema with empty output @@ -199,9 +199,9 @@ codex exec --json - "" "" ### Solution -**acp-harness 0.4.3+:** Empty output flags are automatically skipped - no changes needed. +**agent-eval-harness 0.4.3+:** Empty output flags are automatically skipped - no changes needed. 
-**acp-harness 0.4.2 and earlier:** Use a workaround by putting the output format in the command array: +**agent-eval-harness 0.4.2 and earlier:** Use a workaround by putting the output format in the command array: ```json { @@ -217,7 +217,7 @@ Even though Codex doesn't use these flags, this prevents empty strings from bein Use empty `output.flag` and `output.value` when: - CLI has output format embedded in command - No additional flags needed for JSON output -- acp-harness version is 0.4.3 or later +- agent-eval-harness version is 0.4.3 or later **Examples:** - Codex: `codex exec --json` (format is in command) @@ -566,7 +566,7 @@ The `result` configuration marks when the agent is done: 1. **Check if events are being matched at all:** ```bash # Run adapter:check with verbose mode - adapter:check --verbose -- bunx @plaited/acp-harness headless --schema schema.json + bunx @plaited/agent-eval-harness headless --schema schema.json --debug ``` 2. **Verify JSON structure matches your paths:** diff --git a/.claude/skills/acp-adapters/schemas/claude-headless.json b/.claude/skills/headless-adapters/schemas/claude-headless.json similarity index 100% rename from .claude/skills/acp-adapters/schemas/claude-headless.json rename to .claude/skills/headless-adapters/schemas/claude-headless.json diff --git a/.claude/skills/acp-adapters/schemas/gemini-headless.json b/.claude/skills/headless-adapters/schemas/gemini-headless.json similarity index 100% rename from .claude/skills/acp-adapters/schemas/gemini-headless.json rename to .claude/skills/headless-adapters/schemas/gemini-headless.json diff --git a/AGENTS.md b/AGENTS.md index 14acc37..1b03a5b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -26,14 +26,14 @@ ANTHROPIC_API_KEY=sk-... GEMINI_API_KEY=... docker compose -f docker-compose.tes ### Package Overview -`@plaited/acp-harness` is a CLI tool for capturing agent trajectories from ACP-compatible agents. 
It executes prompts, captures full trajectories (tools, thoughts, plans), and outputs structured JSONL for downstream scoring. +`@plaited/agent-eval-harness` is a CLI tool for capturing agent trajectories from headless CLI agents. It executes prompts, captures full trajectories (tools, thoughts, plans), and outputs structured JSONL for downstream scoring. **CLI usage (with built-in headless adapter):** ```bash # Set API key and run capture with headless adapter (recommended) export ANTHROPIC_API_KEY=sk-... -bunx @plaited/acp-harness capture prompts.jsonl \ - bunx @plaited/acp-harness headless --schema .claude/skills/acp-adapters/schemas/claude-headless.json \ +bunx @plaited/agent-eval-harness capture prompts.jsonl \ + --schema .claude/skills/headless-adapters/schemas/claude-headless.json \ -o results.jsonl ``` @@ -53,9 +53,9 @@ bunx @plaited/acp-harness capture prompts.jsonl \ This project provides two AI agent skills in `.claude/skills/`: -### ACP Harness (`acp-harness`) +### Agent Eval Harness (`agent-eval-harness`) -CLI tool for capturing agent trajectories from ACP-compatible agents. +CLI tool for capturing agent trajectories from headless CLI agents. **Commands:** `capture`, `trials`, `summarize`, `calibrate`, `validate-refs`, `balance`, `schemas` @@ -64,28 +64,27 @@ CLI tool for capturing agent trajectories from ACP-compatible agents. - Generating training data (SFT/DPO) with full context - Building regression test fixtures for agent behavior -See `.claude/skills/acp-harness/SKILL.md` for complete documentation. +See `.claude/skills/agent-eval-harness/SKILL.md` for complete documentation. -### ACP Adapters (`acp-adapters`) +### Headless Adapters (`headless-adapters`) -Discover, create, and validate ACP adapters for agent integration. +Discover, create, and validate headless adapters for agent integration. 
-**Commands:** `headless`, `adapter:scaffold`, `adapter:check` +**Commands:** `headless` **Use cases:** - Finding existing adapters for your agent - Wrapping headless CLI agents with schema-driven adapter -- Building custom ACP adapters from scratch -- Validating adapter ACP compliance +- Creating new schemas for CLI agents -See `.claude/skills/acp-adapters/SKILL.md` for complete documentation. +See `.claude/skills/headless-adapters/SKILL.md` for complete documentation. ### Installing Skills Install skills for AI coding agents: ```bash -curl -fsSL https://raw.githubusercontent.com/plaited/skills-installer/main/install.sh | bash -s -- --agent --project acp-harness +curl -fsSL https://raw.githubusercontent.com/plaited/skills-installer/main/install.sh | bash -s -- --agent --project agent-eval-harness ``` Replace `` with: `claude`, `cursor`, `copilot`, `opencode`, `amp`, `goose`, `factory` diff --git a/README.md b/README.md index c03da9d..d97dc0a 100644 --- a/README.md +++ b/README.md @@ -78,12 +78,6 @@ curl -fsSL https://raw.githubusercontent.com/plaited/skills-installer/main/insta Replace `` with your agent: `claude`, `cursor`, `copilot`, `opencode`, `amp`, `goose`, `factory` -**Update skills:** - -```bash -curl -fsSL https://raw.githubusercontent.com/plaited/skills-installer/main/install.sh | bash -s -- update --agent --project acp-harness -``` - ### Available Skills #### ACP Harness diff --git a/bin/cli.ts b/bin/cli.ts index 7d09ac5..18a221e 100644 --- a/bin/cli.ts +++ b/bin/cli.ts @@ -1,7 +1,7 @@ #!/usr/bin/env bun /** - * ACP Harness CLI - Agent evaluation toolkit. + * Agent Eval Harness CLI - Agent evaluation toolkit. * * @remarks * Router for harness commands. Thin wrapper that delegates to command modules. 
@@ -15,12 +15,8 @@ * - balance: Analyze test set coverage * - schemas: Export JSON schemas for non-TS users * - headless: Schema-driven adapter for any headless CLI agent - * - adapter:scaffold: Scaffold new ACP adapter project - * - adapter:check: Validate adapter ACP compliance */ -import { adapterCheck } from '../src/adapter-check.ts' -import { adapterScaffold } from '../src/adapter-scaffold.ts' import { balance } from '../src/balance.ts' import { calibrate } from '../src/calibrate.ts' import { capture } from '../src/capture.ts' @@ -35,10 +31,10 @@ const [command, ...args] = Bun.argv.slice(2) const printHelp = () => { // biome-ignore lint/suspicious/noConsole: CLI help output console.log(` -acp-harness - CLI tool for agent evaluation +agent-eval-harness - CLI tool for agent evaluation Commands: - capture Capture trajectories from ACP agent + capture Capture trajectories from CLI agents trials Run prompts multiple times for pass@k/pass^k metrics summarize Derive compact views from results calibrate Sample failures for grader review @@ -46,40 +42,27 @@ Commands: balance Analyze test set coverage schemas Export JSON schemas for non-TypeScript users headless Schema-driven adapter for any headless CLI agent - adapter:scaffold Scaffold a new ACP adapter project - adapter:check Validate adapter ACP compliance -Run 'acp-harness --help' for command-specific help. +Run 'agent-eval-harness --help' for command-specific help. 
Examples: - # Basic capture - acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl + # Basic capture with schema + agent-eval-harness capture prompts.jsonl --schema claude.json -o results.jsonl # With grader - acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.ts -o results.jsonl + agent-eval-harness capture prompts.jsonl -s claude.json --grader ./grader.ts -o results.jsonl # Multi-run trials - acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.ts -o trials.jsonl + agent-eval-harness trials prompts.jsonl -s claude.json -k 5 --grader ./grader.ts -o trials.jsonl # Derive summary view - acp-harness summarize results.jsonl -o summary.jsonl + agent-eval-harness summarize results.jsonl -o summary.jsonl # Export schemas - acp-harness schemas --json -o schemas.json - - # Scaffold new adapter - acp-harness adapter:scaffold my-agent -o ./adapters/my-agent - - # Validate adapter compliance - acp-harness adapter:check bun ./my-adapter/src/main.ts + agent-eval-harness schemas --json -o schemas.json # Run headless adapter with schema - acp-harness headless --schema ./claude-headless.json - - # Capture with headless adapter - acp-harness capture prompts.jsonl \\ - acp-harness headless --schema ./claude-headless.json \\ - -o results.jsonl + agent-eval-harness headless --schema ./claude-headless.json Documentation: https://github.com/plaited/acp-harness `) @@ -119,14 +102,6 @@ const main = async () => { await headless(args) break - case 'adapter:scaffold': - await adapterScaffold(args) - break - - case 'adapter:check': - await adapterCheck(args) - break - case '-h': case '--help': case undefined: @@ -143,7 +118,7 @@ const main = async () => { default: console.error(`Unknown command: ${command}`) - console.error("Run 'acp-harness --help' for usage") + console.error("Run 'agent-eval-harness --help' for usage") process.exit(1) } } diff --git a/bin/tests/cli.spec.ts b/bin/tests/cli.spec.ts index 8a6d371..c31c0db 100644 
--- a/bin/tests/cli.spec.ts +++ b/bin/tests/cli.spec.ts @@ -3,11 +3,11 @@ import { join } from 'node:path' import { z } from 'zod' /** - * Tests for the acp-harness CLI. + * Tests for the agent-eval-harness CLI. * * @remarks * Tests CLI argument parsing, help output, and output format schemas. - * Integration tests requiring an actual ACP agent are in *.docker.ts files. + * Integration tests requiring an actual CLI agent are in *.docker.ts files. */ const CLI_PATH = join(import.meta.dir, '..', 'cli.ts') @@ -26,7 +26,7 @@ describe('CLI invocation', () => { const exitCode = await proc.exited expect(exitCode).toBe(0) - expect(stdout).toContain('acp-harness') + expect(stdout).toContain('agent-eval-harness') expect(stdout).toContain('Commands:') expect(stdout).toContain('capture') expect(stdout).toContain('trials') @@ -42,7 +42,7 @@ describe('CLI invocation', () => { const exitCode = await proc.exited expect(exitCode).toBe(0) - expect(stdout).toContain('acp-harness') + expect(stdout).toContain('agent-eval-harness') }) test('shows help when no arguments provided', async () => { @@ -54,7 +54,7 @@ describe('CLI invocation', () => { const exitCode = await proc.exited expect(exitCode).toBe(0) // Exits cleanly when showing help - expect(stdout).toContain('acp-harness') + expect(stdout).toContain('agent-eval-harness') }) test('help shows example commands', async () => { @@ -64,7 +64,7 @@ describe('CLI invocation', () => { }) const stdout = await new Response(proc.stdout).text() - expect(stdout).toContain('bunx claude-code-acp') + expect(stdout).toContain('--schema') expect(stdout).toContain('prompts.jsonl') expect(stdout).toContain('results.jsonl') }) @@ -84,8 +84,8 @@ describe('CLI invocation', () => { expect(stdout).toContain('schemas') }) - test('fails with non-existent prompts file', async () => { - const proc = Bun.spawn(['bun', CLI_PATH, 'capture', 'nonexistent.jsonl', 'bunx', 'claude-code-acp'], { + test('fails with non-existent schema file', async () => { + const proc 
= Bun.spawn(['bun', CLI_PATH, 'capture', 'prompts.jsonl', '--schema', 'nonexistent.json'], { stdout: 'pipe', stderr: 'pipe', }) @@ -93,10 +93,10 @@ describe('CLI invocation', () => { const exitCode = await proc.exited expect(exitCode).not.toBe(0) - expect(stderr).toContain('no such file or directory') + expect(stderr).toContain('Schema file not found') }) - test('fails when no agent command provided', async () => { + test('fails when no schema provided', async () => { const tmpFile = `/tmp/test-prompts-${Date.now()}.jsonl` await Bun.write(tmpFile, '{"id":"test-001","input":"test"}\n') @@ -108,7 +108,7 @@ describe('CLI invocation', () => { const exitCode = await proc.exited expect(exitCode).toBe(1) - expect(stderr).toContain('ACP agent command is required') + expect(stderr).toContain('--schema is required') }) test('fails with unknown command', async () => { @@ -488,11 +488,11 @@ describe('MCP server config parsing', () => { // ============================================================================ describe('error handling', () => { - test('fails with invalid JSONL format', async () => { + test('fails when schema file does not exist', async () => { const tmpFile = `/tmp/invalid-${Date.now()}.jsonl` - await Bun.write(tmpFile, '{invalid json}\n') + await Bun.write(tmpFile, '{"id": "t1", "input": "test"}\n') - const proc = Bun.spawn(['bun', CLI_PATH, 'capture', tmpFile, 'bunx', 'claude-code-acp'], { + const proc = Bun.spawn(['bun', CLI_PATH, 'capture', tmpFile, '--schema', 'nonexistent-schema.json'], { stdout: 'pipe', stderr: 'pipe', }) @@ -500,7 +500,7 @@ describe('error handling', () => { const exitCode = await proc.exited expect(exitCode).not.toBe(0) - expect(stderr).toContain('Invalid prompt at line 1') + expect(stderr).toContain('Schema file not found') }) test('capture command requires prompts path', async () => { diff --git a/package.json b/package.json index 6440068..23e6469 100644 --- a/package.json +++ b/package.json @@ -1,7 +1,7 @@ { - "name": 
"@plaited/acp-harness", - "version": "0.4.4", - "description": "CLI tool for capturing agent trajectories from ACP-compatible agents", + "name": "@plaited/agent-eval-harness", + "version": "1.0.0-alpha.1", + "description": "CLI tool for capturing agent trajectories from headless CLI agents", "license": "ISC", "engines": { "bun": ">= v1.2.9" @@ -15,13 +15,13 @@ }, "homepage": "https://github.com/plaited/acp-harness/tree/main#readme", "bin": { - "acp-harness": "./bin/cli.ts" + "agent-eval-harness": "./bin/cli.ts" }, "type": "module", "exports": { - ".": "./src/acp.ts", + ".": "./src/harness.ts", "./schemas": "./src/schemas.ts", - "./harness": "./src/harness.ts" + "./headless": "./src/headless.ts" }, "files": [ "./src/**", @@ -57,8 +57,7 @@ "@plaited/development-skills": "0.6.3" }, "peerDependencies": { - "typescript-language-server": "^5.1.3", - "@agentclientprotocol/sdk": "^0.13.0" + "typescript-language-server": "^5.1.3" }, "devDependencies": { "@biomejs/biome": "2.3.11", diff --git a/src/acp-client.ts b/src/acp-client.ts deleted file mode 100644 index d4fca51..0000000 --- a/src/acp-client.ts +++ /dev/null @@ -1,507 +0,0 @@ -/** - * Headless ACP client for programmatic agent interaction. - * - * @remarks - * This client enables automated evaluation of ACP-compatible agents like - * Claude Code, Droid, Gemini CLI, and others. It provides: - * - * - **Subprocess management**: Spawn and control agent processes - * - **Session handling**: Create and manage conversation sessions - * - **Streaming prompts**: AsyncGenerator for real-time updates - * - **Sync prompts**: Simple request/response for basic evals - * - **Auto-permissions**: Automatically approves all permissions for headless use - * - * Designed for testing and evaluation, not for user-facing applications. 
- */ - -import type { - AgentCapabilities, - CancelNotification, - ClientCapabilities, - ContentBlock, - Implementation, - InitializeRequest, - InitializeResponse, - PromptRequest, - PromptResponse, - RequestPermissionRequest, - RequestPermissionResponse, - SessionNotification, -} from '@agentclientprotocol/sdk' -import { version } from '../package.json' with { type: 'json' } -import { createACPTransport } from './acp-transport.ts' -import { ACP_METHODS, ACP_PROTOCOL_VERSION, DEFAULT_ACP_CLIENT_NAME, DEFAULT_POLLING_INTERVAL } from './constants.ts' -import type { Session } from './schemas.ts' -import { RequestPermissionRequestSchema, SessionNotificationSchema } from './schemas.ts' - -// ============================================================================ -// Types -// ============================================================================ - -/** Configuration for the ACP client */ -export type ACPClientConfig = { - /** Command to spawn agent (e.g., ['claude', 'code'] or ['droid']) */ - command: string[] - /** Working directory for agent process */ - cwd?: string - /** Environment variables for agent process */ - env?: Record<string, string> - /** Client info for initialization */ - clientInfo?: Implementation - /** Client capabilities to advertise */ - capabilities?: ClientCapabilities - /** Timeout for operations in milliseconds (default: 30000) */ - timeout?: number - /** - * Polling interval for streaming updates in milliseconds (default: 50). - * Lower values provide more responsive updates but increase CPU usage. - * Consider increasing for testing to reduce timing-related flakiness. - */ - pollingInterval?: number - /** - * Permission handler for agent requests.
- * Default: auto-approve all permissions (headless mode) - */ - onPermissionRequest?: (params: RequestPermissionRequest) => Promise<RequestPermissionResponse> -} - -/** Session update emitted during prompt streaming */ -export type SessionUpdate = { - type: 'update' - params: SessionNotification -} - -/** Prompt completion emitted when prompt finishes */ -export type PromptComplete = { - type: 'complete' - result: PromptResponse -} - -/** Events emitted during prompt streaming */ -export type PromptEvent = SessionUpdate | PromptComplete - -/** Error thrown by ACP client operations */ -export class ACPClientError extends Error { - constructor( - message: string, - public readonly code?: string, - ) { - super(message) - this.name = 'ACPClientError' - } -} - -// ============================================================================ -// Client Implementation -// ============================================================================ - -/** - * Creates a headless ACP client for agent evaluation. - * - * @param config - Client configuration including command, cwd, and permission handling - * @returns Client object with lifecycle, session, and prompt methods - * - * @remarks - * The client manages: - * - Agent subprocess lifecycle (connect/disconnect) - * - Protocol initialization and capability negotiation - * - Session creation and management - * - Prompt streaming with real-time updates - * - Automatic permission approval for headless evaluation - * - * See module-level documentation in `src/acp.ts` for usage guidance. - * See client tests for usage patterns.
- */ -export const createACPClient = (config: ACPClientConfig) => { - const { - command, - cwd, - env, - clientInfo = { name: DEFAULT_ACP_CLIENT_NAME, version }, - capabilities = {}, - timeout = 30000, - pollingInterval = DEFAULT_POLLING_INTERVAL, - onPermissionRequest, - } = config - - let transport: ReturnType<typeof createACPTransport> | undefined - let agentCapabilities: AgentCapabilities | undefined - let initializeResult: InitializeResponse | undefined - - // Track active prompt sessions for update routing - const activePrompts = new Map< - string, - { - updates: SessionNotification[] - resolve: (result: PromptResponse) => void - reject: (error: Error) => void - } - >() - - // -------------------------------------------------------------------------- - // Permission Handling - // -------------------------------------------------------------------------- - - /** - * Default permission handler: auto-approve all requests. - * For headless evaluation in trusted environments. - * - * @remarks - * Validates params with Zod before processing. - * Prioritizes `allow_always` for faster headless evaluation with fewer - * permission round-trips. Cancels if validation fails or no allow option - * is available.
- */ - const autoApprovePermission = async (params: RequestPermissionRequest): Promise<RequestPermissionResponse> => { - const result = RequestPermissionRequestSchema.safeParse(params) - if (!result.success) { - return { outcome: { outcome: 'cancelled' } } - } - - const { options } = result.data - - // Priority: allow_always (fewer round-trips) > allow_once - const allowAlways = options.find((opt) => opt.kind === 'allow_always') - if (allowAlways) { - return { outcome: { outcome: 'selected', optionId: allowAlways.optionId } } - } - - const allowOnce = options.find((opt) => opt.kind === 'allow_once') - if (allowOnce) { - return { outcome: { outcome: 'selected', optionId: allowOnce.optionId } } - } - - // No allow option available - cancel - return { outcome: { outcome: 'cancelled' } } - } - - const handlePermissionRequest = onPermissionRequest ?? autoApprovePermission - - // -------------------------------------------------------------------------- - // Transport Callbacks - // -------------------------------------------------------------------------- - - const onNotification = (method: string, params: unknown) => { - if (method === ACP_METHODS.UPDATE) { - const updateParams = SessionNotificationSchema.parse(params) - const activePrompt = activePrompts.get(updateParams.sessionId) - if (activePrompt) { - activePrompt.updates.push(updateParams) - } - } - } - - const onRequest = async (method: string, params: unknown): Promise => { - if (method === ACP_METHODS.REQUEST_PERMISSION) { - return handlePermissionRequest(RequestPermissionRequestSchema.parse(params)) - } - - throw new ACPClientError(`Unknown request method: ${method}`) - } - - // -------------------------------------------------------------------------- - // Lifecycle Methods - // -------------------------------------------------------------------------- - - /** - * Connects to the agent by spawning the subprocess and initializing the protocol.
- * - * @returns Initialize result with agent capabilities - * @throws {ACPClientError} If already connected or connection fails - */ - const connect = async (): Promise<InitializeResponse> => { - if (transport?.isConnected()) { - throw new ACPClientError('Already connected') - } - - transport = createACPTransport({ - command, - cwd, - env, - timeout, - onNotification, - onRequest, - onError: (error) => { - console.error('[ACP Client Error]:', error.message) - }, - onClose: (code) => { - // Reject all active prompts on unexpected close - for (const [sessionId, prompt] of activePrompts) { - prompt.reject(new ACPClientError(`Agent process exited with code ${code}`)) - activePrompts.delete(sessionId) - } - }, - }) - - await transport.start() - - // Initialize protocol - const initParams: InitializeRequest = { - protocolVersion: ACP_PROTOCOL_VERSION, - clientInfo, - clientCapabilities: capabilities, - } - - initializeResult = await transport.request(ACP_METHODS.INITIALIZE, initParams) - - agentCapabilities = initializeResult?.agentCapabilities - - return initializeResult - } - - /** - * Disconnects from the agent, closing the subprocess. - * - * @param graceful - If true, sends shutdown request first (default: true) - */ - const disconnect = async (graceful = true): Promise<void> => { - if (!transport) return - - // Cancel all active prompts - for (const [sessionId, prompt] of activePrompts) { - prompt.reject(new ACPClientError('Client disconnected')) - activePrompts.delete(sessionId) - } - - await transport.close(graceful) - transport = undefined - agentCapabilities = undefined - initializeResult = undefined - } - - // -------------------------------------------------------------------------- - // Session Methods - // -------------------------------------------------------------------------- - - /** - * Creates a new conversation session.
- * - * @remarks - * MCP servers are auto-discovered by the agent from configuration files - * in the working directory (e.g., `.mcp.json`, `.gemini/settings.json`). - * - * @param params - Session parameters with working directory - * @returns The created session - * @throws {ACPClientError} If not connected - */ - const createSession = async (params: { cwd: string }): Promise<Session> => { - if (!transport?.isConnected()) { - throw new ACPClientError('Not connected') - } - - const response = await transport.request<{ sessionId: string }>(ACP_METHODS.CREATE_SESSION, { - cwd: params.cwd, - mcpServers: [], // Required field - empty array lets agents auto-discover from cwd - }) - return { id: response.sessionId } - } - - /** - * Sets the model for a session. - * - * @experimental This is an unstable ACP feature and may change. - * @param sessionId - The session ID to set the model for - * @param modelId - The model ID (e.g., 'claude-3-5-haiku-20241022', 'claude-sonnet-4-20250514') - * @throws {ACPClientError} If not connected - */ - const setModel = async (sessionId: string, modelId: string): Promise<void> => { - if (!transport?.isConnected()) { - throw new ACPClientError('Not connected') - } - - await transport.request(ACP_METHODS.SET_MODEL, { sessionId, modelId }) - } - - // -------------------------------------------------------------------------- - // Prompt Methods - // -------------------------------------------------------------------------- - - /** - * Sends a prompt and streams updates as they arrive. - * - * @param sessionId - The session ID to send the prompt to - * @param content - Content blocks for the prompt - * @yields Session updates and final completion - * @throws {ACPClientError} If not connected - * - * @remarks - * Use this for evaluation scenarios where you need access to - * intermediate updates (tool calls, plan changes, etc).
- */ - async function* prompt(sessionId: string, content: ContentBlock[]): AsyncGenerator { - if (!transport?.isConnected()) { - throw new ACPClientError('Not connected') - } - - const { promise, resolve, reject } = Promise.withResolvers() - const updates: SessionNotification[] = [] - const promptState = { - updates, - resolve, - reject, - } - - activePrompts.set(sessionId, promptState) - - // Send prompt request - const promptParams: PromptRequest = { - sessionId, - prompt: content, - } - - // Start the prompt request (don't await - we'll poll for updates) - const promptPromise = transport - .request(ACP_METHODS.PROMPT, promptParams) - .then(resolve) - .catch(reject) - - try { - // Poll for updates until prompt completes - let lastYieldedIndex = 0 - - while (true) { - // Yield any new updates - while (lastYieldedIndex < promptState.updates.length) { - const update = promptState.updates[lastYieldedIndex] - if (update) { - yield { type: 'update', params: update } - } - lastYieldedIndex++ - } - - // Check if prompt completed - const raceResult = await Promise.race([ - promise.then((result) => ({ done: true as const, result })), - new Promise<{ done: false }>((res) => setTimeout(() => res({ done: false }), pollingInterval)), - ]) - - if (raceResult.done) { - // Yield any remaining updates - while (lastYieldedIndex < promptState.updates.length) { - const update = promptState.updates[lastYieldedIndex] - if (update) { - yield { type: 'update', params: update } - } - lastYieldedIndex++ - } - - // Yield completion - yield { - type: 'complete', - result: raceResult.result, - } - break - } - } - - await promptPromise - } finally { - activePrompts.delete(sessionId) - } - } - - /** - * Sends a prompt and waits for the final result. 
- * - * @param sessionId - The session ID to send the prompt to - * @param content - Content blocks for the prompt - * @returns The prompt result with all accumulated updates - * @throws {ACPClientError} If not connected - * - * @remarks - * Use this for simple evaluation scenarios where you only need - * the final result. All intermediate updates are collected but - * returned together at the end. - */ - const promptSync = async ( - sessionId: string, - content: ContentBlock[], - ): Promise<{ - result: PromptResponse - updates: SessionNotification[] - }> => { - const updates: SessionNotification[] = [] - let result: PromptResponse | undefined - - for await (const event of prompt(sessionId, content)) { - if (event.type === 'update') { - updates.push(event.params) - } else if (event.type === 'complete') { - result = event.result - } - } - - if (!result) { - throw new ACPClientError('Prompt completed without result') - } - - return { result, updates } - } - - /** - * Cancels an ongoing prompt. - * - * @param sessionId - The session ID to cancel - * @throws {ACPClientError} If not connected - */ - const cancelPrompt = async (sessionId: string): Promise => { - if (!transport?.isConnected()) { - throw new ACPClientError('Not connected') - } - - const cancelParams: CancelNotification = { sessionId } - await transport.notify(ACP_METHODS.CANCEL, cancelParams) - } - - // -------------------------------------------------------------------------- - // State Methods - // -------------------------------------------------------------------------- - - /** - * Gets the agent capabilities negotiated during initialization. - * - * @returns Agent capabilities or undefined if not connected - */ - const getCapabilities = (): AgentCapabilities | undefined => { - return agentCapabilities - } - - /** - * Gets the full initialization result. 
-   *
-   * @returns Initialize result or undefined if not connected
-   */
-  const getInitializeResult = (): InitializeResponse | undefined => {
-    return initializeResult
-  }
-
-  /**
-   * Checks if the client is connected to an agent.
-   */
-  const isConnected = (): boolean => {
-    return transport?.isConnected() ?? false
-  }
-
-  return {
-    // Lifecycle
-    connect,
-    disconnect,
-
-    // Sessions
-    createSession,
-    setModel,
-
-    // Prompts
-    prompt,
-    promptSync,
-    cancelPrompt,
-
-    // State
-    getCapabilities,
-    getInitializeResult,
-    isConnected,
-  }
-}
-
-/** Client instance type */
-export type ACPClient = ReturnType
diff --git a/src/acp-helpers.ts b/src/acp-helpers.ts
deleted file mode 100644
index a7f72d5..0000000
--- a/src/acp-helpers.ts
+++ /dev/null
@@ -1,121 +0,0 @@
-/**
- * High-level helper utilities for ACP prompt building and response analysis.
- *
- * @remarks
- * Provides convenience functions for common ACP workflows:
- * - Building prompts with text, files, and images
- * - Summarizing agent responses for evaluation
- *
- * For low-level content manipulation, see internal utilities in acp-utils.ts.
- */
-
-import type { ContentBlock, PlanEntry, SessionNotification, ToolCall } from '@agentclientprotocol/sdk'
-import {
-  createImageContent,
-  createTextContent,
-  createTextResource,
-  extractLatestToolCalls,
-  extractPlan,
-  extractTextFromUpdates,
-  filterToolCallsByStatus,
-  getPlanProgress,
-  hasToolCallErrors,
-} from './acp-utils.ts'
-
-// ============================================================================
-// Prompt Building Utilities
-// ============================================================================
-
-/**
- * Creates a simple text prompt.
- *
- * @param text - The prompt text
- * @returns Array with single text content block
- */
-export const createPrompt = (text: string): ContentBlock[] => {
-  return [createTextContent(text)]
-}
-
-/**
- * Creates a prompt with text and file context.
- * - * @param text - The prompt text - * @param files - Array of file paths and contents to include - * @returns Array of content blocks - */ -export const createPromptWithFiles = ( - text: string, - files: Array<{ path: string; content: string }>, -): ContentBlock[] => { - const blocks: ContentBlock[] = [createTextContent(text)] - - for (const file of files) { - blocks.push(createTextResource({ uri: `file://${file.path}`, text: file.content, mimeType: 'text/plain' })) - } - - return blocks -} - -/** Parameters for creating a prompt with image */ -export type CreatePromptWithImageParams = { - /** The prompt text */ - text: string - /** Base64-encoded image data */ - imageData: string - /** Image MIME type */ - mimeType: string -} - -/** - * Creates a prompt with text and image. - * - * @param params - Prompt with image parameters - * @returns Array of content blocks - */ -export const createPromptWithImage = ({ text, imageData, mimeType }: CreatePromptWithImageParams): ContentBlock[] => { - return [createTextContent(text), createImageContent(imageData, mimeType)] -} - -// ============================================================================ -// Response Analysis Utilities -// ============================================================================ - -/** Summary of a prompt response for evaluation */ -export type PromptResponseSummary = { - /** Concatenated text output */ - text: string - /** Number of tool calls made */ - toolCallCount: number - /** Tool calls that completed */ - completedToolCalls: ToolCall[] - /** Tool calls that failed */ - failedToolCalls: ToolCall[] - /** Final plan state */ - plan?: PlanEntry[] - /** Plan completion percentage */ - planProgress?: number - /** Whether any errors occurred */ - hasErrors: boolean -} - -/** - * Creates a summary of a prompt response for evaluation. 
- *
- * @param notifications - Session notifications from the prompt
- * @returns Response summary
- */
-export const summarizeResponse = (notifications: SessionNotification[]): PromptResponseSummary => {
-  const text = extractTextFromUpdates(notifications)
-  const toolCalls = [...extractLatestToolCalls(notifications).values()]
-  const plan = extractPlan(notifications)
-
-  return {
-    text,
-    toolCallCount: toolCalls.length,
-    completedToolCalls: filterToolCallsByStatus(toolCalls, 'completed'),
-    failedToolCalls: filterToolCallsByStatus(toolCalls, 'failed'),
-    plan,
-    planProgress: plan ? getPlanProgress(plan) : undefined,
-    hasErrors: hasToolCallErrors(toolCalls),
-  }
-}
diff --git a/src/acp-transport.ts b/src/acp-transport.ts
deleted file mode 100644
index 56f1d32..0000000
--- a/src/acp-transport.ts
+++ /dev/null
@@ -1,462 +0,0 @@
-/**
- * ACP stdio transport for subprocess communication.
- *
- * @remarks
- * Manages bidirectional JSON-RPC 2.0 communication with ACP agents over
- * stdin/stdout. Handles message framing, request/response correlation,
- * and notification streaming.
- *
- * The transport spawns the agent as a subprocess and communicates using
- * newline-delimited JSON messages with Zod runtime validation.
- */ - -import { JSON_RPC_ERRORS } from './constants.ts' -import type { - JsonRpcError, - JsonRpcErrorResponse, - JsonRpcMessage, - JsonRpcNotification, - JsonRpcRequest, - JsonRpcResponse, - JsonRpcSuccessResponse, -} from './schemas.ts' -import { JsonRpcMessageSchema } from './schemas.ts' - -// ============================================================================ -// Types -// ============================================================================ - -/** Configuration for the ACP transport */ -export type ACPTransportConfig = { - /** Command to spawn agent (e.g., ['claude', 'code', '--print-acp-config']) */ - command: string[] - /** Working directory for agent process */ - cwd?: string - /** Environment variables for agent process */ - env?: Record - /** Timeout for requests in milliseconds (default: 30000) */ - timeout?: number - /** Callback for incoming notifications */ - onNotification?: (method: string, params: unknown) => void - /** Callback for incoming requests (agent → client) */ - onRequest?: (method: string, params: unknown) => Promise - /** Callback for transport errors */ - onError?: (error: Error) => void - /** Callback when transport closes */ - onClose?: (code: number | null) => void -} - -/** Pending request tracker */ -type PendingRequest = { - resolve: (result: unknown) => void - reject: (error: Error) => void - timer: Timer -} - -/** Bun FileSink for subprocess stdin */ -type FileSink = { - write: (data: string | ArrayBufferView | ArrayBuffer) => number - flush: () => void - end: () => void -} - -/** Subprocess type with piped stdio (Bun.spawn return type) */ -type PipedSubprocess = { - stdin: FileSink - stdout: ReadableStream - stderr: ReadableStream - exited: Promise - kill: (signal?: number) => void - pid: number -} - -/** Custom error for ACP transport failures */ -export class ACPTransportError extends Error { - constructor( - message: string, - public readonly code?: number, - public readonly data?: unknown, - ) { - 
super(message) - this.name = 'ACPTransportError' - } - - /** Create from JSON-RPC error */ - static fromJsonRpcError(error: JsonRpcError): ACPTransportError { - return new ACPTransportError(error.message, error.code, error.data) - } -} - -// ============================================================================ -// Transport Implementation -// ============================================================================ - -/** - * Creates an ACP transport for subprocess communication. - * - * @param config - Transport configuration - * @returns Transport object with send/close methods - * - * @remarks - * The transport handles: - * - Spawning the agent subprocess - * - JSON-RPC message framing over stdio - * - Request/response correlation with timeouts - * - Notification and request routing - * - Graceful shutdown - * - Runtime validation of incoming messages via Zod - */ -export const createACPTransport = (config: ACPTransportConfig) => { - const { command, cwd, env, timeout = 30000, onNotification, onRequest, onError, onClose } = config - - let subprocess: PipedSubprocess | undefined - let nextId = 1 - const pendingRequests = new Map() - let buffer = '' - let isClosing = false - - // Stream readers for explicit cleanup - // Use global ReadableStreamDefaultReader type (Bun's type includes readMany) - let stdoutReader: globalThis.ReadableStreamDefaultReader | undefined - let stderrReader: globalThis.ReadableStreamDefaultReader | undefined - - // -------------------------------------------------------------------------- - // Message Parsing (with Zod validation) - // -------------------------------------------------------------------------- - - const parseMessages = (data: string): JsonRpcMessage[] => { - buffer += data - const messages: JsonRpcMessage[] = [] - const lines = buffer.split('\n') - - // Keep incomplete last line in buffer - buffer = lines.pop() ?? 
'' - - for (const line of lines) { - const trimmed = line.trim() - if (!trimmed) continue - - // Skip lines that don't look like JSON objects (debug output from adapters) - if (!trimmed.startsWith('{')) continue - - try { - const json = JSON.parse(trimmed) - const result = JsonRpcMessageSchema.safeParse(json) - - if (!result.success) { - // Only log if it looked like valid JSON but failed schema validation - onError?.(new Error(`Invalid JSON-RPC message: ${result.error.message}`)) - continue - } - - messages.push(result.data as JsonRpcMessage) - } catch { - // Silently skip non-JSON lines (common with debug output) - } - } - - return messages - } - - // -------------------------------------------------------------------------- - // Message Handling - // -------------------------------------------------------------------------- - - const handleMessage = async (message: JsonRpcMessage) => { - // Response to our request - if ( - 'id' in message && - message.id !== undefined && - message.id !== null && - ('result' in message || 'error' in message) - ) { - const response = message as JsonRpcResponse - const id = response.id as string | number - const pending = pendingRequests.get(id) - if (pending) { - pendingRequests.delete(id) - clearTimeout(pending.timer) - - if ('error' in response) { - pending.reject(ACPTransportError.fromJsonRpcError(response.error)) - } else { - pending.resolve(response.result) - } - } - return - } - - // Request from agent (e.g., permission request, file read) - if ('id' in message && message.id !== undefined && message.id !== null && 'method' in message) { - const request = message as JsonRpcRequest - const id = request.id as string | number - if (onRequest) { - try { - const result = await onRequest(request.method, request.params) - await sendResponse(id, result) - } catch (err) { - const error = err instanceof Error ? 
err : new Error(String(err)) - await sendErrorResponse(id, JSON_RPC_ERRORS.INTERNAL_ERROR, error.message) - } - } else { - // No handler, respond with method not found - await sendErrorResponse(id, JSON_RPC_ERRORS.METHOD_NOT_FOUND, `No handler for ${request.method}`) - } - return - } - - // Notification from agent - if ('method' in message && !('id' in message)) { - const notification = message as JsonRpcNotification - onNotification?.(notification.method, notification.params) - } - } - - // -------------------------------------------------------------------------- - // Sending Messages - // -------------------------------------------------------------------------- - - const sendRaw = (message: JsonRpcMessage): void => { - if (!subprocess || isClosing) { - throw new ACPTransportError('Transport is not connected') - } - - const json = `${JSON.stringify(message)}\n` - subprocess.stdin.write(json) - subprocess.stdin.flush() - } - - const sendResponse = async (id: string | number, result: unknown): Promise => { - const response: JsonRpcSuccessResponse = { - jsonrpc: '2.0', - id, - result, - } - sendRaw(response) - } - - const sendErrorResponse = async (id: string | number, code: number, message: string): Promise => { - const response: JsonRpcErrorResponse = { - jsonrpc: '2.0', - id, - error: { code, message }, - } - sendRaw(response) - } - - // -------------------------------------------------------------------------- - // Public API - // -------------------------------------------------------------------------- - - /** - * Starts the transport by spawning the agent subprocess. 
- * - * @throws {ACPTransportError} If the subprocess fails to start - */ - const start = async (): Promise => { - if (subprocess) { - throw new ACPTransportError('Transport already started') - } - - if (command.length === 0) { - throw new ACPTransportError('Command array is empty') - } - - const proc = Bun.spawn(command, { - cwd, - env: { ...Bun.env, ...env }, - stdin: 'pipe', - stdout: 'pipe', - stderr: 'pipe', - }) - - // Cast to our expected type - Bun.spawn with 'pipe' options returns streams - subprocess = proc as unknown as PipedSubprocess - - // Read stdout for JSON-RPC messages - const readStdout = async () => { - if (!subprocess) { - throw new ACPTransportError('Subprocess not started') - } - // Type assertion needed: Bun's ReadableStreamDefaultReader includes readMany - // but node:stream/web reader returned by getReader() doesn't have it - const reader = subprocess.stdout.getReader() as globalThis.ReadableStreamDefaultReader - stdoutReader = reader - const decoder = new TextDecoder() - - try { - while (true) { - const { done, value } = await reader.read() - if (done) break - - const text = decoder.decode(value, { stream: true }) - const messages = parseMessages(text) - for (const message of messages) { - await handleMessage(message) - } - } - } catch (err) { - if (!isClosing) { - onError?.(err instanceof Error ? 
err : new Error(String(err))) - } - } finally { - stdoutReader = undefined - } - } - - // Read stderr for debugging - const readStderr = async () => { - if (!subprocess) { - throw new ACPTransportError('Subprocess not started') - } - // Type assertion needed: Bun's ReadableStreamDefaultReader includes readMany - // but node:stream/web reader returned by getReader() doesn't have it - const reader = subprocess.stderr.getReader() as globalThis.ReadableStreamDefaultReader - stderrReader = reader - const decoder = new TextDecoder() - - try { - while (true) { - const { done, value } = await reader.read() - if (done) break - // Log stderr for debugging but don't treat as error - const text = decoder.decode(value, { stream: true }) - if (text.trim()) { - console.error('[ACP Agent stderr]:', text.trim()) - } - } - } catch { - // Ignore stderr read errors - } finally { - stderrReader = undefined - } - } - - // Start reading streams (fire-and-forget pattern) - // These run concurrently and clean up via optional chaining in close() - readStdout() - readStderr() - - // Monitor process exit - subprocess.exited.then((code) => { - if (!isClosing) { - // Reject all pending requests - for (const [id, pending] of pendingRequests) { - clearTimeout(pending.timer) - pending.reject(new ACPTransportError(`Process exited with code ${code}`)) - pendingRequests.delete(id) - } - onClose?.(code) - } - }) - } - - /** - * Sends a JSON-RPC request and waits for response. 
- * - * @param method - The RPC method name - * @param params - Optional parameters - * @returns The result from the response - * @throws {ACPTransportError} On timeout, transport error, or RPC error - */ - const request = async (method: string, params?: unknown): Promise => { - const id = nextId++ - - const rpcRequest: JsonRpcRequest = { - jsonrpc: '2.0', - id, - method, - ...(params !== undefined && { params }), - } - - const { promise, resolve, reject } = Promise.withResolvers() - - const timer = setTimeout(() => { - pendingRequests.delete(id) - reject(new ACPTransportError(`Request timed out after ${timeout}ms`, JSON_RPC_ERRORS.INTERNAL_ERROR)) - }, timeout) - - pendingRequests.set(id, { resolve, reject, timer }) - - try { - sendRaw(rpcRequest) - } catch (err) { - pendingRequests.delete(id) - clearTimeout(timer) - throw err - } - - return promise as Promise - } - - /** - * Sends a JSON-RPC notification (no response expected). - * - * @param method - The notification method name - * @param params - Optional parameters - */ - const notify = async (method: string, params?: unknown): Promise => { - const notification: JsonRpcNotification = { - jsonrpc: '2.0', - method, - ...(params !== undefined && { params }), - } - sendRaw(notification) - } - - /** - * Cancels a pending request using the ACP cancel notification. - * - * @param requestId - The ID of the request to cancel - */ - const cancelRequest = async (requestId: string | number): Promise => { - // Use SDK's CancelRequestNotification format - await notify('$/cancel_request', { requestId }) - } - - /** - * Closes the transport and terminates the subprocess. 
-   *
-   * @param graceful - If true, sends shutdown request first (default: true)
-   */
-  const close = async (graceful = true): Promise => {
-    if (!subprocess || isClosing) return
-    isClosing = true
-
-    // Cancel all pending requests
-    for (const [id, pending] of pendingRequests) {
-      clearTimeout(pending.timer)
-      pending.reject(new ACPTransportError('Transport closed'))
-      pendingRequests.delete(id)
-    }
-
-    try {
-      if (graceful) {
-        // Try graceful shutdown - not in SDK, use string literal
-        await request('shutdown').catch(() => {})
-      }
-    } finally {
-      // Release stream readers to allow clean subprocess termination
-      await Promise.all([stdoutReader?.cancel().catch(() => {}), stderrReader?.cancel().catch(() => {})])
-
-      subprocess.kill()
-      subprocess = undefined
-    }
-  }
-
-  /**
-   * Checks if the transport is connected.
-   */
-  const isConnected = (): boolean => {
-    return subprocess !== undefined && !isClosing
-  }
-
-  return {
-    start,
-    request,
-    notify,
-    cancelRequest,
-    close,
-    isConnected,
-  }
-}
diff --git a/src/acp-utils.ts b/src/acp-utils.ts
deleted file mode 100644
index 1f529e6..0000000
--- a/src/acp-utils.ts
+++ /dev/null
@@ -1,341 +0,0 @@
-/**
- * Internal utilities for ACP content manipulation.
- *
- * @remarks
- * Low-level functions for building content blocks and extracting data
- * from session responses. These are used internally by the higher-level
- * helpers in acp-helpers.ts.
- *
- * @internal
- */
-
-import type {
-  BlobResourceContents,
-  ContentBlock,
-  PlanEntry,
-  SessionNotification,
-  SessionUpdate,
-  TextContent,
-  TextResourceContents,
-  ToolCall,
-  ToolCallContent,
-} from '@agentclientprotocol/sdk'
-
-// ============================================================================
-// Content Block Builders
-// ============================================================================
-
-/**
- * Creates a text content block.
- * - * @param text - The text content - * @returns Text content block - */ -export const createTextContent = (text: string): ContentBlock => ({ - type: 'text', - text, -}) - -/** - * Creates an image content block from base64 data. - * - * @param data - Base64-encoded image data - * @param mimeType - MIME type (e.g., 'image/png', 'image/jpeg') - * @returns Image content block - */ -export const createImageContent = (data: string, mimeType: string): ContentBlock => ({ - type: 'image', - data, - mimeType, -}) - -/** - * Creates an audio content block from base64 data. - * - * @param data - Base64-encoded audio data - * @param mimeType - MIME type (e.g., 'audio/wav', 'audio/mp3') - * @returns Audio content block - */ -export const createAudioContent = (data: string, mimeType: string): ContentBlock => ({ - type: 'audio', - data, - mimeType, -}) - -/** Parameters for creating a resource link */ -export type CreateResourceLinkParams = { - /** URI to the resource */ - uri: string - /** Resource name (required by SDK) */ - name: string - /** Optional MIME type */ - mimeType?: string -} - -/** - * Creates a resource link content block. - * - * @param params - Resource link parameters - * @returns Resource link content block - */ -export const createResourceLink = ({ uri, name, mimeType }: CreateResourceLinkParams): ContentBlock => ({ - type: 'resource_link', - uri, - name, - ...(mimeType && { mimeType }), -}) - -/** Parameters for creating an embedded text resource */ -export type CreateTextResourceParams = { - /** URI identifying the resource */ - uri: string - /** Text content of the resource */ - text: string - /** Optional MIME type */ - mimeType?: string -} - -/** - * Creates an embedded text resource content block. 
- * - * @param params - Text resource parameters - * @returns Resource content block - */ -export const createTextResource = ({ uri, text, mimeType }: CreateTextResourceParams): ContentBlock => ({ - type: 'resource', - resource: { - uri, - text, - ...(mimeType && { mimeType }), - } as TextResourceContents, -}) - -/** Parameters for creating an embedded blob resource */ -export type CreateBlobResourceParams = { - /** URI identifying the resource */ - uri: string - /** Base64-encoded binary data */ - blob: string - /** Optional MIME type */ - mimeType?: string -} - -/** - * Creates an embedded blob resource content block. - * - * @param params - Blob resource parameters - * @returns Resource content block - */ -export const createBlobResource = ({ uri, blob, mimeType }: CreateBlobResourceParams): ContentBlock => ({ - type: 'resource', - resource: { - uri, - blob, - ...(mimeType && { mimeType }), - } as BlobResourceContents, -}) - -// ============================================================================ -// Content Extraction -// ============================================================================ - -/** - * Extracts all text from content blocks. 
- * - * @param content - Array of content blocks - * @returns Concatenated text content - */ -export const extractText = (content: ContentBlock[]): string => { - return content - .filter((block): block is TextContent & { type: 'text' } => block.type === 'text') - .map((block) => block.text) - .join('\n') -} - -/** - * Helper to extract content from SessionUpdate (discriminated union) - */ -const getUpdateContent = (update: SessionUpdate): ContentBlock | undefined => { - if ( - update.sessionUpdate === 'user_message_chunk' || - update.sessionUpdate === 'agent_message_chunk' || - update.sessionUpdate === 'agent_thought_chunk' - ) { - return update.content - } - return undefined -} - -/** - * Helper to extract tool call from SessionUpdate - */ -const getUpdateToolCall = (update: SessionUpdate): ToolCall | undefined => { - if (update.sessionUpdate === 'tool_call') { - return update - } - return undefined -} - -/** - * Helper to extract plan from SessionUpdate - */ -const getUpdatePlan = (update: SessionUpdate): PlanEntry[] | undefined => { - if (update.sessionUpdate === 'plan') { - return update.entries - } - return undefined -} - -/** - * Extracts text from session notifications. - * - * @remarks - * Streaming produces partial tokens that should be concatenated directly. - * Uses empty string join to preserve the original text structure. - * - * @param notifications - Array of session notifications - * @returns Concatenated text from all updates - */ -export const extractTextFromUpdates = (notifications: SessionNotification[]): string => { - const texts: string[] = [] - for (const notification of notifications) { - const content = getUpdateContent(notification.update) - if (content && content.type === 'text') { - texts.push(content.text) - } - } - // Join without separator - streaming chunks should be concatenated directly - return texts.join('') -} - -/** - * Extracts all tool calls from session notifications. 
- * - * @param notifications - Array of session notifications - * @returns Array of all tool calls - */ -export const extractToolCalls = (notifications: SessionNotification[]): ToolCall[] => { - const calls: ToolCall[] = [] - for (const notification of notifications) { - const toolCall = getUpdateToolCall(notification.update) - if (toolCall) { - calls.push(toolCall) - } - } - return calls -} - -/** - * Extracts the latest state of each tool call (deduplicated by toolCallId). - * - * @param notifications - Array of session notifications - * @returns Map of tool call ID to latest tool call state - */ -export const extractLatestToolCalls = (notifications: SessionNotification[]): Map => { - const latest = new Map() - for (const notification of notifications) { - const toolCall = getUpdateToolCall(notification.update) - if (toolCall) { - latest.set(toolCall.toolCallId, toolCall) - } - } - return latest -} - -/** - * Extracts the latest plan from session notifications. - * - * @param notifications - Array of session notifications - * @returns Latest plan entries or undefined if no plan - */ -export const extractPlan = (notifications: SessionNotification[]): PlanEntry[] | undefined => { - // Plans are replaced entirely, so find the last one - for (let i = notifications.length - 1; i >= 0; i--) { - const notification = notifications[i] - if (notification) { - const plan = getUpdatePlan(notification.update) - if (plan) { - return plan - } - } - } - return undefined -} - -// ============================================================================ -// Tool Call Utilities -// ============================================================================ - -/** - * Filters tool calls by status. 
- * - * @param toolCalls - Array of tool calls - * @param status - Status to filter by - * @returns Filtered tool calls - */ -export const filterToolCallsByStatus = (toolCalls: ToolCall[], status: ToolCall['status']): ToolCall[] => { - return toolCalls.filter((call) => call.status === status) -} - -/** - * Filters tool calls by title. - * - * @param toolCalls - Array of tool calls - * @param title - Tool title to filter by - * @returns Filtered tool calls - */ -export const filterToolCallsByTitle = (toolCalls: ToolCall[], title: string): ToolCall[] => { - return toolCalls.filter((call) => call.title === title) -} - -/** - * Checks if any tool calls have failed. - * - * @param toolCalls - Array of tool calls - * @returns True if any tool call has 'failed' status - */ -export const hasToolCallErrors = (toolCalls: ToolCall[]): boolean => { - return toolCalls.some((call) => call.status === 'failed') -} - -/** - * Gets completed tool calls with their output content. - * - * @param toolCalls - Array of tool calls - * @returns Tool calls that completed with content - */ -export const getCompletedToolCallsWithContent = ( - toolCalls: ToolCall[], -): Array => { - return toolCalls.filter( - (call): call is ToolCall & { content: ToolCallContent[] } => - call.status === 'completed' && call.content !== undefined && call.content.length > 0, - ) -} - -// ============================================================================ -// Plan Utilities -// ============================================================================ - -/** - * Gets plan entries by status. - * - * @param plan - Array of plan entries - * @param status - Status to filter by - * @returns Filtered plan entries - */ -export const filterPlanByStatus = (plan: PlanEntry[], status: PlanEntry['status']): PlanEntry[] => { - return plan.filter((entry) => entry.status === status) -} - -/** - * Calculates plan completion percentage. 
- *
- * @param plan - Array of plan entries
- * @returns Percentage of completed entries (0-100)
- */
-export const getPlanProgress = (plan: PlanEntry[]): number => {
-  if (plan.length === 0) return 100
-  const completed = plan.filter((entry) => entry.status === 'completed').length
-  return Math.round((completed / plan.length) * 100)
-}
diff --git a/src/acp.ts b/src/acp.ts
deleted file mode 100644
index a0541ee..0000000
--- a/src/acp.ts
+++ /dev/null
@@ -1,27 +0,0 @@
-/**
- * @plaited/acp-harness - ACP client and evaluation harness for TypeScript/Bun projects.
- *
- * @remarks
- * This module provides a headless ACP client for programmatic agent interaction,
- * optimized for testing, evaluation, and training data generation.
- *
- * **Primary exports:**
- * - `createACPClient` - Factory for headless ACP client instances
- * - `createPrompt`, `createPromptWithFiles`, `createPromptWithImage` - Prompt builders
- * - `summarizeResponse` - Response analysis utility
- *
- * **Re-exports from acp-utils (for advanced usage):**
- * - Content builders: `createTextContent`, `createImageContent`, `createAudioContent`,
- *   `createResourceLink`, `createTextResource`, `createBlobResource`
- * - Content extractors: `extractText`, `extractTextFromUpdates`, `extractToolCalls`,
- *   `extractLatestToolCalls`, `extractPlan`
- * - Tool call utilities: `filterToolCallsByStatus`, `filterToolCallsByTitle`,
- *   `hasToolCallErrors`, `getCompletedToolCallsWithContent`
- * - Plan utilities: `filterPlanByStatus`, `getPlanProgress`
- *
- * @packageDocumentation
- */
-
-export * from './acp-client.ts'
-export * from './acp-helpers.ts'
-export * from './acp-utils.ts'
diff --git a/src/adapter-check.ts b/src/adapter-check.ts
deleted file mode 100644
index 17ffe0d..0000000
--- a/src/adapter-check.ts
+++ /dev/null
@@ -1,541 +0,0 @@
-/**
- * ACP adapter compliance checker.
- * - * @remarks - * Validates that an adapter correctly implements the Agent Client Protocol - * by running a series of checks: - * - * 1. spawn - Adapter can be launched as subprocess - * 2. initialize - Responds with valid agentCapabilities - * 3. session/new - Creates session and returns sessionId - * 4. session/prompt - Accepts prompt and emits session/update notifications - * 5. session/cancel - Accepts cancel notification gracefully - * 6. framing - All messages are newline-delimited JSON-RPC 2.0 - * - * @packageDocumentation - */ - -import { parseArgs } from 'node:util' -import { createACPTransport } from './acp-transport.ts' -import { ACP_METHODS, ACP_PROTOCOL_VERSION, DEFAULT_ACP_CLIENT_NAME } from './constants.ts' - -// ============================================================================ -// Types -// ============================================================================ - -/** Configuration for compliance check */ -export type CheckConfig = { - /** Command to spawn adapter (e.g., ['bun', './src/main.ts']) */ - command: string[] - /** Timeout for each check in milliseconds */ - timeout: number - /** Show detailed protocol messages */ - verbose: boolean -} - -/** Result of a single check */ -export type CheckResult = { - /** Check name */ - name: string - /** Whether the check passed */ - passed: boolean - /** Human-readable message */ - message: string - /** Additional details (for verbose output) */ - details?: string -} - -/** Result of full compliance check */ -export type ComplianceResult = { - /** Whether all checks passed */ - passed: boolean - /** Individual check results */ - checks: CheckResult[] - /** Summary statistics */ - summary: { - total: number - passed: number - failed: number - } -} - -// ============================================================================ -// Check Implementations -// ============================================================================ - -/** - * Check: spawn - * Verify adapter can be 
launched as a subprocess. - */ -const checkSpawn = async (config: CheckConfig): Promise => { - const { command, timeout, verbose } = config - - try { - const transport = createACPTransport({ - command, - timeout, - onNotification: () => {}, - onRequest: async () => ({}), - onError: () => {}, - onClose: () => {}, - }) - - await transport.start() - await transport.close(false) // Don't send shutdown, just close - - return { - name: 'spawn', - passed: true, - message: 'Adapter launched successfully', - details: verbose ? `Command: ${command.join(' ')}` : undefined, - } - } catch (error) { - return { - name: 'spawn', - passed: false, - message: `Failed to spawn adapter: ${error instanceof Error ? error.message : String(error)}`, - } - } -} - -/** - * Check: initialize - * Verify adapter responds to initialize with valid agentCapabilities. - */ -const checkInitialize = async ( - config: CheckConfig, -): Promise<{ result: CheckResult; transport?: ReturnType; capabilities?: unknown }> => { - const { command, timeout, verbose } = config - - try { - const transport = createACPTransport({ - command, - timeout, - onNotification: () => {}, - onRequest: async () => ({}), - onError: () => {}, - onClose: () => {}, - }) - - await transport.start() - - const initResponse = await transport.request<{ - protocolVersion: number - agentInfo?: { name: string; version: string } - agentCapabilities?: Record - }>(ACP_METHODS.INITIALIZE, { - protocolVersion: ACP_PROTOCOL_VERSION, - clientInfo: { name: DEFAULT_ACP_CLIENT_NAME, version: '1.0.0' }, - clientCapabilities: {}, - }) - - if (!initResponse || initResponse.protocolVersion !== ACP_PROTOCOL_VERSION) { - await transport.close(false) - return { - result: { - name: 'initialize', - passed: false, - message: `Invalid protocol version: expected ${ACP_PROTOCOL_VERSION}, got ${initResponse?.protocolVersion}`, - }, - } - } - - const capabilities = initResponse.agentCapabilities ?? 
{} - const capList = Object.entries(capabilities) - .filter(([, v]) => v) - .map(([k, v]) => { - if (typeof v === 'object' && v !== null) { - const nested = Object.entries(v as Record) - .filter(([, nv]) => nv) - .map(([nk]) => nk) - return nested.length > 0 ? `${k}.${nested.join(', ')}` : k - } - return k - }) - - return { - result: { - name: 'initialize', - passed: true, - message: `Protocol version ${initResponse.protocolVersion}${capList.length > 0 ? `, capabilities: ${capList.join(', ')}` : ''}`, - details: verbose ? JSON.stringify(initResponse, null, 2) : undefined, - }, - transport, - capabilities, - } - } catch (error) { - return { - result: { - name: 'initialize', - passed: false, - message: `Initialize failed: ${error instanceof Error ? error.message : String(error)}`, - }, - } - } -} - -/** - * Check: session/new - * Verify adapter creates session and returns sessionId. - */ -const checkSessionNew = async ( - transport: ReturnType, - verbose: boolean, -): Promise<{ result: CheckResult; sessionId?: string }> => { - try { - const response = await transport.request<{ sessionId: string }>(ACP_METHODS.CREATE_SESSION, { - cwd: process.cwd(), - }) - - if (!response || !response.sessionId) { - return { - result: { - name: 'session/new', - passed: false, - message: 'No sessionId in response', - }, - } - } - - return { - result: { - name: 'session/new', - passed: true, - message: `Session ${response.sessionId} created`, - details: verbose ? JSON.stringify(response, null, 2) : undefined, - }, - sessionId: response.sessionId, - } - } catch (error) { - return { - result: { - name: 'session/new', - passed: false, - message: `session/new failed: ${error instanceof Error ? error.message : String(error)}`, - }, - } - } -} - -/** - * Check: session/prompt - * Verify adapter accepts prompt and emits session/update notifications. 
- */ -const checkSessionPrompt = async (config: CheckConfig, sessionId: string): Promise<CheckResult> => { - const { command, timeout, verbose } = config - const updates: unknown[] = [] - - // Create a new transport with update collection - const transport = createACPTransport({ - command, - timeout, - onNotification: (method: string, params: unknown) => { - if (method === ACP_METHODS.UPDATE) { - updates.push(params) - } - }, - onRequest: async () => ({}), - onError: () => {}, - onClose: () => {}, - }) - - try { - await transport.start() - - // Re-initialize for new connection - await transport.request(ACP_METHODS.INITIALIZE, { - protocolVersion: ACP_PROTOCOL_VERSION, - clientInfo: { name: DEFAULT_ACP_CLIENT_NAME, version: '1.0.0' }, - clientCapabilities: {}, - }) - - const response = await transport.request<{ content: unknown[] }>(ACP_METHODS.PROMPT, { - sessionId, - prompt: [{ type: 'text', text: 'Hello, this is a test prompt.' }], - }) - - await transport.close(false) - - if (!response || !response.content) { - return { - name: 'session/prompt', - passed: false, - message: 'No content in response', - } - } - - // Categorize updates - const updateTypes = updates.map((u) => { - const update = u as { update?: { sessionUpdate?: string } } - return update?.update?.sessionUpdate ?? 'unknown' - }) - - const uniqueTypes = [...new Set(updateTypes)] - const typeDisplay = uniqueTypes.length > 0 ? uniqueTypes.join(', ') : 'none' - - return { - name: 'session/prompt', - passed: true, - message: `Received ${updates.length} update${updates.length !== 1 ? 's' : ''} (${typeDisplay})`, - details: verbose ? JSON.stringify({ updates, response }, null, 2) : undefined, - } - } catch (error) { - await transport.close(false).catch(() => {}) - - return { - name: 'session/prompt', - passed: false, - message: `session/prompt failed: ${error instanceof Error ? error.message : String(error)}`, - } - } -} - -/** - * Check: session/cancel - * Verify adapter accepts cancel notification gracefully.
- */ -const checkSessionCancel = async (config: CheckConfig, sessionId: string): Promise<CheckResult> => { - const { command, timeout, verbose } = config - - const transport = createACPTransport({ - command, - timeout, - onNotification: () => {}, - onRequest: async () => ({}), - onError: () => {}, - onClose: () => {}, - }) - - try { - await transport.start() - - // Re-initialize for new connection - await transport.request(ACP_METHODS.INITIALIZE, { - protocolVersion: ACP_PROTOCOL_VERSION, - clientInfo: { name: DEFAULT_ACP_CLIENT_NAME, version: '1.0.0' }, - clientCapabilities: {}, - }) - - await transport.notify(ACP_METHODS.CANCEL, { sessionId }) - - // Give adapter a moment to process the notification - await new Promise((resolve) => setTimeout(resolve, 100)) - - await transport.close(false) - - return { - name: 'session/cancel', - passed: true, - message: 'Acknowledged without error', - details: verbose ? `Sent cancel for session ${sessionId}` : undefined, - } - } catch (error) { - await transport.close(false).catch(() => {}) - - return { - name: 'session/cancel', - passed: false, - message: `session/cancel failed: ${error instanceof Error ? error.message : String(error)}`, - } - } -} - -/** - * Check: framing - * Verify all messages are valid JSON-RPC 2.0. - * This is implicitly tested by the other checks succeeding. - */ -const checkFraming = (previousChecks: CheckResult[]): CheckResult => { - // If all previous checks passed, framing is valid - const allPassed = previousChecks.every((c) => c.passed) - - if (allPassed) { - return { - name: 'framing', - passed: true, - message: 'All messages valid JSON-RPC 2.0', - } - } - - return { - name: 'framing', - passed: false, - message: 'Some messages failed validation (see above)', - } -} - -// ============================================================================ -// Main Check Runner -// ============================================================================ - -/** - * Run full compliance check against an adapter.
- * - * @param config - Check configuration - * @returns Compliance result with all check details - */ -export const runCheck = async (config: CheckConfig): Promise<ComplianceResult> => { - const checks: CheckResult[] = [] - - // Check 1: spawn - const spawnResult = await checkSpawn(config) - checks.push(spawnResult) - - if (!spawnResult.passed) { - // Can't continue if spawn fails - return { - passed: false, - checks, - summary: { total: 6, passed: 0, failed: 1 }, - } - } - - // Check 2: initialize - const { result: initResult, transport, capabilities: _ } = await checkInitialize(config) - checks.push(initResult) - - if (!initResult.passed || !transport) { - return { - passed: false, - checks, - summary: { total: 6, passed: 1, failed: 1 }, - } - } - - // Check 3: session/new - const { result: sessionResult, sessionId } = await checkSessionNew(transport, config.verbose) - checks.push(sessionResult) - - if (!sessionResult.passed || !sessionId) { - await transport.close(false) - return { - passed: false, - checks, - summary: { total: 6, passed: 2, failed: 1 }, - } - } - - // Clean up init transport - we'll create fresh ones for remaining checks - await transport.close(true) - - // Check 4: session/prompt (uses fresh transport) - const promptResult = await checkSessionPrompt(config, sessionId) - checks.push(promptResult) - - // Check 5: session/cancel (uses fresh transport) - const cancelResult = await checkSessionCancel(config, sessionId) - checks.push(cancelResult) - - // Check 6: framing (based on previous results) - const framingResult = checkFraming(checks) - checks.push(framingResult) - - const passed = checks.filter((c) => c.passed).length - const failed = checks.filter((c) => !c.passed).length - - return { - passed: failed === 0, - checks, - summary: { - total: checks.length, - passed, - failed, - }, - } -} - -// ============================================================================ -// CLI Entry Point -//
============================================================================ - -/** - * Adapter check command CLI handler. - * - * @param args - Command line arguments (after 'adapter:check') - */ -export const adapterCheck = async (args: string[]): Promise => { - const { values, positionals } = parseArgs({ - args, - options: { - timeout: { type: 'string', default: '5000' }, - verbose: { type: 'boolean', default: false }, - help: { type: 'boolean', short: 'h' }, - }, - allowPositionals: true, - }) - - if (values.help) { - // biome-ignore lint/suspicious/noConsole: CLI help output - console.log(` -Usage: acp-harness adapter:check [args...] - -Arguments: - command [args] Command to spawn the adapter - -Options: - --timeout Timeout for each check in ms (default: 5000) - --verbose Show detailed protocol messages - -h, --help Show this help message - -Checks Performed: - spawn Adapter can be launched as subprocess - initialize Responds with valid agentCapabilities - session/new Creates session and returns sessionId - session/prompt Accepts prompt and emits updates - session/cancel Accepts cancel notification gracefully - framing All messages are valid JSON-RPC 2.0 - -Examples: - # Check local TypeScript adapter - acp-harness adapter:check bun ./my-adapter/src/main.ts - - # Check with verbose output - acp-harness adapter:check bunx my-adapter --verbose - - # Check Python adapter - acp-harness adapter:check python ./adapter.py -`) - return - } - - if (positionals.length === 0) { - console.error('Error: adapter command is required') - console.error('Example: acp-harness adapter:check bun ./src/main.ts') - process.exit(1) - } - - const command = positionals - - // biome-ignore lint/suspicious/noConsole: CLI output - console.log(`Checking ACP compliance for: ${command.join(' ')}\n`) - - const result = await runCheck({ - command, - timeout: Number.parseInt(values.timeout ?? '5000', 10), - verbose: values.verbose ?? 
false, - }) - - // Print results - for (const check of result.checks) { - const icon = check.passed ? '\u2713' : '\u2717' - const color = check.passed ? '\x1b[32m' : '\x1b[31m' - const reset = '\x1b[0m' - - // biome-ignore lint/suspicious/noConsole: CLI output - console.log(`${color}${icon}${reset} ${check.name}: ${check.message}`) - - if (check.details && values.verbose) { - // biome-ignore lint/suspicious/noConsole: CLI verbose output - console.log(` ${check.details.split('\n').join('\n ')}`) - } - } - - // biome-ignore lint/suspicious/noConsole: CLI output - console.log( - `\n${result.summary.passed}/${result.summary.total} checks passed.${result.passed ? ' Adapter is ACP-compliant.' : ''}`, - ) - - if (!result.passed) { - process.exit(1) - } -} diff --git a/src/adapter-scaffold.ts b/src/adapter-scaffold.ts deleted file mode 100644 index c0681c4..0000000 --- a/src/adapter-scaffold.ts +++ /dev/null @@ -1,935 +0,0 @@ -/** - * ACP adapter project scaffolding. - * - * @remarks - * Generates boilerplate for new ACP adapter projects with proper structure, - * TypeScript configuration, and example handlers. - * - * Supports TypeScript and Python adapters. 
- * - * @packageDocumentation - */ - -import { stat } from 'node:fs/promises' -import { join } from 'node:path' -import { parseArgs } from 'node:util' - -// ============================================================================ -// Types -// ============================================================================ - -/** Configuration for scaffold generation */ -export type ScaffoldConfig = { - /** Adapter name (used for package name and directory) */ - name: string - /** Output directory path */ - outputDir: string - /** Language: 'ts' or 'python' */ - lang: 'ts' | 'python' - /** Generate minimal boilerplate only */ - minimal: boolean -} - -/** Result of scaffold operation */ -export type ScaffoldResult = { - /** Output directory path */ - outputDir: string - /** List of created files */ - files: string[] - /** Language used */ - lang: 'ts' | 'python' -} - -// ============================================================================ -// TypeScript Templates -// ============================================================================ - -const tsPackageJson = (name: string): string => `{ - "name": "${name}-acp", - "version": "1.0.0", - "type": "module", - "bin": { - "${name}-acp": "./src/main.ts" - }, - "scripts": { - "start": "bun run src/main.ts", - "check": "bunx @plaited/acp-harness adapter:check bun ./src/main.ts" - }, - "dependencies": { - "@agentclientprotocol/sdk": "^0.0.1" - }, - "devDependencies": { - "@types/bun": "latest", - "typescript": "^5.0.0" - } -} -` - -const tsTsConfig = (): string => `{ - "compilerOptions": { - "target": "ES2022", - "module": "ESNext", - "moduleResolution": "bundler", - "strict": true, - "esModuleInterop": true, - "skipLibCheck": true, - "outDir": "dist", - "declaration": true - }, - "include": ["src"] -} -` - -const tsIndexFile = (name: string): string => `#!/usr/bin/env bun -/** - * ${name} ACP adapter entry point. 
- * - * This adapter translates between the Agent Client Protocol and - * your agent's native API. - */ - -import { createInterface } from 'node:readline' -import { handleInitialize } from './handlers/initialize.ts' -import { handleSessionNew, handleSessionLoad } from './handlers/session-new.ts' -import { handleSessionPrompt } from './handlers/session-prompt.ts' -import { handleSessionCancel } from './handlers/session-cancel.ts' -import type { JsonRpcRequest, JsonRpcResponse, JsonRpcNotification } from './types.ts' - -// Method handlers -const methodHandlers: Record Promise> = { - initialize: handleInitialize, - 'session/new': handleSessionNew, - 'session/load': handleSessionLoad, - 'session/prompt': handleSessionPrompt, -} - -// Notification handlers (no response expected) -const notificationHandlers: Record Promise> = { - 'session/cancel': handleSessionCancel, -} - -/** - * Send a JSON-RPC message to stdout. - */ -export const sendMessage = (message: JsonRpcResponse | JsonRpcNotification): void => { - // biome-ignore lint/suspicious/noConsole: Protocol output - console.log(JSON.stringify(message)) -} - -/** - * Send a session update notification. - */ -export const sendSessionUpdate = (sessionId: string, update: unknown): void => { - sendMessage({ - jsonrpc: '2.0', - method: 'session/update', - params: { sessionId, update }, - }) -} - -/** - * Process incoming JSON-RPC message. 
- */ -const processMessage = async (line: string): Promise => { - let request: JsonRpcRequest | JsonRpcNotification - - try { - request = JSON.parse(line) - } catch { - sendMessage({ - jsonrpc: '2.0', - id: null, - error: { code: -32700, message: 'Parse error' }, - }) - return - } - - // Check if it's a notification (no id) - const isNotification = !('id' in request) - - if (isNotification) { - const handler = notificationHandlers[request.method] - if (handler) { - await handler(request.params) - } - // No response for notifications - return - } - - // It's a request - send response - const reqWithId = request as JsonRpcRequest - const handler = methodHandlers[reqWithId.method] - - if (!handler) { - sendMessage({ - jsonrpc: '2.0', - id: reqWithId.id, - error: { code: -32601, message: \`Method not found: \${reqWithId.method}\` }, - }) - return - } - - try { - const result = await handler(reqWithId.params) - sendMessage({ - jsonrpc: '2.0', - id: reqWithId.id, - result, - }) - } catch (error) { - sendMessage({ - jsonrpc: '2.0', - id: reqWithId.id, - error: { - code: -32603, - message: error instanceof Error ? error.message : 'Internal error', - }, - }) - } -} - -// Main loop: read lines from stdin -const rl = createInterface({ - input: process.stdin, - output: process.stdout, - terminal: false, -}) - -rl.on('line', processMessage) - -// Handle clean shutdown -process.on('SIGTERM', () => { - rl.close() - process.exit(0) -}) -` - -const tsTypesFile = (): string => `/** - * TypeScript types for JSON-RPC 2.0 protocol. 
- */ - -export type JsonRpcRequest = { - jsonrpc: '2.0' - id: string | number - method: string - params?: unknown -} - -export type JsonRpcNotification = { - jsonrpc: '2.0' - method: string - params?: unknown -} - -export type JsonRpcSuccessResponse = { - jsonrpc: '2.0' - id: string | number - result: unknown -} - -export type JsonRpcErrorResponse = { - jsonrpc: '2.0' - id: string | number | null - error: { - code: number - message: string - data?: unknown - } -} - -export type JsonRpcResponse = JsonRpcSuccessResponse | JsonRpcErrorResponse - -export type ContentBlock = - | { type: 'text'; text: string } - | { type: 'image'; source: { type: 'base64'; mediaType: string; data: string } } -` - -const tsInitializeHandler = (name: string): string => `/** - * Initialize handler - protocol handshake. - */ - -type InitializeParams = { - protocolVersion: number - clientInfo: { name: string; version: string } - clientCapabilities: Record -} - -type InitializeResult = { - protocolVersion: number - agentInfo: { name: string; version: string } - agentCapabilities: { - loadSession?: boolean - promptCapabilities?: { - image?: boolean - } - } -} - -export const handleInitialize = async (params: unknown): Promise => { - const { protocolVersion } = params as InitializeParams - - if (protocolVersion !== 1) { - throw new Error(\`Unsupported protocol version: \${protocolVersion}\`) - } - - return { - protocolVersion: 1, - agentInfo: { - name: '${name}', - version: '1.0.0', - }, - agentCapabilities: { - loadSession: false, - promptCapabilities: { - image: false, - }, - }, - } -} -` - -const tsSessionNewHandler = (): string => `/** - * Session handlers - create and load sessions. 
- */ - -import { sessionManager } from '../session-manager.ts' - -type SessionNewParams = { - cwd: string -} - -type SessionNewResult = { - sessionId: string -} - -export const handleSessionNew = async (params: unknown): Promise => { - const { cwd } = params as SessionNewParams - - // MCP servers are discovered from cwd configuration files - // (e.g., .mcp.json, .gemini/settings.json) - const sessionId = sessionManager.createSession({ cwd }) - - return { sessionId } -} - -type SessionLoadParams = { - sessionId: string -} - -export const handleSessionLoad = async (params: unknown): Promise => { - const { sessionId } = params as SessionLoadParams - - const session = sessionManager.getSession(sessionId) - if (!session) { - throw new Error(\`Session not found: \${sessionId}\`) - } - - return { sessionId } -} -` - -const tsSessionPromptHandler = (): string => `/** - * Session prompt handler - process prompts and emit updates. - */ - -import { sessionManager } from '../session-manager.ts' -import { sendSessionUpdate } from '../main.ts' -import type { ContentBlock } from '../types.ts' - -type PromptParams = { - sessionId: string - prompt: ContentBlock[] -} - -type PromptResult = { - content: ContentBlock[] -} - -export const handleSessionPrompt = async (params: unknown): Promise => { - const { sessionId, prompt } = params as PromptParams - - const session = sessionManager.getSession(sessionId) - if (!session) { - throw new Error(\`Session not found: \${sessionId}\`) - } - - // Extract text from content blocks - const promptText = prompt - .filter((block): block is ContentBlock & { type: 'text' } => block.type === 'text') - .map((block) => block.text) - .join('\\n') - - // Emit thinking update - sendSessionUpdate(sessionId, { - sessionUpdate: 'agent_thought_chunk', - content: { type: 'text', text: 'Processing your request...' 
}, - }) - - // TODO: Replace with your agent's actual API call - const response = await processWithYourAgent(promptText, session.cwd) - - // Emit message update - sendSessionUpdate(sessionId, { - sessionUpdate: 'agent_message_chunk', - content: { type: 'text', text: response }, - }) - - return { - content: [{ type: 'text', text: response }], - } -} - -/** - * Replace this with your actual agent API call. - */ -const processWithYourAgent = async (prompt: string, _cwd: string): Promise => { - // Example echo implementation - replace with real agent call - return \`Echo: \${prompt}\` -} -` - -const tsSessionCancelHandler = (): string => `/** - * Session cancel handler - cancel ongoing prompts. - */ - -type CancelParams = { - sessionId: string -} - -// Track active requests for cancellation -const activeRequests = new Map() - -export const handleSessionCancel = async (params: unknown): Promise => { - const { sessionId } = params as CancelParams - - const controller = activeRequests.get(sessionId) - if (controller) { - controller.abort() - activeRequests.delete(sessionId) - } -} - -/** - * Register an active request for cancellation support. - */ -export const registerActiveRequest = ( - sessionId: string, - controller: AbortController -): void => { - activeRequests.set(sessionId, controller) -} - -/** - * Unregister an active request after completion. - */ -export const unregisterActiveRequest = (sessionId: string): void => { - activeRequests.delete(sessionId) -} -` - -const tsSessionManager = (): string => `/** - * Session manager - tracks active conversation sessions. 
- */ - -import { randomUUID } from 'node:crypto' - -type Session = { - id: string - cwd: string - createdAt: Date -} - -class SessionManager { - #sessions = new Map() - - createSession(params: { cwd: string }): string { - const id = \`sess_\${randomUUID().slice(0, 8)}\` - this.#sessions.set(id, { - id, - cwd: params.cwd, - createdAt: new Date(), - }) - return id - } - - getSession(id: string): Session | undefined { - return this.#sessions.get(id) - } - - deleteSession(id: string): boolean { - return this.#sessions.delete(id) - } - - listSessions(): Session[] { - return Array.from(this.#sessions.values()) - } -} - -export const sessionManager = new SessionManager() -` - -const tsReadme = (name: string): string => `# ${name} ACP Adapter - -ACP (Agent Client Protocol) adapter for ${name}. - -## Quick Start - -\`\`\`bash -# Install dependencies -bun install - -# Run the adapter -bun run start - -# Or run directly -bun run src/main.ts -\`\`\` - -## Verify Compliance - -\`\`\`bash -# Run compliance checker -bun run check - -# Or manually -bunx @plaited/acp-harness adapter:check bun ./src/main.ts -\`\`\` - -## Test with Harness - -\`\`\`bash -# Create test prompts -echo '{"id":"test-1","input":"Hello"}' > prompts.jsonl - -# Run capture -bunx @plaited/acp-harness capture prompts.jsonl bun ./src/main.ts -o results.jsonl - -# View results -cat results.jsonl | jq . -\`\`\` - -## Implementation - -Replace the placeholder in \`src/handlers/session-prompt.ts\`: - -\`\`\`typescript -const processWithYourAgent = async (prompt: string, cwd: string): Promise => { - // Call your agent's API here - const response = await yourAgentClient.chat(prompt) - return response.text -} -\`\`\` - -## Protocol Reference - -See the [ACP Specification](https://agentclientprotocol.org) for protocol details. 
-` - -// ============================================================================ -// Python Templates -// ============================================================================ - -const pythonAdapter = (name: string): string => `#!/usr/bin/env python3 -""" -${name} ACP adapter. - -ACP (Agent Client Protocol) adapter for ${name}. -Translates between JSON-RPC 2.0 and your agent's native API. -""" - -import json -import sys -import uuid -from typing import Any, Dict, Optional - -# Session storage -sessions: Dict[str, Dict[str, Any]] = {} - - -def create_session(cwd: str) -> str: - """Create a new session. - - MCP servers are discovered from cwd configuration files. - """ - session_id = f"sess_{uuid.uuid4().hex[:8]}" - sessions[session_id] = { - "id": session_id, - "cwd": cwd, - } - return session_id - - -def get_session(session_id: str) -> Optional[Dict[str, Any]]: - """Get session by ID.""" - return sessions.get(session_id) - - -def send_message(message: Dict[str, Any]) -> None: - """Send JSON-RPC message to stdout.""" - print(json.dumps(message), flush=True) - - -def send_session_update(session_id: str, update: Dict[str, Any]) -> None: - """Send session update notification.""" - send_message({ - "jsonrpc": "2.0", - "method": "session/update", - "params": {"sessionId": session_id, "update": update}, - }) - - -def handle_initialize(params: Dict[str, Any]) -> Dict[str, Any]: - """Handle initialize request.""" - protocol_version = params.get("protocolVersion", 0) - if protocol_version != 1: - raise ValueError(f"Unsupported protocol version: {protocol_version}") - - return { - "protocolVersion": 1, - "agentInfo": {"name": "${name}", "version": "1.0.0"}, - "agentCapabilities": { - "loadSession": False, - "promptCapabilities": {"image": False}, - }, - } - - -def handle_session_new(params: Dict[str, Any]) -> Dict[str, Any]: - """Handle session/new request. - - MCP servers are discovered from cwd configuration files - (e.g., .mcp.json, .gemini/settings.json). 
- """ - cwd = params.get("cwd", ".") - session_id = create_session(cwd) - return {"sessionId": session_id} - - -def handle_session_prompt(params: Dict[str, Any]) -> Dict[str, Any]: - """Handle session/prompt request.""" - session_id = params["sessionId"] - session = get_session(session_id) - if not session: - raise ValueError(f"Session not found: {session_id}") - - # Extract text from prompt blocks - prompt_text = " ".join( - block["text"] - for block in params.get("prompt", []) - if block.get("type") == "text" - ) - - # Send thinking update - send_session_update(session_id, { - "sessionUpdate": "agent_thought_chunk", - "content": {"type": "text", "text": "Processing your request..."}, - }) - - # TODO: Replace with your agent's actual API call - response = process_with_your_agent(prompt_text, session["cwd"]) - - # Send message update - send_session_update(session_id, { - "sessionUpdate": "agent_message_chunk", - "content": {"type": "text", "text": response}, - }) - - return {"content": [{"type": "text", "text": response}]} - - -def process_with_your_agent(prompt: str, cwd: str) -> str: - """Replace with your actual agent API call.""" - return f"Echo: {prompt}" - - -# Method handlers -METHOD_HANDLERS = { - "initialize": handle_initialize, - "session/new": handle_session_new, - "session/prompt": handle_session_prompt, -} - - -def process_message(line: str) -> None: - """Process incoming JSON-RPC message.""" - try: - request = json.loads(line) - except json.JSONDecodeError: - send_message({ - "jsonrpc": "2.0", - "id": None, - "error": {"code": -32700, "message": "Parse error"}, - }) - return - - # Check if notification (no id) - if "id" not in request: - # Handle notification silently - return - - method = request.get("method", "") - handler = METHOD_HANDLERS.get(method) - - if not handler: - send_message({ - "jsonrpc": "2.0", - "id": request["id"], - "error": {"code": -32601, "message": f"Method not found: {method}"}, - }) - return - - try: - result = 
handler(request.get("params", {})) - send_message({ - "jsonrpc": "2.0", - "id": request["id"], - "result": result, - }) - except Exception as e: - send_message({ - "jsonrpc": "2.0", - "id": request["id"], - "error": {"code": -32603, "message": str(e)}, - }) - - -def main() -> None: - """Main loop: read lines from stdin.""" - for line in sys.stdin: - line = line.strip() - if line: - process_message(line) - - -if __name__ == "__main__": - main() -` - -const pythonReadme = (name: string): string => `# ${name} ACP Adapter - -ACP (Agent Client Protocol) adapter for ${name} (Python). - -## Quick Start - -\`\`\`bash -# Make executable -chmod +x adapter.py - -# Run the adapter -python adapter.py -\`\`\` - -## Verify Compliance - -\`\`\`bash -bunx @plaited/acp-harness adapter:check python ./adapter.py -\`\`\` - -## Test with Harness - -\`\`\`bash -# Create test prompts -echo '{"id":"test-1","input":"Hello"}' > prompts.jsonl - -# Run capture -bunx @plaited/acp-harness capture prompts.jsonl python ./adapter.py -o results.jsonl - -# View results -cat results.jsonl | jq . -\`\`\` - -## Implementation - -Replace the placeholder in \`adapter.py\`: - -\`\`\`python -def process_with_your_agent(prompt: str, cwd: str) -> str: - # Call your agent's API here - response = your_agent_client.chat(prompt) - return response.text -\`\`\` - -## Protocol Reference - -See the [ACP Specification](https://agentclientprotocol.org) for protocol details. -` - -// ============================================================================ -// Scaffold Implementation -// ============================================================================ - -/** - * Generate TypeScript adapter project. 
- */
-const scaffoldTypeScript = async (config: ScaffoldConfig): Promise<string[]> => {
-  const { name, outputDir, minimal } = config
-  const files: string[] = []
-
-  // Create directories
-  await Bun.write(join(outputDir, 'src', 'handlers', '.gitkeep'), '')
-
-  // Core files
-  await Bun.write(join(outputDir, 'package.json'), tsPackageJson(name))
-  files.push('package.json')
-
-  await Bun.write(join(outputDir, 'tsconfig.json'), tsTsConfig())
-  files.push('tsconfig.json')
-
-  await Bun.write(join(outputDir, 'src', 'main.ts'), tsIndexFile(name))
-  files.push('src/main.ts')
-
-  await Bun.write(join(outputDir, 'src', 'types.ts'), tsTypesFile())
-  files.push('src/types.ts')
-
-  await Bun.write(join(outputDir, 'src', 'session-manager.ts'), tsSessionManager())
-  files.push('src/session-manager.ts')
-
-  // Handler files
-  await Bun.write(join(outputDir, 'src', 'handlers', 'initialize.ts'), tsInitializeHandler(name))
-  files.push('src/handlers/initialize.ts')
-
-  await Bun.write(join(outputDir, 'src', 'handlers', 'session-new.ts'), tsSessionNewHandler())
-  files.push('src/handlers/session-new.ts')
-
-  await Bun.write(join(outputDir, 'src', 'handlers', 'session-prompt.ts'), tsSessionPromptHandler())
-  files.push('src/handlers/session-prompt.ts')
-
-  await Bun.write(join(outputDir, 'src', 'handlers', 'session-cancel.ts'), tsSessionCancelHandler())
-  files.push('src/handlers/session-cancel.ts')
-
-  // README (unless minimal)
-  if (!minimal) {
-    await Bun.write(join(outputDir, 'README.md'), tsReadme(name))
-    files.push('README.md')
-  }
-
-  return files
-}
-
-/**
- * Generate Python adapter project.
- */
-const scaffoldPython = async (config: ScaffoldConfig): Promise<string[]> => {
-  const { name, outputDir, minimal } = config
-  const files: string[] = []
-
-  await Bun.write(join(outputDir, 'adapter.py'), pythonAdapter(name))
-  files.push('adapter.py')
-
-  if (!minimal) {
-    await Bun.write(join(outputDir, 'README.md'), pythonReadme(name))
-    files.push('README.md')
-  }
-
-  return files
-}
-
-/**
- * Run adapter scaffolding with configuration object.
- *
- * @param config - Scaffold configuration
- * @returns Scaffold result with created files
- */
-export const runScaffold = async (config: ScaffoldConfig): Promise => {
-  const { outputDir, lang } = config
-
-  // Create output directory
-  await Bun.write(join(outputDir, '.gitkeep'), '')
-
-  const files = lang === 'python' ? await scaffoldPython(config) : await scaffoldTypeScript(config)
-
-  return {
-    outputDir,
-    files,
-    lang,
-  }
-}
-
-// ============================================================================
-// CLI Entry Point
-// ============================================================================
-
-/**
- * Adapter scaffold command CLI handler.
- * - * @param args - Command line arguments (after 'adapter:scaffold') - */ -export const adapterScaffold = async (args: string[]): Promise => { - const { values, positionals } = parseArgs({ - args, - options: { - output: { type: 'string', short: 'o' }, - lang: { type: 'string', default: 'ts' }, - minimal: { type: 'boolean', default: false }, - help: { type: 'boolean', short: 'h' }, - }, - allowPositionals: true, - }) - - if (values.help) { - // biome-ignore lint/suspicious/noConsole: CLI help output - console.log(` -Usage: acp-harness adapter:scaffold [name] [options] - -Arguments: - name Adapter name (used for package name) - -Options: - -o, --output Output directory (default: ./-acp) - --lang Language: ts or python (default: ts) - --minimal Generate minimal boilerplate only - -h, --help Show this help message - -Examples: - # Scaffold TypeScript adapter - acp-harness adapter:scaffold my-agent - - # Scaffold Python adapter - acp-harness adapter:scaffold my-agent --lang python - - # Scaffold to specific directory - acp-harness adapter:scaffold my-agent -o ./adapters/my-agent -`) - return - } - - const name = positionals[0] - if (!name) { - console.error('Error: adapter name is required') - console.error('Example: acp-harness adapter:scaffold my-agent') - process.exit(1) - } - - const lang = values.lang === 'python' ? 'python' : 'ts' - const outputDir = values.output ?? `./${name}-acp` - - // Check if directory already exists - const dirExists = await stat(outputDir).catch(() => null) - if (dirExists) { - console.error(`Error: directory already exists: ${outputDir}`) - process.exit(1) - } - - const result = await runScaffold({ - name, - outputDir, - lang, - minimal: values.minimal ?? false, - }) - - // biome-ignore lint/suspicious/noConsole: CLI output - console.log(` -Scaffolded ${result.lang === 'ts' ? 
'TypeScript' : 'Python'} adapter: ${name} - -Created files: -${result.files.map((f) => ` ${result.outputDir}/${f}`).join('\n')} - -Next steps: - cd ${result.outputDir} -${result.lang === 'ts' ? ' bun install' : ' chmod +x adapter.py'} -${result.lang === 'ts' ? ' bun run start' : ' python adapter.py'} - -Verify compliance: - acp-harness adapter:check ${result.lang === 'ts' ? 'bun ./src/main.ts' : 'python ./adapter.py'} -`) -} diff --git a/src/capture.ts b/src/capture.ts index c966b5f..7d54749 100644 --- a/src/capture.ts +++ b/src/capture.ts @@ -2,7 +2,7 @@ * Core trajectory capture command. * * @remarks - * Executes prompts against an ACP agent and captures full trajectories. + * Executes prompts against a CLI agent and captures full trajectories. * This is the foundational command - all other views derive from its output. * * Output format is always full trajectory JSONL (`CaptureResultSchema`). @@ -13,13 +13,13 @@ import { appendFile } from 'node:fs/promises' import { parseArgs } from 'node:util' -import type { SessionNotification, ToolCall } from '@agentclientprotocol/sdk' -import { createACPClient } from './acp-client.ts' -import { createPrompt } from './acp-helpers.ts' import { DEFAULT_HARNESS_TIMEOUT, HEAD_LINES, TAIL_LINES } from './constants.ts' import { loadGrader } from './grader-loader.ts' +import { type HeadlessAdapterConfig, parseHeadlessConfig } from './headless.schemas.ts' +import type { ParsedUpdate } from './headless-output-parser.ts' +import { createSessionManager, type ProcessExitInfo, type PromptResult } from './headless-session-manager.ts' import type { CaptureResult, Grader, PromptCase, TrajectoryRichness, TrajectoryStep } from './schemas.ts' -import { PromptCaseSchema, TokenUsageSchema, ToolInputSchema } from './schemas.ts' +import { PromptCaseSchema, ToolInputSchema } from './schemas.ts' // ============================================================================ // Types @@ -29,13 +29,13 @@ import { PromptCaseSchema, TokenUsageSchema, 
ToolInputSchema } from './schemas.t export type CaptureConfig = { /** Path to prompts.jsonl file */ promptsPath: string - /** ACP agent command (e.g., ['bunx', 'claude-code-acp']) */ - agentCommand: string[] + /** Path to agent schema JSON file */ + schemaPath: string /** Output file path (undefined for stdout) */ outputPath?: string /** Working directory for agent */ cwd?: string - /** Timeout per prompt in milliseconds */ + /** Timeout per prompt in milliseconds (overrides schema default) */ timeout?: number /** Show progress to stderr */ progress?: boolean @@ -43,6 +43,8 @@ export type CaptureConfig = { append?: boolean /** Optional grader function */ grader?: Grader + /** Enable debug mode for detailed output */ + debug?: boolean } // ============================================================================ @@ -65,57 +67,49 @@ export const loadPrompts = async (path: string): Promise => { }) } -/** Extract trajectory from session notifications */ -export const extractTrajectory = (notifications: SessionNotification[], startTime: number): TrajectoryStep[] => { +/** Extract trajectory from parsed updates */ +export const extractTrajectory = (updates: ParsedUpdate[], startTime: number): TrajectoryStep[] => { const trajectory: TrajectoryStep[] = [] const toolCallMap = new Map() - for (const notification of notifications) { + for (const update of updates) { const timestamp = Date.now() - startTime - const update = notification.update - if (update.sessionUpdate === 'agent_thought_chunk' && update.content.type === 'text') { + if (update.type === 'thought') { trajectory.push({ type: 'thought', - content: update.content.text, + content: update.content ?? '', timestamp, }) - } else if (update.sessionUpdate === 'agent_message_chunk' && update.content.type === 'text') { + } else if (update.type === 'message') { trajectory.push({ type: 'message', - content: update.content.text, + content: update.content ?? 
'', timestamp, }) - } else if (update.sessionUpdate === 'tool_call') { - const toolCall = update as ToolCall - const existing = toolCallMap.get(toolCall.toolCallId) + } else if (update.type === 'tool_call') { + const toolCallId = update.title ?? `tool_${Date.now()}` + const existing = toolCallMap.get(toolCallId) - if (existing) { + if (existing && update.status === 'completed') { // Update existing tool call with completion info - existing.step.status = toolCall.status ?? 'pending' - if (toolCall.content) { - existing.step.output = toolCall.content - } - if (toolCall.rawOutput) { - existing.step.output = toolCall.rawOutput - } + existing.step.status = update.status existing.step.duration = timestamp - existing.start - } else { + } else if (!existing) { // New tool call const step: TrajectoryStep & { type: 'tool_call' } = { type: 'tool_call', - name: toolCall.title, - status: toolCall.status ?? 'pending', - input: toolCall.rawInput, + name: update.title ?? 'unknown', + status: update.status ?? 'pending', timestamp, } - toolCallMap.set(toolCall.toolCallId, { start: timestamp, step }) + toolCallMap.set(toolCallId, { start: timestamp, step }) trajectory.push(step) } - } else if (update.sessionUpdate === 'plan') { + } else if (update.type === 'plan') { trajectory.push({ type: 'plan', - entries: update.entries, + entries: [], timestamp, }) } @@ -217,37 +211,6 @@ export const detectTrajectoryRichness = (trajectory: TrajectoryStep[]): Trajecto return hasMessages ? 'messages-only' : 'minimal' } -/** - * Extract token counts from session notifications if available. - * - * @remarks - * Token usage is adapter-dependent. If the adapter doesn't expose usage, - * these fields will be undefined. Uses Zod validation for runtime type safety. 
- */ -export const extractTokenCounts = (updates: SessionNotification[]): { inputTokens?: number; outputTokens?: number } => { - let inputTokens: number | undefined - let outputTokens: number | undefined - - for (const update of updates) { - // Check for token usage in update (adapter-specific) - // ACP SDK doesn't declare 'usage' field, but adapters extend it at runtime - const updateRecord = update as Record - const usageData = updateRecord.usage ?? (updateRecord.update as Record | undefined)?.usage - const usage = TokenUsageSchema.safeParse(usageData) - - if (usage.success) { - if (usage.data.inputTokens !== undefined) { - inputTokens = (inputTokens ?? 0) + usage.data.inputTokens - } - if (usage.data.outputTokens !== undefined) { - outputTokens = (outputTokens ?? 0) + usage.data.outputTokens - } - } - } - - return { inputTokens, outputTokens } -} - /** Get preview text for input (handles string or array) */ const getInputPreview = (input: string | string[]): string => { if (Array.isArray(input)) { @@ -274,33 +237,57 @@ const getInputPreview = (input: string | string[]): string => { export const runCapture = async (config: CaptureConfig): Promise => { const { promptsPath, - agentCommand, + schemaPath, outputPath, cwd, - timeout = DEFAULT_HARNESS_TIMEOUT, + timeout, progress = false, append = false, grader, + debug = false, } = config + // Load and validate schema + const schemaFile = Bun.file(schemaPath) + if (!(await schemaFile.exists())) { + throw new Error(`Schema file not found: ${schemaPath}`) + } + + let schema: HeadlessAdapterConfig + try { + const rawSchema = await schemaFile.json() + schema = parseHeadlessConfig(rawSchema) + } catch (error) { + throw new Error(`Invalid schema: ${error instanceof Error ? error.message : String(error)}`) + } + // Load prompts const prompts = await loadPrompts(promptsPath) // Resolve output path const resolvedOutputPath = outputPath ? 
resolvePath(outputPath) : undefined + // Determine effective timeout (CLI flag > schema default > harness default) + const schemaTimeout = 'timeout' in schema ? schema.timeout : undefined + const effectiveTimeout = timeout ?? schemaTimeout ?? DEFAULT_HARNESS_TIMEOUT + // Log progress info logProgress(`Loaded ${prompts.length} prompts from ${promptsPath}`, progress) - logProgress(`Command: ${agentCommand.join(' ')}`, progress) + logProgress(`Schema: ${schema.name} (${schemaPath})`, progress) + logProgress(`Timeout: ${effectiveTimeout}ms`, progress) if (resolvedOutputPath) { logProgress(`Output: ${resolvedOutputPath}`, progress) } + if (debug) { + logProgress(`Debug mode: enabled`, progress) + } - // Create ACP client - const client = createACPClient({ - command: agentCommand, - cwd, - timeout, + // Create session manager with schema + const sessions = createSessionManager({ + schema, + timeout: effectiveTimeout, + verbose: progress, + debug, }) // Clear output file if not appending @@ -308,130 +295,135 @@ export const runCapture = async (config: CaptureConfig): Promise 0 ? trajectory[0]?.timestamp : undefined, + sessionCreation, + total: endTime - startTime, + }, + toolErrors, + } - result = { - id: promptCase.id, - input: promptCase.input, // Preserve original (string or array) + // Apply grader if provided + if (grader) { + result.score = await grader({ + input: promptCase.input, output, - ...(promptCase.hint && { hint: promptCase.hint }), + hint: promptCase.hint, trajectory, - metadata: { - ...promptCase.metadata, - agent: agentCommand.join(' '), - trajectoryRichness, - turnCount, - }, - timing: { - start: startTime, - end: endTime, - firstResponse: trajectory.length > 0 ? 
trajectory[0]?.timestamp : undefined, - sessionCreation, - total: endTime - startTime, - ...(tokenCounts.inputTokens !== undefined && { inputTokens: tokenCounts.inputTokens }), - ...(tokenCounts.outputTokens !== undefined && { outputTokens: tokenCounts.outputTokens }), - }, - toolErrors, - } - - // Apply grader if provided - if (grader) { - result.score = await grader({ - input: promptCase.input, - output, - hint: promptCase.hint, - trajectory, - }) - } - } catch (error) { - const endTime = Date.now() - const message = error instanceof Error ? error.message : String(error) - const inputs = Array.isArray(promptCase.input) ? promptCase.input : [promptCase.input] + }) + } - result = { - id: promptCase.id, - input: promptCase.input, - output: '', - trajectory: [], - metadata: { - ...promptCase.metadata, - agent: agentCommand.join(' '), - trajectoryRichness: 'minimal' as TrajectoryRichness, - turnCount: inputs.length, - }, - timing: { - start: startTime, - end: endTime, - sessionCreation: 0, - total: endTime - startTime, - }, - toolErrors: true, - errors: [message], - } + // Clean up session + sessions.destroy(session.id) + } catch (error) { + const endTime = Date.now() + const message = error instanceof Error ? error.message : String(error) + const inputs = Array.isArray(promptCase.input) ? 
promptCase.input : [promptCase.input] + + result = { + id: promptCase.id, + input: promptCase.input, + output: '', + trajectory: [], + metadata: { + ...promptCase.metadata, + agent: schema.name, + trajectoryRichness: 'minimal' as TrajectoryRichness, + turnCount: inputs.length, + }, + timing: { + start: startTime, + end: endTime, + sessionCreation: 0, + total: endTime - startTime, + }, + toolErrors: true, + errors: [message], } + } - results.push(result) + results.push(result) - // Write result immediately - const formatted = JSON.stringify(result) - await writeOutput(formatted, resolvedOutputPath, !isFirstOutput) - isFirstOutput = false + // Write result immediately + const formatted = JSON.stringify(result) + await writeOutput(formatted, resolvedOutputPath, !isFirstOutput) + isFirstOutput = false - const statusIcon = result.toolErrors ? '!' : '✓' - logProgress(` ${statusIcon} (${result.timing.total}ms)`, progress) - } - } finally { - logProgress('Disconnecting...', progress) - await client.disconnect() + const statusIcon = result.toolErrors ? '!' : '✓' + const exitInfo = result.metadata?.timedOut + ? ' - TIMEOUT' + : result.metadata?.exitCode && result.metadata.exitCode !== 0 + ? 
` - exit ${result.metadata.exitCode}` + : '' + logProgress(` ${statusIcon} (${result.timing.total}ms)${exitInfo}`, progress) } logProgress('Done!', progress) @@ -451,12 +443,14 @@ export const capture = async (args: string[]): Promise => { const { values, positionals } = parseArgs({ args, options: { + schema: { type: 'string', short: 's' }, output: { type: 'string', short: 'o' }, cwd: { type: 'string', short: 'c' }, - timeout: { type: 'string', short: 't', default: String(DEFAULT_HARNESS_TIMEOUT) }, + timeout: { type: 'string', short: 't' }, progress: { type: 'boolean', default: false }, append: { type: 'boolean', default: false }, grader: { type: 'string', short: 'g' }, + debug: { type: 'boolean', default: false }, help: { type: 'boolean', short: 'h' }, }, allowPositionals: true, @@ -465,38 +459,47 @@ export const capture = async (args: string[]): Promise => { if (values.help) { // biome-ignore lint/suspicious/noConsole: CLI help output console.log(` -Usage: acp-harness capture [args...] [options] +Usage: agent-eval-harness capture --schema [options] Arguments: prompts.jsonl Input file with evaluation prompts - command [args] ACP agent command to execute Options: + -s, --schema Path to agent schema JSON file (required) -o, --output Output file (default: stdout) - -c, --cwd Working directory for agent (agents auto-discover MCP configs from here) - -t, --timeout Request timeout in ms (default: ${DEFAULT_HARNESS_TIMEOUT}) + -c, --cwd Working directory for agent + -t, --timeout Request timeout in ms (overrides schema default) --progress Show progress to stderr --append Append to output file instead of overwriting -g, --grader Path to grader (.ts/.js module or executable script) + --debug Enable debug mode (shows raw output, JSONPath matching) -h, --help Show this help message Output Format: Full trajectory JSONL with toolErrors indicator. - Use 'acp-harness summarize' to derive compact views. + Use 'agent-eval-harness summarize' to derive compact views. 
+ +Exit Info (in metadata): + exitCode Process exit code (null if killed/timed out) + signal Signal that killed process (if any) + timedOut true if process was killed due to timeout Graders: TS/JS modules must export a 'grade' function. Executable scripts (Python, etc.) use stdin/stdout JSON protocol. Examples: - # Basic capture - acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl + # Basic capture with schema + agent-eval-harness capture prompts.jsonl --schema claude.json -o results.jsonl # With TypeScript grader - acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.ts -o results.jsonl + agent-eval-harness capture prompts.jsonl -s claude.json --grader ./grader.ts -o results.jsonl + + # With debug mode + agent-eval-harness capture prompts.jsonl -s claude.json --debug -o results.jsonl - # With Python grader - acp-harness capture prompts.jsonl bunx claude-code-acp --grader ./grader.py -o results.jsonl + # With per-prompt timeout override (in prompts.jsonl): + {"id": "slow-task", "input": "...", "timeout": 180000} `) return } @@ -507,10 +510,9 @@ Examples: process.exit(1) } - const agentCommand = positionals.slice(1) - if (agentCommand.length === 0) { - console.error('Error: ACP agent command is required') - console.error('Example: acp-harness capture prompts.jsonl bunx claude-code-acp') + if (!values.schema) { + console.error('Error: --schema is required') + console.error('Example: agent-eval-harness capture prompts.jsonl --schema ./claude.json') process.exit(1) } @@ -527,12 +529,13 @@ Examples: await runCapture({ promptsPath, - agentCommand, + schemaPath: values.schema, outputPath: values.output, cwd: values.cwd, - timeout: Number.parseInt(values.timeout ?? String(DEFAULT_HARNESS_TIMEOUT), 10), + timeout: values.timeout ? Number.parseInt(values.timeout, 10) : undefined, progress: values.progress ?? false, append: values.append ?? false, grader, + debug: values.debug ?? 
false, }) } diff --git a/src/headless-cli.ts b/src/headless-cli.ts index e5370a7..b6073a3 100644 --- a/src/headless-cli.ts +++ b/src/headless-cli.ts @@ -1,15 +1,15 @@ #!/usr/bin/env bun /** - * Headless ACP adapter factory CLI entry point. + * Headless adapter factory CLI entry point. * * @remarks - * This module implements a schema-driven ACP adapter that can interact with + * This module implements a schema-driven adapter that can interact with * ANY headless CLI agent. The adapter: * * 1. Reads a JSON schema defining how to interact with the CLI * 2. Spawns the CLI process per schema's command + flags * 3. Parses stdout using schema's outputEvents mappings - * 4. Emits ACP session/update notifications + * 4. Emits session update notifications * 5. Manages session state for multi-turn (stream or iterative mode) * * @packageDocumentation @@ -359,7 +359,7 @@ export const headless = async (args: string[]): Promise => { if (values.help) { // biome-ignore lint/suspicious/noConsole: CLI help output console.log(` -Usage: acp-harness headless --schema [--verbose] +Usage: agent-eval-harness headless --schema [--verbose] Arguments: -s, --schema Path to headless adapter schema (JSON) @@ -367,9 +367,9 @@ Arguments: -h, --help Show this help message Description: - Schema-driven ACP adapter for ANY headless CLI agent. The adapter reads + Schema-driven adapter for ANY headless CLI agent. The adapter reads a JSON schema defining how to interact with the CLI and translates between - ACP protocol and CLI stdio. + protocol and CLI stdio. 
Schema Format: { @@ -385,23 +385,17 @@ Schema Format: Examples: # Run with Claude headless schema - acp-harness headless --schema ./claude-headless.json + agent-eval-harness headless --schema ./claude-headless.json # Use in capture pipeline - acp-harness capture prompts.jsonl \\ - acp-harness headless --schema ./claude-headless.json \\ - -o results.jsonl - - # Validate adapter compliance - acp-harness adapter:check \\ - acp-harness headless --schema ./gemini-headless.json + agent-eval-harness capture prompts.jsonl --schema ./claude-headless.json -o results.jsonl `) return } if (!values.schema) { console.error('Error: --schema is required') - console.error('Example: acp-harness headless --schema ./my-agent.json') + console.error('Example: agent-eval-harness headless --schema ./my-agent.json') process.exit(1) } diff --git a/src/headless-session-manager.ts b/src/headless-session-manager.ts index 40a266f..19aa8d3 100644 --- a/src/headless-session-manager.ts +++ b/src/headless-session-manager.ts @@ -38,6 +38,16 @@ export type Session = { turnCount: number } +/** Process exit information for debugging */ +export type ProcessExitInfo = { + /** Exit code (null if killed by signal or timed out) */ + exitCode: number | null + /** Signal that killed the process (if any) */ + signal?: string + /** Whether the process was killed due to timeout */ + timedOut: boolean +} + /** Update callback for emitting ACP session updates */ export type UpdateCallback = (update: ParsedUpdate) => void @@ -49,16 +59,27 @@ export type PromptResult = { updates: ParsedUpdate[] /** Session ID from CLI (if available) */ cliSessionId?: string + /** Process exit information */ + exitInfo?: ProcessExitInfo } /** Session manager configuration */ export type SessionManagerConfig = { /** Headless adapter configuration */ schema: HeadlessAdapterConfig - /** Default timeout for operations in ms */ + /** Default timeout for operations in ms (overrides schema timeout) */ timeout?: number - /** Whether to show 
debug output (constructed commands) */ + /** Whether to show debug output (constructed commands, raw stdout) */ verbose?: boolean + /** + * Debug mode - shows detailed output for troubleshooting. + * When enabled: + * - Raw CLI stdout/stderr is logged + * - JSONPath match attempts and results are shown + * - Process spawn/exit info is displayed + * - Timing for each stage is reported + */ + debug?: boolean } // ============================================================================ @@ -86,10 +107,26 @@ export type SessionManagerConfig = { * @returns Session manager with create, prompt, and cancel methods */ export const createSessionManager = (config: SessionManagerConfig) => { - const { schema, timeout = 60000, verbose = false } = config + const { schema, verbose = false, debug = false } = config + // Use schema timeout if available, otherwise default to 60000ms + const schemaTimeout = 'timeout' in schema ? (schema.timeout ?? 60000) : 60000 + const timeout = config.timeout ?? schemaTimeout const sessions = new Map() const outputParser = createOutputParser(schema) + /** + * Debug logging helper - only logs when debug mode is enabled. + */ + const debugLog = (category: string, message: string, data?: unknown): void => { + if (debug) { + const timestamp = new Date().toISOString() + console.error(`[${timestamp}] [${category}] ${message}`) + if (data !== undefined) { + console.error(JSON.stringify(data, null, 2)) + } + } + } + /** * Creates a new session. 
* @@ -108,8 +145,16 @@ export const createSessionManager = (config: SessionManagerConfig) => { // Initialize mode-specific state if (schema.sessionMode === 'iterative') { + // Normalize historyTemplate: v2 schemas can have object format, convert to string + let templateString: string | undefined + if (typeof schema.historyTemplate === 'object' && schema.historyTemplate !== null) { + // Use turnFormat from object-style template + templateString = schema.historyTemplate.turnFormat + } else { + templateString = schema.historyTemplate + } session.history = createHistoryBuilder({ - template: schema.historyTemplate, + template: templateString, }) } @@ -190,7 +235,7 @@ export const createSessionManager = (config: SessionManagerConfig) => { } } - return collectOutput(session, outputParser, onUpdate, timeout) + return collectOutput(session, outputParser, onUpdate, timeout, debugLog) } /** @@ -221,7 +266,7 @@ export const createSessionManager = (config: SessionManagerConfig) => { writePromptToStdin(session.process, fullPrompt, true) } - const result = await collectOutput(session, outputParser, onUpdate, timeout) + const result = await collectOutput(session, outputParser, onUpdate, timeout, debugLog) // Store in history for next turn session.history?.addTurn(promptText, result.output) @@ -269,7 +314,7 @@ export const createSessionManager = (config: SessionManagerConfig) => { } // Debug output: show constructed command - if (verbose) { + if (verbose || debug) { const stdinNote = schema.prompt.stdin ? 
' (+ stdin)' : '' console.error(`[headless] Command: ${args.join(' ')}${stdinNote}`) } @@ -374,19 +419,22 @@ const writePromptToStdin = (process: Subprocess, prompt: string, closeAfterWrite * @param session - Active session * @param parser - Output parser * @param onUpdate - Update callback - * @param timeout - Timeout in ms + * @param timeoutMs - Timeout in ms + * @param logDebug - Debug logging function * @returns Collected output and updates */ const collectOutput = async ( session: Session, parser: OutputParser, onUpdate: UpdateCallback | undefined, - timeout: number, + timeoutMs: number, + logDebug: (category: string, message: string, data?: unknown) => void, ): Promise => { const updates: ParsedUpdate[] = [] let output = '' let cliSessionId: string | undefined const accumulatedMessages: string[] = [] + let timedOut = false const stdout = session.process?.stdout if (!stdout || typeof stdout === 'number') { @@ -397,18 +445,29 @@ const collectOutput = async ( const decoder = new TextDecoder() let buffer = '' - const timeoutPromise = new Promise((_, reject) => { - setTimeout(() => reject(new Error(`Prompt timed out after ${timeout}ms`)), timeout) + // Track timeout with a timer ID so we can clear it + let timeoutId: Timer | undefined + + const timeoutPromise = new Promise<'timeout'>((resolve) => { + timeoutId = setTimeout(() => resolve('timeout'), timeoutMs) }) + logDebug('process', `Starting output collection with ${timeoutMs}ms timeout`) + try { - const readLoop = async () => { + const readLoop = async (): Promise<'complete'> => { readLines: while (true) { const { done, value } = await reader.read() - if (done) break + if (done) { + logDebug('process', 'Process stdout closed') + break + } - buffer += decoder.decode(value, { stream: true }) + const chunk = decoder.decode(value, { stream: true }) + logDebug('raw', `Received ${chunk.length} bytes`) + + buffer += chunk // Process complete lines const lines = buffer.split('\n') @@ -417,6 +476,8 @@ const 
collectOutput = async ( for (const line of lines) { if (!line.trim()) continue + logDebug('line', `Processing line: ${line.slice(0, 100)}${line.length > 100 ? '...' : ''}`) + // Parse as update first (so updates are emitted even for result lines) const update = parser.parseLine(line) if (update !== null) { @@ -424,6 +485,12 @@ const collectOutput = async ( const updatesToProcess = Array.isArray(update) ? update : [update] for (const singleUpdate of updatesToProcess) { + logDebug('parse', `Matched event: ${singleUpdate.type}`, { + title: singleUpdate.title, + status: singleUpdate.status, + content: singleUpdate.content?.slice(0, 50), + }) + updates.push(singleUpdate) onUpdate?.(singleUpdate) @@ -438,35 +505,81 @@ const collectOutput = async ( if (typeof raw.session_id === 'string') { cliSessionId = raw.session_id session.cliSessionId = cliSessionId + logDebug('session', `Extracted CLI session ID: ${cliSessionId}`) } } } + } else { + logDebug('parse', 'No matching event mapping for line') } // Check for final result (after emitting update) const resultCheck = parser.parseResult(line) if (resultCheck.isResult) { output = resultCheck.content + logDebug('result', `Found result: ${output.slice(0, 100)}${output.length > 100 ? '...' 
: ''}`) break readLines // Exit both loops immediately on result } } } + return 'complete' } - await Promise.race([readLoop(), timeoutPromise]) + const raceResult = await Promise.race([readLoop(), timeoutPromise]) + + if (raceResult === 'timeout') { + timedOut = true + logDebug('timeout', `Process timed out after ${timeoutMs}ms`) + + // Kill the process on timeout + if (session.process && !session.process.killed) { + session.process.kill('SIGTERM') + logDebug('process', 'Sent SIGTERM to process') + } + } } finally { + if (timeoutId) { + clearTimeout(timeoutId) + } reader.releaseLock() } // Fallback: if result contentPath didn't yield output, use accumulated messages if (!output && accumulatedMessages.length > 0) { output = accumulatedMessages.join('\n') + logDebug('fallback', `Using accumulated messages as output (${accumulatedMessages.length} messages)`) + } + + // Get exit info from process + let exitInfo: ProcessExitInfo | undefined + if (session.process) { + try { + // Wait for process to exit (with a short timeout to not block) + const exitCode = await Promise.race([ + session.process.exited, + new Promise((resolve) => setTimeout(() => resolve(null), 1000)), + ]) + + exitInfo = { + exitCode: exitCode, + timedOut, + signal: timedOut ? 'SIGTERM' : undefined, + } + + logDebug('exit', `Process exit info`, exitInfo) + } catch { + exitInfo = { + exitCode: null, + timedOut, + } + } } return { output, updates, cliSessionId, + exitInfo, } } diff --git a/src/headless.schemas.ts b/src/headless.schemas.ts index a27c63f..f1b8a0e 100644 --- a/src/headless.schemas.ts +++ b/src/headless.schemas.ts @@ -160,9 +160,62 @@ export type ResultConfig = z.infer // ============================================================================ /** - * Schema for headless ACP adapter configuration. + * Schema for headless adapter configuration (version 1). * * @remarks + * Version 1 is maintained for backwards compatibility. + * New features should use version 2. 
+ */ +export const HeadlessAdapterSchemaV1 = z.object({ + /** Schema version 1 */ + version: z.literal(1), + + /** Human-readable adapter name */ + name: z.string(), + + /** Base command to spawn (e.g., ["claude"], ["gemini"]) */ + command: z.array(z.string()), + + /** + * Session mode determines how multi-turn conversations work: + * - 'stream': Keep process alive, multi-turn via stdin + * - 'iterative': New process per turn, accumulate context in prompt + */ + sessionMode: z.enum(['stream', 'iterative']), + + /** How to pass the prompt */ + prompt: PromptConfigSchema, + + /** Output format configuration */ + output: OutputConfigSchema, + + /** Flags for auto-approval in headless mode (e.g., ["--allowedTools", "*"]) */ + autoApprove: z.array(z.string()).optional(), + + /** Session resume support (stream mode only) */ + resume: ResumeConfigSchema.optional(), + + /** Working directory flag (if CLI needs explicit --cwd) */ + cwdFlag: z.string().optional(), + + /** Output event mappings - how to parse CLI output into updates */ + outputEvents: z.array(OutputEventMappingSchema), + + /** Final result extraction configuration */ + result: ResultConfigSchema, + + /** Template for formatting conversation history (iterative mode only) */ + historyTemplate: z.string().optional(), +}) + +/** + * Schema for headless adapter configuration (version 2). 
+ * + * @remarks + * Version 2 adds: + * - `timeout`: Per-agent default timeout in milliseconds + * - `historyTemplate`: More structured template with system and turnFormat + * * This schema defines everything needed to interact with a headless CLI agent: * - Command and flags to spawn * - How to pass prompts @@ -172,19 +225,20 @@ export type ResultConfig = z.infer * Example (Claude): * ```json * { - * "version": 1, + * "version": 2, * "name": "claude-headless", * "command": ["claude"], * "sessionMode": "stream", + * "timeout": 90000, * "prompt": { "flag": "-p" }, * "output": { "flag": "--output-format", "value": "stream-json" }, * "outputEvents": [...] * } * ``` */ -export const HeadlessAdapterSchema = z.object({ - /** Schema version for forward compatibility */ - version: z.literal(1), +export const HeadlessAdapterSchemaV2 = z.object({ + /** Schema version 2 */ + version: z.literal(2), /** Human-readable adapter name */ name: z.string(), @@ -199,6 +253,9 @@ export const HeadlessAdapterSchema = z.object({ */ sessionMode: z.enum(['stream', 'iterative']), + /** Default timeout for this agent in milliseconds (can be overridden per-prompt) */ + timeout: z.number().optional(), + /** How to pass the prompt */ prompt: PromptConfigSchema, @@ -214,16 +271,38 @@ export const HeadlessAdapterSchema = z.object({ /** Working directory flag (if CLI needs explicit --cwd) */ cwdFlag: z.string().optional(), - /** Output event mappings - how to parse CLI output into ACP updates */ + /** Output event mappings - how to parse CLI output into updates */ outputEvents: z.array(OutputEventMappingSchema), /** Final result extraction configuration */ result: ResultConfigSchema, - /** Template for formatting conversation history (iterative mode only) */ - historyTemplate: z.string().optional(), + /** + * Template for formatting conversation history (iterative mode only). 
+ * + * @remarks + * Version 2 supports both string format (simple) and object format (advanced): + * - String: "User: {{input}}\nAssistant: {{output}}" + * - Object: { system: "...", turnFormat: "..." } + */ + historyTemplate: z + .union([ + z.string(), + z.object({ + /** System prefix for accumulated history */ + system: z.string().optional(), + /** Format for each turn: {{input}} and {{output}} placeholders */ + turnFormat: z.string(), + }), + ]) + .optional(), }) +/** + * Schema for headless adapter configuration (supports v1 and v2). + */ +export const HeadlessAdapterSchema = z.union([HeadlessAdapterSchemaV1, HeadlessAdapterSchemaV2]) + /** Headless adapter configuration type */ export type HeadlessAdapterConfig = z.infer<typeof HeadlessAdapterSchema> diff --git a/src/headless.ts b/src/headless.ts index 34f8484..6e44ab6 100644 --- a/src/headless.ts +++ b/src/headless.ts @@ -1,5 +1,5 @@ /** - * Headless ACP adapter factory - schema-driven adapter for any CLI agent. + * Headless adapter factory - schema-driven adapter for any CLI agent. * * @remarks * Re-exports public API from the headless module. 
The headless adapter enables @@ -8,12 +8,12 @@ * * **CLI Usage:** * ```bash - * acp-harness headless --schema ./my-agent.json + * agent-eval-harness headless --schema ./my-agent.json * ``` * * **Programmatic Usage:** * ```typescript - * import { parseHeadlessConfig, createSessionManager } from '@plaited/acp-harness/headless' + * import { parseHeadlessConfig, createSessionManager } from '@plaited/agent-eval-harness/headless' * * const schema = parseHeadlessConfig(jsonConfig) * const sessions = createSessionManager({ schema }) @@ -61,6 +61,7 @@ export type { // Output parser export { createOutputParser, jsonPath, jsonPathString } from './headless-output-parser.ts' export type { + ProcessExitInfo, PromptResult, Session, SessionManager, diff --git a/src/integration_tests/acp-claude.spec.ts b/src/integration_tests/acp-claude.spec.ts deleted file mode 100644 index 9ed3639..0000000 --- a/src/integration_tests/acp-claude.spec.ts +++ /dev/null @@ -1,170 +0,0 @@ -/** - * Headless Adapter integration Tests - Claude Code - * - * @remarks - * These tests verify the headless ACP adapter works correctly with Claude Code - * using the schema-driven approach from `.claude/skills/acp-adapters/schemas/`. - * - * Run locally with API key: - * ```bash - * ANTHROPIC_API_KEY=sk-... bun test ./src/tests/acp-claude.spec.ts - * ``` - * - * Prerequisites: - * 1. Claude CLI installed (`bunx @anthropic-ai/claude-code`) - * 2. API key: `ANTHROPIC_API_KEY` environment variable - * - * These tests make real API calls and consume credits. 
- * - * MCP servers are auto-discovered from project root via: - * - `.mcp.json` - MCP server configuration - */ - -import { afterAll, beforeAll, describe, expect, setDefaultTimeout, test } from 'bun:test' -import { join } from 'node:path' -import { type ACPClient, createACPClient } from '../acp-client.ts' -import { createPrompt, summarizeResponse } from '../acp-helpers.ts' - -// Long timeout for real agent interactions (2 minutes) -setDefaultTimeout(120000) - -// Use project root as cwd - agents discover MCP servers from config files -const PROJECT_ROOT = process.cwd() - -// Schema path for Claude headless adapter -const SCHEMA_PATH = join(PROJECT_ROOT, '.claude/skills/acp-adapters/schemas/claude-headless.json') - -// Get API key from environment -const API_KEY = process.env.ANTHROPIC_API_KEY ?? '' - -// Skip all tests if no API key is available -const describeWithApiKey = API_KEY ? describe : describe.skip - -describeWithApiKey('Headless Adapter Integration - Claude', () => { - let client: ACPClient - - beforeAll(async () => { - // Use headless adapter with Claude schema - client = createACPClient({ - command: ['bun', 'src/headless-cli.ts', '--', '--schema', SCHEMA_PATH], - timeout: 120000, // 2 min timeout for initialization - env: { - ANTHROPIC_API_KEY: API_KEY, - }, - }) - - await client.connect() - }) - - afterAll(async () => { - await client?.disconnect() - }) - - test('connects and initializes via headless adapter', () => { - expect(client.isConnected()).toBe(true) - - const initResult = client.getInitializeResult() - expect(initResult).toBeDefined() - expect(initResult?.protocolVersion).toBeDefined() - }) - - test('reports agent capabilities', () => { - const capabilities = client.getCapabilities() - expect(capabilities).toBeDefined() - }) - - test('creates session with project cwd', async () => { - // Session uses project root - agent discovers MCP servers from .mcp.json - const session = await client.createSession({ - cwd: PROJECT_ROOT, - }) - - 
expect(session).toBeDefined() - expect(session.id).toBeDefined() - expect(typeof session.id).toBe('string') - }) - - test('sends prompt and receives response', async () => { - const session = await client.createSession({ - cwd: PROJECT_ROOT, - }) - - // Simple prompt that doesn't require tools - const { result, updates } = await client.promptSync( - session.id, - createPrompt('What is 2 + 2? Reply with just the number.'), - ) - - expect(result).toBeDefined() - expect(updates).toBeInstanceOf(Array) - - // Summarize and verify response structure - const summary = summarizeResponse(updates) - expect(summary.text).toBeDefined() - expect(summary.text.length).toBeGreaterThan(0) - }) - - test('streaming prompt yields updates', async () => { - const session = await client.createSession({ - cwd: PROJECT_ROOT, - }) - - const events: string[] = [] - - for await (const event of client.prompt(session.id, createPrompt('Say "hello" and nothing else.'))) { - events.push(event.type) - if (event.type === 'complete') { - expect(event.result).toBeDefined() - } - } - - expect(events).toContain('complete') - }) - - test('uses MCP server from project config', async () => { - // This test verifies that Claude discovers MCP servers from .mcp.json - // The bun-docs MCP server is configured at project root - const session = await client.createSession({ - cwd: PROJECT_ROOT, - }) - - // Query the bun-docs MCP server (configured in .mcp.json) - const { updates } = await client.promptSync( - session.id, - createPrompt( - 'Use the bun-docs MCP server to search for information about Bun.serve(). 
' + - 'What are the key options for creating an HTTP server with Bun?', - ), - ) - - const summary = summarizeResponse(updates) - - // Response should contain Bun server-related information - expect(summary.text.length).toBeGreaterThan(0) - // Should mention server/HTTP-related concepts from Bun docs - expect(summary.text.toLowerCase()).toMatch(/serve|server|http|port|fetch|handler/) - }) - - test('multi-turn conversation maintains context', async () => { - // Multi-turn: multiple prompts to same session via headless adapter - const session = await client.createSession({ - cwd: PROJECT_ROOT, - }) - - // Turn 1: Establish context - const { updates: turn1Updates } = await client.promptSync( - session.id, - createPrompt('Remember this number: 42. Just confirm you have it.'), - ) - const turn1Summary = summarizeResponse(turn1Updates) - expect(turn1Summary.text).toMatch(/42|forty.?two|remember/i) - - // Turn 2: Reference previous context - const { updates: turn2Updates } = await client.promptSync( - session.id, - createPrompt('What number did I ask you to remember? Reply with just the number.'), - ) - const turn2Summary = summarizeResponse(turn2Updates) - expect(turn2Summary.text).toMatch(/42/) - }) -}) diff --git a/src/integration_tests/acp-gemini.spec.ts b/src/integration_tests/acp-gemini.spec.ts deleted file mode 100644 index c1602b7..0000000 --- a/src/integration_tests/acp-gemini.spec.ts +++ /dev/null @@ -1,174 +0,0 @@ -/** - * Headless Adapter integration Tests - Gemini CLI - * - * @remarks - * These tests verify the headless ACP adapter works correctly with Gemini CLI - * using the schema-driven approach from `.claude/skills/acp-adapters/schemas/`. - * - * Run locally with API key: - * ```bash - * GEMINI_API_KEY=... bun test ./src/tests/acp-gemini.spec.ts - * ``` - * - * Prerequisites: - * 1. Gemini CLI installed (`npm install -g @anthropic-ai/gemini-cli`) - * 2. 
API key: `GEMINI_API_KEY` environment variable - * - * These tests make real API calls and consume credits. - * - * MCP servers are auto-discovered from project root via: - * - `.gemini/settings.json` - Gemini MCP server configuration - */ - -import { afterAll, beforeAll, describe, expect, setDefaultTimeout, test } from 'bun:test' -import { join } from 'node:path' -import { type ACPClient, createACPClient } from '../acp-client.ts' -import { createPrompt, summarizeResponse } from '../acp-helpers.ts' - -// Long timeout for real agent interactions (2 minutes) -setDefaultTimeout(120000) - -// Use project root as cwd - agents discover MCP servers from config files -const PROJECT_ROOT = process.cwd() - -// Schema path for Gemini headless adapter -const SCHEMA_PATH = join(PROJECT_ROOT, '.claude/skills/acp-adapters/schemas/gemini-headless.json') - -// Gemini CLI accepts GEMINI_API_KEY -// Use either one if available -const GEMINI_API_KEY = process.env.GEMINI_API_KEY ?? '' - -// Skip all tests if no API key is available -const describeWithApiKey = GEMINI_API_KEY ? 
describe : describe.skip - -describeWithApiKey('Headless Adapter Integration - Gemini', () => { - let client: ACPClient - - beforeAll(async () => { - // Use headless adapter with Gemini schema - // Pass both API key variants - Gemini CLI should pick up whichever it prefers - client = createACPClient({ - command: ['bun', 'src/headless-cli.ts', '--', '--schema', SCHEMA_PATH], - timeout: 120000, // 2 min timeout for initialization - env: { - GEMINI_API_KEY, - }, - }) - - await client.connect() - }) - - afterAll(async () => { - await client?.disconnect() - }) - - test('connects and initializes via headless adapter', () => { - expect(client.isConnected()).toBe(true) - - const initResult = client.getInitializeResult() - expect(initResult).toBeDefined() - expect(initResult?.protocolVersion).toBeDefined() - }) - - test('reports agent capabilities', () => { - const capabilities = client.getCapabilities() - expect(capabilities).toBeDefined() - }) - - test('creates session with project cwd', async () => { - // Session uses project root - agent discovers MCP servers from .gemini/settings.json - const session = await client.createSession({ - cwd: PROJECT_ROOT, - }) - - expect(session).toBeDefined() - expect(session.id).toBeDefined() - expect(typeof session.id).toBe('string') - }) - - test('sends prompt and receives response', async () => { - const session = await client.createSession({ - cwd: PROJECT_ROOT, - }) - - // Simple prompt that doesn't require tools - const { result, updates } = await client.promptSync( - session.id, - createPrompt('What is 2 + 2? 
Reply with just the number.'), - ) - - expect(result).toBeDefined() - expect(updates).toBeInstanceOf(Array) - - // Summarize and verify response structure - const summary = summarizeResponse(updates) - expect(summary.text).toBeDefined() - expect(summary.text.length).toBeGreaterThan(0) - // Should contain "4" somewhere in the response - expect(summary.text).toMatch(/4/) - }) - - test('streaming prompt yields updates', async () => { - const session = await client.createSession({ - cwd: PROJECT_ROOT, - }) - - const events: string[] = [] - - for await (const event of client.prompt(session.id, createPrompt('Say "hello" and nothing else.'))) { - events.push(event.type) - if (event.type === 'complete') { - expect(event.result).toBeDefined() - } - } - - expect(events).toContain('complete') - }) - - test('uses MCP server from project config', async () => { - // This test verifies that Gemini discovers MCP servers from .gemini/settings.json - // The agent-client-protocol MCP server is configured at project root - const session = await client.createSession({ - cwd: PROJECT_ROOT, - }) - - // Query the agent-client-protocol MCP server (configured in .gemini/settings.json) - const { updates } = await client.promptSync( - session.id, - createPrompt( - 'Use the agent-client-protocol MCP server to search for information about ACP. 
' + - 'What is the Agent Client Protocol and what problem does it solve?', - ), - ) - - const summary = summarizeResponse(updates) - - // Response should contain ACP-related information - expect(summary.text.length).toBeGreaterThan(0) - // Should mention protocol/agent-related concepts - expect(summary.text.toLowerCase()).toMatch(/agent|protocol|client|json-rpc|stdio/) - }) - - test('multi-turn conversation maintains context (iterative mode)', async () => { - // Multi-turn via headless adapter in iterative mode (history accumulation) - const session = await client.createSession({ - cwd: PROJECT_ROOT, - }) - - // Turn 1: Establish context - const { updates: turn1Updates } = await client.promptSync( - session.id, - createPrompt('Remember this number: 42. Just confirm you have it.'), - ) - const turn1Summary = summarizeResponse(turn1Updates) - expect(turn1Summary.text).toMatch(/42|forty.?two|remember/i) - - // Turn 2: Reference previous context - const { updates: turn2Updates } = await client.promptSync( - session.id, - createPrompt('What number did I ask you to remember? 
Reply with just the number.'), - ) - const turn2Summary = summarizeResponse(turn2Updates) - expect(turn2Summary.text).toMatch(/42/) - }) -}) diff --git a/src/schemas-cli.ts b/src/schemas-cli.ts index c8f2723..e20f0ab 100644 --- a/src/schemas-cli.ts +++ b/src/schemas-cli.ts @@ -195,7 +195,7 @@ export const schemasCli = async (args: string[]): Promise => { if (values.help) { // biome-ignore lint/suspicious/noConsole: CLI help output console.log(` -Usage: acp-harness schemas [schema-name] [options] +Usage: agent-eval-harness schemas [schema-name] [options] Arguments: schema-name Specific schema to export (optional) @@ -214,17 +214,17 @@ Available Schemas: Examples: # List available schemas - acp-harness schemas --list + agent-eval-harness schemas --list # Export all schemas as single JSON file - acp-harness schemas --json -o schemas.json + agent-eval-harness schemas --json -o schemas.json # Export specific schema - acp-harness schemas CaptureResult --json - acp-harness schemas TrialResult --json -o trial-schema.json + agent-eval-harness schemas CaptureResult --json + agent-eval-harness schemas TrialResult --json -o trial-schema.json # Export all schemas as separate files - acp-harness schemas --json --split -o schemas/ + agent-eval-harness schemas --json --split -o schemas/ `) return } diff --git a/src/schemas.ts b/src/schemas.ts index ad114c1..b7f852b 100644 --- a/src/schemas.ts +++ b/src/schemas.ts @@ -1,5 +1,5 @@ /** - * Unified Zod schemas and types for the ACP harness. + * Unified Zod schemas and types for the agent eval harness. * * @remarks * This module follows a schema-first approach where Zod schemas are the @@ -7,36 +7,21 @@ * * **Exports:** * - Harness schemas: PromptCaseSchema, GraderResultSchema, CaptureResultSchema, etc. - * - JSON-RPC schemas: JsonRpcRequestSchema, JsonRpcResponseSchema, etc. - * - ACP SDK type schemas: SessionNotificationSchema, RequestPermissionRequestSchema + * - JSON-RPC schemas: JsonRpcRequestSchema, JsonRpcResponseSchema, etc. 
(for headless adapter) * - All inferred types via `z.infer<>` * * **JSON Schema generation (Zod 4):** * ```typescript * import { z } from 'zod' - * import { CaptureResultSchema } from '@plaited/acp-harness/schemas' + * import { CaptureResultSchema } from '@plaited/agent-eval-harness/schemas' * const jsonSchema = z.toJSONSchema(CaptureResultSchema) * ``` * * @packageDocumentation */ -import type { RequestPermissionRequest, SessionId, SessionNotification } from '@agentclientprotocol/sdk' import { z } from 'zod' -// ============================================================================ -// Internal Type Utilities -// ============================================================================ - -/** Precise type detection beyond typeof operator */ -const trueTypeOf = (obj?: unknown): string => Object.prototype.toString.call(obj).slice(8, -1).toLowerCase() - -/** Type guard for precise type checking with TypeScript narrowing */ -const isTypeOf = (obj: unknown, type: string): obj is T => trueTypeOf(obj) === type - -/** Type guard for object shape validation */ -const isRecord = (val: unknown): val is Record => isTypeOf>(val, 'object') - // ============================================================================ // Session Types // ============================================================================ @@ -45,7 +30,7 @@ const isRecord = (val: unknown): val is Record => isTypeOf, + id: z.string(), _meta: z.record(z.string(), z.unknown()).nullish(), }) @@ -53,7 +38,7 @@ export const SessionSchema = z.object({ export type Session = z.infer // ============================================================================ -// JSON-RPC 2.0 Schemas +// JSON-RPC 2.0 Schemas (for headless adapter) // ============================================================================ /** JSON-RPC version literal */ @@ -72,7 +57,6 @@ const RequestIdSchema = z.union([z.string(), z.number()]) * - `-32601`: Method not found * - `-32602`: Invalid params * - `-32603`: Internal 
error - * - `-32800`: Request cancelled (ACP extension) */ export const JsonRpcErrorSchema = z.object({ code: z.number(), @@ -147,33 +131,6 @@ export const JsonRpcMessageSchema = z.union([JsonRpcRequestSchema, JsonRpcNotifi /** Union of all JSON-RPC message types */ export type JsonRpcMessage = JsonRpcRequest | JsonRpcNotification | JsonRpcResponse -// ============================================================================ -// ACP SDK Type Schemas (Custom Validators) -// ============================================================================ - -/** - * Schema for session update notifications. - * - * @remarks - * Validates `sessionId` and `update` fields used in notification handling. - * Uses z.custom() to validate SDK types at runtime while keeping SDK types - * as the source of truth. - */ -export const SessionNotificationSchema = z.custom( - (val): val is SessionNotification => - isRecord(val) && 'sessionId' in val && typeof val.sessionId === 'string' && 'update' in val && isRecord(val.update), -) - -/** - * Schema for permission requests from agent. - * - * @remarks - * Validates `options` array used in permission handling. - */ -export const RequestPermissionRequestSchema = z.custom( - (val): val is RequestPermissionRequest => isRecord(val) && 'options' in val && Array.isArray(val.options), -) - // ============================================================================ // MCP Server Configuration Schemas // ============================================================================ @@ -297,24 +254,6 @@ export const ToolInputSchema = z /** Tool input type */ export type ToolInput = z.infer -/** - * Token usage schema for adapter-specific usage data. - * - * @remarks - * ACP SDK's SessionNotification doesn't declare a 'usage' field, but adapters - * like Claude Code extend responses with token counts at runtime. This schema - * provides runtime validation for that extension. 
- */ -export const TokenUsageSchema = z - .object({ - inputTokens: z.number().optional(), - outputTokens: z.number().optional(), - }) - .passthrough() - -/** Token usage type */ -export type TokenUsage = z.infer - /** Thought trajectory step */ export const ThoughtStepSchema = z.object({ type: z.literal('thought'), diff --git a/src/tests/acp-client.spec.ts b/src/tests/acp-client.spec.ts deleted file mode 100644 index 20dd306..0000000 --- a/src/tests/acp-client.spec.ts +++ /dev/null @@ -1,205 +0,0 @@ -import { describe, expect, test } from 'bun:test' -import { ACPClientError, createACPClient } from '../acp-client.ts' - -// ============================================================================ -// ACPClientError Tests -// ============================================================================ - -describe('ACPClientError', () => { - test('creates error with message only', () => { - const error = new ACPClientError('Connection failed') - expect(error.message).toBe('Connection failed') - expect(error.name).toBe('ACPClientError') - expect(error.code).toBeUndefined() - }) - - test('creates error with code', () => { - const error = new ACPClientError('Not connected', 'NOT_CONNECTED') - expect(error.code).toBe('NOT_CONNECTED') - }) - - test('is instance of Error', () => { - const error = new ACPClientError('Test') - expect(error instanceof Error).toBe(true) - expect(error instanceof ACPClientError).toBe(true) - }) -}) - -// ============================================================================ -// Client Factory Tests -// ============================================================================ - -describe('createACPClient', () => { - test('creates client with minimal config', () => { - const client = createACPClient({ - command: ['echo', 'test'], - }) - - expect(client).toBeDefined() - expect(typeof client.connect).toBe('function') - expect(typeof client.disconnect).toBe('function') - expect(typeof client.createSession).toBe('function') - expect(typeof 
client.prompt).toBe('function') - expect(typeof client.promptSync).toBe('function') - expect(typeof client.cancelPrompt).toBe('function') - expect(typeof client.getCapabilities).toBe('function') - expect(typeof client.getInitializeResult).toBe('function') - expect(typeof client.isConnected).toBe('function') - }) - - test('creates client with full config', () => { - const client = createACPClient({ - command: ['claude', 'code'], - cwd: '/tmp', - env: { TEST: 'value' }, - clientInfo: { name: 'test-client', version: '1.0.0' }, - capabilities: { fs: { readTextFile: true } }, - timeout: 60000, - onPermissionRequest: async () => ({ outcome: { outcome: 'cancelled' } }), - }) - - expect(client).toBeDefined() - }) -}) - -// ============================================================================ -// State Methods (before connection) -// ============================================================================ - -describe('Client state before connection', () => { - test('isConnected returns false before connect', () => { - const client = createACPClient({ - command: ['echo', 'test'], - }) - - expect(client.isConnected()).toBe(false) - }) - - test('getCapabilities returns undefined before connect', () => { - const client = createACPClient({ - command: ['echo', 'test'], - }) - - expect(client.getCapabilities()).toBeUndefined() - }) - - test('getInitializeResult returns undefined before connect', () => { - const client = createACPClient({ - command: ['echo', 'test'], - }) - - expect(client.getInitializeResult()).toBeUndefined() - }) -}) - -// ============================================================================ -// Operations Before Connection -// ============================================================================ - -describe('Operations before connection', () => { - test('createSession throws when not connected', async () => { - const client = createACPClient({ - command: ['echo', 'test'], - }) - - await expect(client.createSession({ cwd: '/tmp' 
})).rejects.toThrow('Not connected') - }) - - test('promptSync throws when not connected', async () => { - const client = createACPClient({ - command: ['echo', 'test'], - }) - - await expect(client.promptSync('session-1', [{ type: 'text', text: 'Hello' }])).rejects.toThrow('Not connected') - }) - - test('cancelPrompt throws when not connected', async () => { - const client = createACPClient({ - command: ['echo', 'test'], - }) - - await expect(client.cancelPrompt('session-1')).rejects.toThrow('Not connected') - }) - - test('prompt generator throws when not connected', async () => { - const client = createACPClient({ - command: ['echo', 'test'], - }) - - const generator = client.prompt('session-1', [{ type: 'text', text: 'Hello' }]) - - await expect(generator.next()).rejects.toThrow('Not connected') - }) -}) - -// ============================================================================ -// Disconnect Safety -// ============================================================================ - -describe('Disconnect safety', () => { - test('disconnect is safe when not connected', async () => { - const client = createACPClient({ - command: ['echo', 'test'], - }) - - // Should not throw - await client.disconnect() - expect(client.isConnected()).toBe(false) - }) - - test('disconnect with graceful=false is safe when not connected', async () => { - const client = createACPClient({ - command: ['echo', 'test'], - }) - - // Should not throw - await client.disconnect(false) - expect(client.isConnected()).toBe(false) - }) -}) - -// ============================================================================ -// Integration Tests with Mock Process -// ============================================================================ - -describe('Client with mock process', () => { - test('connect starts transport', async () => { - const client = createACPClient({ - command: ['cat'], // cat echoes back input - timeout: 1000, - }) - - // Start connection - cat won't respond with proper 
JSON-RPC - // so this will timeout, but it tests the transport startup - try { - await client.connect() - } catch { - // Expected - cat doesn't speak JSON-RPC - } - - // Cleanup - await client.disconnect(false) - }) - - test('connect throws when already connected', async () => { - const client = createACPClient({ - command: ['cat'], - timeout: 500, - }) - - // Start first connection - const connectPromise = client.connect() - - // Immediately try second connection (before first completes) - // This should throw because transport is started - await expect(client.connect()).rejects.toThrow('Already connected') - - // Cleanup - wait for first connect to timeout then disconnect - try { - await connectPromise - } catch { - // Expected timeout - } - await client.disconnect(false) - }) -}) diff --git a/src/tests/acp-helpers.spec.ts b/src/tests/acp-helpers.spec.ts deleted file mode 100644 index 0779ee9..0000000 --- a/src/tests/acp-helpers.spec.ts +++ /dev/null @@ -1,105 +0,0 @@ -import { describe, expect, test } from 'bun:test' -import type { SessionNotification } from '@agentclientprotocol/sdk' -import { createPrompt, createPromptWithFiles, createPromptWithImage, summarizeResponse } from '../acp-helpers.ts' - -// ============================================================================ -// Prompt Building Utilities -// ============================================================================ - -describe('createPrompt', () => { - test('creates single text block prompt', () => { - const prompt = createPrompt('Hello agent') - expect(prompt).toHaveLength(1) - expect(prompt[0]).toEqual({ type: 'text', text: 'Hello agent' }) - }) -}) - -describe('createPromptWithFiles', () => { - test('creates prompt with file context', () => { - const prompt = createPromptWithFiles('Analyze this', [ - { path: '/src/main.ts', content: 'const x = 1;' }, - { path: '/src/utils.ts', content: 'export const y = 2;' }, - ]) - expect(prompt).toHaveLength(3) - expect(prompt[0]).toEqual({ type: 
'text', text: 'Analyze this' }) - expect(prompt[1]?.type).toBe('resource') - expect(prompt[2]?.type).toBe('resource') - }) -}) - -describe('createPromptWithImage', () => { - test('creates prompt with image', () => { - const prompt = createPromptWithImage({ text: 'Describe this', imageData: 'base64img', mimeType: 'image/png' }) - expect(prompt).toHaveLength(2) - expect(prompt[0]).toEqual({ type: 'text', text: 'Describe this' }) - expect(prompt[1]).toEqual({ - type: 'image', - data: 'base64img', - mimeType: 'image/png', - }) - }) -}) - -// ============================================================================ -// Response Analysis -// ============================================================================ - -describe('summarizeResponse', () => { - test('creates comprehensive summary', () => { - const notifications: SessionNotification[] = [ - { - sessionId: 's1', - update: { sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'Processing...' } }, - }, - { - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't1', title: 'read', status: 'in_progress' }, - }, - { - sessionId: 's1', - update: { - sessionUpdate: 'plan', - entries: [{ content: 'Step 1', status: 'in_progress', priority: 'high' }], - }, - }, - { - sessionId: 's1', - update: { sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'Done!' 
} }, - }, - { - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't1', title: 'read', status: 'completed' }, - }, - { - sessionId: 's1', - update: { - sessionUpdate: 'plan', - entries: [{ content: 'Step 1', status: 'completed', priority: 'high' }], - }, - }, - ] - - const summary = summarizeResponse(notifications) - - expect(summary.text).toBe('Processing...Done!') - expect(summary.toolCallCount).toBe(1) - expect(summary.completedToolCalls).toHaveLength(1) - expect(summary.failedToolCalls).toHaveLength(0) - expect(summary.plan).toHaveLength(1) - expect(summary.planProgress).toBe(100) - expect(summary.hasErrors).toBe(false) - }) - - test('detects errors in summary', () => { - const notifications: SessionNotification[] = [ - { - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't1', title: 'read', status: 'failed' }, - }, - ] - - const summary = summarizeResponse(notifications) - expect(summary.hasErrors).toBe(true) - expect(summary.failedToolCalls).toHaveLength(1) - }) -}) diff --git a/src/tests/acp-transport.spec.ts b/src/tests/acp-transport.spec.ts deleted file mode 100644 index b99bc28..0000000 --- a/src/tests/acp-transport.spec.ts +++ /dev/null @@ -1,153 +0,0 @@ -import { describe, expect, test } from 'bun:test' -import { createACPTransport } from '../acp-transport.ts' - -// ============================================================================ -// Transport Creation Tests (without spawning) -// ============================================================================ - -describe('createACPTransport', () => { - test('throws on empty command', async () => { - const transport = createACPTransport({ - command: [], - }) - - await expect(transport.start()).rejects.toThrow('Command array is empty') - }) - - test('isConnected returns false before start', async () => { - const transport = createACPTransport({ - command: ['echo', 'test'], - }) - - expect(transport.isConnected()).toBe(false) - }) - - test('request throws 
when not connected', async () => { - const transport = createACPTransport({ - command: ['echo', 'test'], - }) - - await expect(transport.request('test/method')).rejects.toThrow('Transport is not connected') - }) - - test('notify throws when not connected', async () => { - const transport = createACPTransport({ - command: ['echo', 'test'], - }) - - await expect(transport.notify('test/notification')).rejects.toThrow('Transport is not connected') - }) - - test('close is safe when not started', async () => { - const transport = createACPTransport({ - command: ['echo', 'test'], - }) - - // Should not throw - await transport.close() - expect(transport.isConnected()).toBe(false) - }) -}) - -// ============================================================================ -// Mock Subprocess Integration Tests -// ============================================================================ - -describe('Transport with mock subprocess', () => { - test('starts transport with valid command', async () => { - const transport = createACPTransport({ - command: ['cat'], // cat echoes back input - timeout: 1000, - }) - - await transport.start() - expect(transport.isConnected()).toBe(true) - - // Close immediately since cat doesn't speak JSON-RPC - await transport.close(false) - expect(transport.isConnected()).toBe(false) - }) - - test('throws on duplicate start', async () => { - const transport = createACPTransport({ - command: ['cat'], - timeout: 1000, - }) - - await transport.start() - - try { - await expect(transport.start()).rejects.toThrow('Transport already started') - } finally { - await transport.close(false) - } - }) - - test('handles process exit', async () => { - const { createACPTransport } = await import('../acp-transport.ts') - - let closeCalled = false - let closeCode: number | null = null - - const transport = createACPTransport({ - command: ['true'], // exits immediately with code 0 - timeout: 1000, - onClose: (code) => { - closeCalled = true - closeCode = code - }, - 
}) - - await transport.start() - - // Wait for process to exit - await new Promise((resolve) => setTimeout(resolve, 100)) - - expect(closeCalled).toBe(true) - expect(closeCode === 0).toBe(true) - }) - - test('handles invalid command', async () => { - const transport = createACPTransport({ - command: ['nonexistent-command-that-does-not-exist-12345'], - timeout: 1000, - }) - - // Bun.spawn may throw or exit with error depending on the command - try { - await transport.start() - // If it doesn't throw, wait for process exit - await new Promise((resolve) => setTimeout(resolve, 100)) - } catch { - // Expected - command not found - } - }) -}) - -// ============================================================================ -// Error Handling Tests -// ============================================================================ - -describe('Transport error handling', () => { - test('request times out when no response received', async () => { - // TODO(human): Implement timeout test - }) - - test('close rejects pending requests', async () => { - const transport = createACPTransport({ - command: ['cat'], - timeout: 5000, - }) - - await transport.start() - - // Start a request that will never complete (cat doesn't speak JSON-RPC) - const requestPromise = transport.request('test/method') - - // Close transport while request is pending - await transport.close(false) - - // Request should be rejected with "Transport closed" - await expect(requestPromise).rejects.toThrow('Transport closed') - }) -}) diff --git a/src/tests/acp-utils.spec.ts b/src/tests/acp-utils.spec.ts deleted file mode 100644 index 5f869e2..0000000 --- a/src/tests/acp-utils.spec.ts +++ /dev/null @@ -1,394 +0,0 @@ -import { describe, expect, test } from 'bun:test' -import type { ContentBlock, PlanEntry, SessionNotification, ToolCall } from '@agentclientprotocol/sdk' -import { - createAudioContent, - createBlobResource, - createImageContent, - createResourceLink, - createTextContent, - createTextResource, - 
extractLatestToolCalls, - extractPlan, - extractText, - extractTextFromUpdates, - extractToolCalls, - filterPlanByStatus, - filterToolCallsByStatus, - filterToolCallsByTitle, - getCompletedToolCallsWithContent, - getPlanProgress, - hasToolCallErrors, -} from '../acp-utils.ts' - -// ============================================================================ -// Content Block Builders -// ============================================================================ - -describe('createTextContent', () => { - test('creates text content block', () => { - const content = createTextContent('Hello world') - expect(content.type).toBe('text') - // Type narrowing to access text property - if (content.type === 'text') { - expect(content.text).toBe('Hello world') - } - }) -}) - -describe('createImageContent', () => { - test('creates image content with required fields', () => { - const content = createImageContent('base64data', 'image/png') - expect(content.type).toBe('image') - if (content.type === 'image') { - expect(content.data).toBe('base64data') - expect(content.mimeType).toBe('image/png') - } - }) -}) - -describe('createAudioContent', () => { - test('creates audio content block', () => { - const content = createAudioContent('audiodata', 'audio/wav') - expect(content.type).toBe('audio') - if (content.type === 'audio') { - expect(content.data).toBe('audiodata') - expect(content.mimeType).toBe('audio/wav') - } - }) -}) - -describe('createResourceLink', () => { - test('creates resource link with uri and name', () => { - const content = createResourceLink({ uri: 'file:///path/to/file.ts', name: 'file.ts' }) - expect(content.type).toBe('resource_link') - if (content.type === 'resource_link') { - expect(content.uri).toBe('file:///path/to/file.ts') - expect(content.name).toBe('file.ts') - } - }) - - test('includes optional mimeType', () => { - const content = createResourceLink({ uri: 'file:///path/to/file.ts', name: 'file.ts', mimeType: 'text/typescript' }) - if (content.type 
=== 'resource_link') { - expect(content.mimeType).toBe('text/typescript') - } - }) -}) - -describe('createTextResource', () => { - test('creates embedded text resource', () => { - const content = createTextResource({ uri: 'file:///src/main.ts', text: 'const x = 1;' }) - expect(content.type).toBe('resource') - if (content.type === 'resource') { - expect(content.resource.uri).toBe('file:///src/main.ts') - expect('text' in content.resource && content.resource.text).toBe('const x = 1;') - } - }) - - test('includes optional mimeType', () => { - const content = createTextResource({ - uri: 'file:///src/main.ts', - text: 'const x = 1;', - mimeType: 'text/typescript', - }) - if (content.type === 'resource' && 'text' in content.resource) { - expect(content.resource.mimeType).toBe('text/typescript') - } - }) -}) - -describe('createBlobResource', () => { - test('creates embedded blob resource', () => { - const content = createBlobResource({ uri: 'file:///image.png', blob: 'base64blobdata' }) - expect(content.type).toBe('resource') - if (content.type === 'resource' && 'blob' in content.resource) { - expect(content.resource.uri).toBe('file:///image.png') - expect(content.resource.blob).toBe('base64blobdata') - } - }) -}) - -// ============================================================================ -// Content Extraction -// ============================================================================ - -describe('extractText', () => { - test('extracts text from text content blocks', () => { - const content: ContentBlock[] = [ - { type: 'text', text: 'Hello' }, - { type: 'text', text: 'World' }, - ] - expect(extractText(content)).toBe('Hello\nWorld') - }) - - test('ignores non-text content blocks', () => { - const content: ContentBlock[] = [ - { type: 'text', text: 'Hello' }, - { type: 'image', data: 'base64', mimeType: 'image/png' }, - { type: 'text', text: 'World' }, - ] - expect(extractText(content)).toBe('Hello\nWorld') - }) - - test('returns empty string for no text 
blocks', () => { - const content: ContentBlock[] = [{ type: 'image', data: 'base64', mimeType: 'image/png' }] - expect(extractText(content)).toBe('') - }) - - test('handles empty array', () => { - expect(extractText([])).toBe('') - }) -}) - -describe('extractTextFromUpdates', () => { - test('extracts text from agent message chunks', () => { - const notifications: SessionNotification[] = [ - { - sessionId: 's1', - update: { sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'First' } }, - }, - { - sessionId: 's1', - update: { sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'Second' } }, - }, - ] - expect(extractTextFromUpdates(notifications)).toBe('FirstSecond') - }) - - test('skips non-text content updates', () => { - const notifications: SessionNotification[] = [ - { - sessionId: 's1', - update: { sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'Hello' } }, - }, - { - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't1', title: 'read_file', status: 'pending' }, - }, - { - sessionId: 's1', - update: { sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'World' } }, - }, - ] - expect(extractTextFromUpdates(notifications)).toBe('HelloWorld') - }) -}) - -describe('extractToolCalls', () => { - test('extracts all tool calls from notifications', () => { - const notifications: SessionNotification[] = [ - { - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't1', title: 'read_file', status: 'completed' }, - }, - { - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't2', title: 'write_file', status: 'in_progress' }, - }, - ] - const calls = extractToolCalls(notifications) - expect(calls).toHaveLength(2) - expect(calls[0]?.title).toBe('read_file') - expect(calls[1]?.title).toBe('write_file') - }) - - test('returns empty array when no tool calls', () => { - const notifications: SessionNotification[] = [ - { - sessionId: 's1', - update: { 
sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'Hello' } }, - }, - ] - expect(extractToolCalls(notifications)).toEqual([]) - }) -}) - -describe('extractLatestToolCalls', () => { - test('returns latest state of each tool call', () => { - const notifications: SessionNotification[] = [ - { - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't1', title: 'read_file', status: 'pending' }, - }, - { - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't1', title: 'read_file', status: 'in_progress' }, - }, - { - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't1', title: 'read_file', status: 'completed' }, - }, - ] - const latest = extractLatestToolCalls(notifications) - expect(latest.size).toBe(1) - expect(latest.get('t1')?.status).toBe('completed') - }) - - test('tracks multiple tool calls independently', () => { - const notifications: SessionNotification[] = [ - { - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't1', title: 'read_file', status: 'completed' }, - }, - { - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't2', title: 'write_file', status: 'in_progress' }, - }, - ] - const latest = extractLatestToolCalls(notifications) - expect(latest.size).toBe(2) - expect(latest.get('t1')?.status).toBe('completed') - expect(latest.get('t2')?.status).toBe('in_progress') - }) -}) - -describe('extractPlan', () => { - test('returns latest plan from notifications', () => { - const notifications: SessionNotification[] = [ - { - sessionId: 's1', - update: { - sessionUpdate: 'plan', - entries: [{ content: 'Step 1', status: 'pending', priority: 'medium' }], - }, - }, - { - sessionId: 's1', - update: { - sessionUpdate: 'plan', - entries: [ - { content: 'Step 1', status: 'completed', priority: 'medium' }, - { content: 'Step 2', status: 'in_progress', priority: 'medium' }, - ], - }, - }, - ] - const plan = extractPlan(notifications) - 
expect(plan).toHaveLength(2) - expect(plan?.[0]?.status).toBe('completed') - }) - - test('returns undefined when no plan in updates', () => { - const notifications: SessionNotification[] = [ - { - sessionId: 's1', - update: { sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'Hi' } }, - }, - ] - expect(extractPlan(notifications)).toBeUndefined() - }) -}) - -// ============================================================================ -// Tool Call Utilities -// ============================================================================ - -describe('filterToolCallsByStatus', () => { - const toolCalls: ToolCall[] = [ - { toolCallId: 't1', title: 'a', status: 'completed' }, - { toolCallId: 't2', title: 'b', status: 'failed' }, - { toolCallId: 't3', title: 'c', status: 'completed' }, - ] - - test('filters by completed status', () => { - const result = filterToolCallsByStatus(toolCalls, 'completed') - expect(result).toHaveLength(2) - expect(result.every((c) => c.status === 'completed')).toBe(true) - }) - - test('filters by failed status', () => { - const result = filterToolCallsByStatus(toolCalls, 'failed') - expect(result).toHaveLength(1) - expect(result[0]?.title).toBe('b') - }) -}) - -describe('filterToolCallsByTitle', () => { - const toolCalls: ToolCall[] = [ - { toolCallId: 't1', title: 'read_file', status: 'completed' }, - { toolCallId: 't2', title: 'write_file', status: 'completed' }, - { toolCallId: 't3', title: 'read_file', status: 'completed' }, - ] - - test('filters by tool title', () => { - const result = filterToolCallsByTitle(toolCalls, 'read_file') - expect(result).toHaveLength(2) - }) -}) - -describe('hasToolCallErrors', () => { - test('returns true when failed tool calls exist', () => { - const toolCalls: ToolCall[] = [ - { toolCallId: 't1', title: 'a', status: 'completed' }, - { toolCallId: 't2', title: 'b', status: 'failed' }, - ] - expect(hasToolCallErrors(toolCalls)).toBe(true) - }) - - test('returns false when no failed tool 
calls', () => { - const toolCalls: ToolCall[] = [ - { toolCallId: 't1', title: 'a', status: 'completed' }, - { toolCallId: 't2', title: 'b', status: 'completed' }, - ] - expect(hasToolCallErrors(toolCalls)).toBe(false) - }) -}) - -describe('getCompletedToolCallsWithContent', () => { - test('returns completed calls with content', () => { - const toolCalls: ToolCall[] = [ - { - toolCallId: 't1', - title: 'read', - status: 'completed', - content: [{ type: 'content', content: { type: 'text', text: 'file content' } }], - }, - { toolCallId: 't2', title: 'write', status: 'completed' }, - { toolCallId: 't3', title: 'fetch', status: 'in_progress' }, - ] - const result = getCompletedToolCallsWithContent(toolCalls) - expect(result).toHaveLength(1) - expect(result[0]?.title).toBe('read') - }) -}) - -// ============================================================================ -// Plan Utilities -// ============================================================================ - -describe('filterPlanByStatus', () => { - const plan: PlanEntry[] = [ - { content: 'Step 1', status: 'completed', priority: 'high' }, - { content: 'Step 2', status: 'in_progress', priority: 'medium' }, - { content: 'Step 3', status: 'pending', priority: 'low' }, - ] - - test('filters by status', () => { - expect(filterPlanByStatus(plan, 'completed')).toHaveLength(1) - expect(filterPlanByStatus(plan, 'pending')).toHaveLength(1) - }) -}) - -describe('getPlanProgress', () => { - test('calculates completion percentage', () => { - const plan: PlanEntry[] = [ - { content: 'Step 1', status: 'completed', priority: 'high' }, - { content: 'Step 2', status: 'completed', priority: 'high' }, - { content: 'Step 3', status: 'pending', priority: 'medium' }, - { content: 'Step 4', status: 'pending', priority: 'low' }, - ] - expect(getPlanProgress(plan)).toBe(50) - }) - - test('returns 100 for empty plan', () => { - expect(getPlanProgress([])).toBe(100) - }) - - test('returns 100 for all completed', () => { - const plan: 
PlanEntry[] = [ - { content: 'Step 1', status: 'completed', priority: 'high' }, - { content: 'Step 2', status: 'completed', priority: 'medium' }, - ] - expect(getPlanProgress(plan)).toBe(100) - }) -}) diff --git a/src/tests/adapter-check.spec.ts b/src/tests/adapter-check.spec.ts deleted file mode 100644 index fa0d558..0000000 --- a/src/tests/adapter-check.spec.ts +++ /dev/null @@ -1,70 +0,0 @@ -/** - * Tests for adapter compliance checking functionality. - */ - -import { describe, expect, test } from 'bun:test' -import { type CheckConfig, runCheck } from '../adapter-check.ts' - -describe('runCheck', () => { - test('fails spawn check for non-existent command', async () => { - const config: CheckConfig = { - command: ['nonexistent-command-xyz'], - timeout: 1000, - verbose: false, - } - - const result = await runCheck(config) - - expect(result.passed).toBe(false) - expect(result.checks.length).toBeGreaterThanOrEqual(1) - expect(result.checks[0]?.name).toBe('spawn') - expect(result.checks[0]?.passed).toBe(false) - }) - - test('fails spawn check for command that exits immediately', async () => { - const config: CheckConfig = { - command: ['false'], // Unix command that exits with code 1 - timeout: 1000, - verbose: false, - } - - const result = await runCheck(config) - - expect(result.passed).toBe(false) - expect(result.summary.failed).toBeGreaterThanOrEqual(1) - }) - - test('returns structured result with summary', async () => { - const config: CheckConfig = { - command: ['echo', 'test'], - timeout: 1000, - verbose: false, - } - - const result = await runCheck(config) - - expect(result).toHaveProperty('passed') - expect(result).toHaveProperty('checks') - expect(result).toHaveProperty('summary') - expect(result.summary).toHaveProperty('total') - expect(result.summary).toHaveProperty('passed') - expect(result.summary).toHaveProperty('failed') - expect(typeof result.passed).toBe('boolean') - expect(Array.isArray(result.checks)).toBe(true) - }) - - test('includes verbose 
details when enabled', async () => { - const config: CheckConfig = { - command: ['echo', 'test'], - timeout: 1000, - verbose: true, - } - - const result = await runCheck(config) - - // At least the spawn check should have details in verbose mode - const spawnCheck = result.checks.find((c) => c.name === 'spawn') - expect(spawnCheck).toBeDefined() - // Note: details may or may not be present depending on check outcome - }) -}) diff --git a/src/tests/adapter-scaffold.spec.ts b/src/tests/adapter-scaffold.spec.ts deleted file mode 100644 index 1a6f92e..0000000 --- a/src/tests/adapter-scaffold.spec.ts +++ /dev/null @@ -1,112 +0,0 @@ -/** - * Tests for adapter scaffolding functionality. - */ - -import { afterEach, describe, expect, test } from 'bun:test' -import { rm } from 'node:fs/promises' -import { join } from 'node:path' -import { runScaffold, type ScaffoldConfig } from '../adapter-scaffold.ts' - -const testDir = join(import.meta.dir, 'fixtures', 'scaffold-output') - -describe('runScaffold', () => { - afterEach(async () => { - // Clean up test output - await rm(testDir, { recursive: true, force: true }) - }) - - test('generates TypeScript adapter structure', async () => { - const config: ScaffoldConfig = { - name: 'test-agent', - outputDir: testDir, - lang: 'ts', - minimal: false, - } - - const result = await runScaffold(config) - - expect(result.outputDir).toBe(testDir) - expect(result.lang).toBe('ts') - expect(result.files).toContain('package.json') - expect(result.files).toContain('tsconfig.json') - expect(result.files).toContain('src/main.ts') - expect(result.files).toContain('src/types.ts') - expect(result.files).toContain('src/session-manager.ts') - expect(result.files).toContain('src/handlers/initialize.ts') - expect(result.files).toContain('src/handlers/session-new.ts') - expect(result.files).toContain('src/handlers/session-prompt.ts') - expect(result.files).toContain('src/handlers/session-cancel.ts') - expect(result.files).toContain('README.md') - - // 
Verify files actually exist - const packageJson = await Bun.file(join(testDir, 'package.json')).text() - expect(packageJson).toContain('"test-agent-acp"') - - const mainTs = await Bun.file(join(testDir, 'src', 'main.ts')).text() - expect(mainTs).toContain('#!/usr/bin/env bun') - expect(mainTs).toContain('handleInitialize') - }) - - test('generates minimal TypeScript structure without README', async () => { - const config: ScaffoldConfig = { - name: 'minimal-agent', - outputDir: testDir, - lang: 'ts', - minimal: true, - } - - const result = await runScaffold(config) - - expect(result.files).not.toContain('README.md') - expect(result.files).toContain('package.json') - expect(result.files).toContain('src/main.ts') - }) - - test('generates Python adapter structure', async () => { - const config: ScaffoldConfig = { - name: 'python-agent', - outputDir: testDir, - lang: 'python', - minimal: false, - } - - const result = await runScaffold(config) - - expect(result.lang).toBe('python') - expect(result.files).toContain('adapter.py') - expect(result.files).toContain('README.md') - - const adapterPy = await Bun.file(join(testDir, 'adapter.py')).text() - expect(adapterPy).toContain('#!/usr/bin/env python3') - expect(adapterPy).toContain('python-agent') - expect(adapterPy).toContain('def handle_initialize') - }) - - test('generates minimal Python structure without README', async () => { - const config: ScaffoldConfig = { - name: 'minimal-python', - outputDir: testDir, - lang: 'python', - minimal: true, - } - - const result = await runScaffold(config) - - expect(result.files).toContain('adapter.py') - expect(result.files).not.toContain('README.md') - }) - - test('package.json contains correct name', async () => { - const config: ScaffoldConfig = { - name: 'my-special-agent', - outputDir: testDir, - lang: 'ts', - minimal: true, - } - - await runScaffold(config) - - const packageJson = JSON.parse(await Bun.file(join(testDir, 'package.json')).text()) - 
expect(packageJson.name).toBe('my-special-agent-acp') - }) -}) diff --git a/src/tests/capture-cli.spec.ts b/src/tests/capture-cli.spec.ts index bc19d62..c9666df 100644 --- a/src/tests/capture-cli.spec.ts +++ b/src/tests/capture-cli.spec.ts @@ -110,22 +110,23 @@ describe('runCapture configuration', () => { // Type-level test - if this compiles, the types are correct const config: CaptureConfig = { promptsPath: '/tmp/prompts.jsonl', - agentCommand: ['bunx', 'test-agent'], + schemaPath: './schemas/claude-headless.json', outputPath: '/tmp/output.jsonl', cwd: '/tmp', timeout: 30000, progress: true, append: false, + debug: false, } expect(config.promptsPath).toBe('/tmp/prompts.jsonl') - expect(config.agentCommand).toEqual(['bunx', 'test-agent']) + expect(config.schemaPath).toBe('./schemas/claude-headless.json') }) test('CaptureConfig allows minimal configuration', () => { const config: CaptureConfig = { promptsPath: '/tmp/prompts.jsonl', - agentCommand: ['echo', 'test'], + schemaPath: './test-schema.json', } expect(config.outputPath).toBeUndefined() @@ -151,13 +152,14 @@ describe('capture CLI', () => { const stdout = await new Response(proc.stdout).text() await proc.exited - expect(stdout).toContain('Usage: acp-harness capture') + expect(stdout).toContain('Usage: agent-eval-harness capture') expect(stdout).toContain('prompts.jsonl') expect(stdout).toContain('-o, --output') expect(stdout).toContain('-c, --cwd') expect(stdout).toContain('-t, --timeout') expect(stdout).toContain('--progress') expect(stdout).toContain('-g, --grader') + expect(stdout).toContain('-s, --schema') }) test('shows error for missing prompts file argument', async () => { @@ -173,7 +175,7 @@ describe('capture CLI', () => { expect(stderr).toContain('prompts.jsonl path is required') }) - test('shows error for missing agent command', async () => { + test('shows error for missing schema argument', async () => { const proc = Bun.spawn(['bun', './bin/cli.ts', 'capture', '/tmp/prompts.jsonl'], { stdout: 
'pipe', stderr: 'pipe', @@ -183,6 +185,6 @@ describe('capture CLI', () => { const exitCode = await proc.exited expect(exitCode).not.toBe(0) - expect(stderr).toContain('ACP agent command is required') + expect(stderr).toContain('--schema is required') }) }) diff --git a/src/tests/capture-helpers.spec.ts b/src/tests/capture-helpers.spec.ts index 6b664d1..0fd1519 100644 --- a/src/tests/capture-helpers.spec.ts +++ b/src/tests/capture-helpers.spec.ts @@ -1,16 +1,15 @@ import { describe, expect, test } from 'bun:test' -import type { SessionNotification } from '@agentclientprotocol/sdk' import { detectTrajectoryRichness, extractContent, extractFilePath, extractOutput, - extractTokenCounts, extractTrajectory, hasToolErrors, headTailPreview, loadPrompts, } from '../capture.ts' +import type { ParsedUpdate } from '../headless-output-parser.ts' import type { TrajectoryStep } from '../schemas.ts' // ============================================================================ @@ -104,176 +103,106 @@ describe('loadPrompts', () => { describe('extractTrajectory', () => { const baseTime = 0 - test('extracts thoughts from agent_thought_chunk notifications', () => { - const notifications: SessionNotification[] = [ + test('extracts thoughts from thought type updates', () => { + const updates: ParsedUpdate[] = [ { - sessionId: 's1', - update: { - sessionUpdate: 'agent_thought_chunk', - content: { type: 'text', text: 'Let me think about this...' }, - }, + type: 'thought', + content: 'Let me think about this...', + raw: { type: 'thought', text: 'Let me think about this...' }, }, ] - const trajectory = extractTrajectory(notifications, baseTime) + const trajectory = extractTrajectory(updates, baseTime) expect(trajectory).toHaveLength(1) expect(trajectory[0]?.type).toBe('thought') - // Type narrowing after explicit assertion const step = trajectory[0]! 
expect(step.type === 'thought' && step.content).toBe('Let me think about this...') }) - test('extracts messages from agent_message_chunk notifications', () => { - const notifications: SessionNotification[] = [ + test('extracts messages from message type updates', () => { + const updates: ParsedUpdate[] = [ { - sessionId: 's1', - update: { - sessionUpdate: 'agent_message_chunk', - content: { type: 'text', text: 'Here is my answer.' }, - }, + type: 'message', + content: 'Here is my answer.', + raw: { type: 'message', text: 'Here is my answer.' }, }, ] - const trajectory = extractTrajectory(notifications, baseTime) + const trajectory = extractTrajectory(updates, baseTime) expect(trajectory).toHaveLength(1) expect(trajectory[0]?.type).toBe('message') - // Type narrowing after explicit assertion const step = trajectory[0]! expect(step.type === 'message' && step.content).toBe('Here is my answer.') }) - test('extracts tool calls with initial pending status', () => { - const notifications: SessionNotification[] = [ + test('extracts tool calls with title and status', () => { + const updates: ParsedUpdate[] = [ { - sessionId: 's1', - update: { - sessionUpdate: 'tool_call', - toolCallId: 't1', - title: 'Read', - status: 'pending', - rawInput: '{"file_path": "/test.ts"}', - }, + type: 'tool_call', + title: 'Read', + status: 'pending', + raw: { tool: 'Read', input: { file_path: '/test.ts' } }, }, ] - const trajectory = extractTrajectory(notifications, baseTime) + const trajectory = extractTrajectory(updates, baseTime) expect(trajectory).toHaveLength(1) expect(trajectory[0]?.type).toBe('tool_call') - // Type narrowing after explicit assertion const step = trajectory[0]! 
expect(step.type === 'tool_call' && step.name).toBe('Read') expect(step.type === 'tool_call' && step.status).toBe('pending') - expect(step.type === 'tool_call' && step.input).toBe('{"file_path": "/test.ts"}') - }) - - test('updates tool call status on subsequent notifications', () => { - const notifications: SessionNotification[] = [ - { - sessionId: 's1', - update: { - sessionUpdate: 'tool_call', - toolCallId: 't1', - title: 'Read', - status: 'pending', - }, - }, - { - sessionId: 's1', - update: { - sessionUpdate: 'tool_call', - toolCallId: 't1', - title: 'Read', - status: 'completed', - rawOutput: 'file contents here', - }, - }, - ] - - const trajectory = extractTrajectory(notifications, baseTime) - - // Should still be 1 entry, just updated - expect(trajectory).toHaveLength(1) - expect(trajectory[0]?.type).toBe('tool_call') - // Type narrowing after explicit assertion - const step = trajectory[0]! - expect(step.type === 'tool_call' && step.status).toBe('completed') - expect(step.type === 'tool_call' && step.output).toBe('file contents here') - }) - - test('tracks multiple independent tool calls', () => { - const notifications: SessionNotification[] = [ - { - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't1', title: 'Read', status: 'completed' }, - }, - { - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't2', title: 'Write', status: 'completed' }, - }, - ] - - const trajectory = extractTrajectory(notifications, baseTime) - - expect(trajectory).toHaveLength(2) - expect(trajectory[0]?.type).toBe('tool_call') - expect(trajectory[1]?.type).toBe('tool_call') - // Type narrowing after explicit assertions - const step0 = trajectory[0]! - const step1 = trajectory[1]! 
- expect(step0.type === 'tool_call' && step0.name).toBe('Read') - expect(step1.type === 'tool_call' && step1.name).toBe('Write') }) - test('extracts plan entries', () => { - const notifications: SessionNotification[] = [ + test('extracts plan type updates', () => { + const updates: ParsedUpdate[] = [ { - sessionId: 's1', - update: { - sessionUpdate: 'plan', + type: 'plan', + raw: { entries: [ - { content: 'Step 1', status: 'completed', priority: 'high' }, - { content: 'Step 2', status: 'in_progress', priority: 'medium' }, + { content: 'Step 1', status: 'completed' }, + { content: 'Step 2', status: 'in_progress' }, ], }, }, ] - const trajectory = extractTrajectory(notifications, baseTime) + const trajectory = extractTrajectory(updates, baseTime) expect(trajectory).toHaveLength(1) expect(trajectory[0]?.type).toBe('plan') - // Type narrowing after explicit assertion + // Note: extractTrajectory creates plan entries from the update type + // but doesn't extract entries from raw (they are captured via output parser mappings) const step = trajectory[0]! 
- expect(step.type === 'plan' && step.entries).toHaveLength(2) + expect(step.type === 'plan').toBe(true) }) - test('handles empty notifications', () => { + test('handles empty updates', () => { const trajectory = extractTrajectory([], baseTime) expect(trajectory).toEqual([]) }) test('assigns timestamps relative to start time', () => { - // Mock Date.now to control timestamps const originalNow = Date.now try { let currentTime = 1000 Date.now = () => currentTime - const notifications: SessionNotification[] = [ + const updates: ParsedUpdate[] = [ { - sessionId: 's1', - update: { sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'First' } }, + type: 'message', + content: 'First', + raw: { type: 'message', text: 'First' }, }, ] const startTime = 1000 currentTime = 1500 // 500ms later - const trajectory = extractTrajectory(notifications, startTime) + const trajectory = extractTrajectory(updates, startTime) expect(trajectory[0]?.timestamp).toBe(500) } finally { @@ -281,65 +210,26 @@ describe('extractTrajectory', () => { } }) - test('calculates tool call duration correctly', () => { - const originalNow = Date.now - try { - let currentTime = 1000 - - Date.now = () => currentTime - - const startTime = 1000 - - // Simulate time passing between notifications - // First notification at t=100 (currentTime = 1100) - // Second notification at t=600 (currentTime = 1600) - const notifications: SessionNotification[] = [] - - currentTime = 1100 // First call at 100ms relative to start - notifications.push({ - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't1', title: 'Bash', status: 'pending' }, - }) - - currentTime = 1600 // Second call at 600ms relative to start - notifications.push({ - sessionId: 's1', - update: { sessionUpdate: 'tool_call', toolCallId: 't1', title: 'Bash', status: 'completed' }, - }) - - // Now process all notifications in one call - // But the issue is extractTrajectory calls Date.now() for each notification - // so we 
need to mock it to return different values for each call - - let callCount = 0 - const times = [1100, 1600] - Date.now = () => times[callCount++] ?? 1600 - - const trajectory = extractTrajectory(notifications, startTime) - - expect(trajectory[0]?.type).toBe('tool_call') - // Type narrowing after explicit assertion - Duration should be 500ms (600 - 100) - const step = trajectory[0]! - expect(step.type === 'tool_call' && step.duration).toBe(500) - } finally { - Date.now = originalNow - } - }) - - test('ignores non-text content in thought chunks', () => { - const notifications: SessionNotification[] = [ + test('handles updates without content for message/thought types', () => { + const updates: ParsedUpdate[] = [ { - sessionId: 's1', - update: { - sessionUpdate: 'agent_thought_chunk', - // Image content should be skipped - content: { type: 'image', data: 'base64', mimeType: 'image/png' }, - }, + type: 'message', + content: undefined, // No content - will have empty string + raw: { type: 'message' }, + }, + { + type: 'message', + content: 'Has content', + raw: { type: 'message', text: 'Has content' }, }, ] - const trajectory = extractTrajectory(notifications, baseTime) - expect(trajectory).toHaveLength(0) + const trajectory = extractTrajectory(updates, baseTime) + + // Both messages are included - ones without content get empty string + expect(trajectory).toHaveLength(2) + expect(trajectory[0]?.type).toBe('message') + expect(trajectory[1]?.type).toBe('message') }) }) @@ -632,84 +522,3 @@ describe('detectTrajectoryRichness', () => { expect(detectTrajectoryRichness(trajectory)).toBe('full') }) }) - -// ============================================================================ -// extractTokenCounts -// ============================================================================ - -describe('extractTokenCounts', () => { - test('returns undefined when no usage data present', () => { - const updates: SessionNotification[] = [ - { - sessionId: 's1', - update: { 
sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'Hello' } }, - }, - ] - - const result = extractTokenCounts(updates) - - expect(result.inputTokens).toBeUndefined() - expect(result.outputTokens).toBeUndefined() - }) - - test('extracts token counts from usage field when present', () => { - const updates: SessionNotification[] = [ - { - sessionId: 's1', - update: { sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'Hello' } }, - // @ts-expect-error - SessionNotification type doesn't include 'usage' field, but adapters like Claude Code add it at runtime - usage: { inputTokens: 50, outputTokens: 30 }, - }, - ] - - const result = extractTokenCounts(updates) - - expect(result.inputTokens).toBe(50) - expect(result.outputTokens).toBe(30) - }) - - test('accumulates token counts across multiple updates', () => { - const updates: SessionNotification[] = [ - { - sessionId: 's1', - update: { sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'First' } }, - // @ts-expect-error - SessionNotification type doesn't include 'usage' field, but adapters like Claude Code add it at runtime - usage: { inputTokens: 50, outputTokens: 30 }, - }, - { - sessionId: 's1', - update: { sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'Second' } }, - // @ts-expect-error - SessionNotification type doesn't include 'usage' field, but adapters like Claude Code add it at runtime - usage: { inputTokens: 25, outputTokens: 45 }, - }, - ] - - const result = extractTokenCounts(updates) - - expect(result.inputTokens).toBe(75) // 50 + 25 - expect(result.outputTokens).toBe(75) // 30 + 45 - }) - - test('handles empty updates array', () => { - const result = extractTokenCounts([]) - - expect(result.inputTokens).toBeUndefined() - expect(result.outputTokens).toBeUndefined() - }) - - test('handles partial token counts (only input or output)', () => { - const updates: SessionNotification[] = [ - { - sessionId: 's1', - update: { 
sessionUpdate: 'agent_message_chunk', content: { type: 'text', text: 'Hello' } }, - // @ts-expect-error - SessionNotification type doesn't include 'usage' field, but adapters like Claude Code add it at runtime - usage: { inputTokens: 100 }, - }, - ] - - const result = extractTokenCounts(updates) - - expect(result.inputTokens).toBe(100) - expect(result.outputTokens).toBeUndefined() - }) -}) diff --git a/src/tests/headless.spec.ts b/src/tests/headless.spec.ts index f9c2497..b32a301 100644 --- a/src/tests/headless.spec.ts +++ b/src/tests/headless.spec.ts @@ -86,7 +86,7 @@ describe('HeadlessAdapterSchema', () => { }) describe('validates schema files from disk', () => { - const schemasDir = '.claude/skills/acp-adapters/schemas' + const schemasDir = '.claude/skills/headless-adapters/schemas' test('validates claude-headless.json from disk', async () => { const content = await Bun.file(`${schemasDir}/claude-headless.json`).json() @@ -178,8 +178,8 @@ describe('HeadlessAdapterSchema', () => { expect(result.success).toBe(false) }) - test('rejects wrong version', () => { - const invalid = { ...validClaudeSchema, version: 2 } + test('rejects unsupported version', () => { + const invalid = { ...validClaudeSchema, version: 3 } const result = HeadlessAdapterSchema.safeParse(invalid) expect(result.success).toBe(false) }) diff --git a/src/tests/schemas.spec.ts b/src/tests/schemas.spec.ts index a7b013a..7df25a3 100644 --- a/src/tests/schemas.spec.ts +++ b/src/tests/schemas.spec.ts @@ -17,8 +17,6 @@ import { MessageStepSchema, PlanStepSchema, PromptCaseSchema, - RequestPermissionRequestSchema, - SessionNotificationSchema, SessionSchema, ThoughtStepSchema, TimingSchema, @@ -191,55 +189,6 @@ describe('JsonRpcMessageSchema', () => { }) }) -// ============================================================================ -// ACP SDK Type Schemas -// ============================================================================ - -describe('SessionNotificationSchema', () => { - test('validates 
session notification structure', () => { - const result = SessionNotificationSchema.safeParse({ - sessionId: 'sess_123', - update: { type: 'message' }, - }) - expect(result.success).toBe(true) - }) - - test('rejects missing sessionId', () => { - const result = SessionNotificationSchema.safeParse({ - update: { type: 'message' }, - }) - expect(result.success).toBe(false) - }) - - test('rejects missing update', () => { - const result = SessionNotificationSchema.safeParse({ - sessionId: 'sess_123', - }) - expect(result.success).toBe(false) - }) -}) - -describe('RequestPermissionRequestSchema', () => { - test('validates permission request with options array', () => { - const result = RequestPermissionRequestSchema.safeParse({ - options: [{ id: 1, label: 'Allow' }], - }) - expect(result.success).toBe(true) - }) - - test('rejects missing options', () => { - const result = RequestPermissionRequestSchema.safeParse({}) - expect(result.success).toBe(false) - }) - - test('rejects non-array options', () => { - const result = RequestPermissionRequestSchema.safeParse({ - options: 'not-an-array', - }) - expect(result.success).toBe(false) - }) -}) - // ============================================================================ // MCP Server Schemas // ============================================================================ diff --git a/src/tests/trials-cli.spec.ts b/src/tests/trials-cli.spec.ts index 88bb42d..f89d8c5 100644 --- a/src/tests/trials-cli.spec.ts +++ b/src/tests/trials-cli.spec.ts @@ -9,24 +9,25 @@ describe('TrialsConfig configuration', () => { test('TrialsConfig type accepts valid configuration', () => { const config: TrialsConfig = { promptsPath: '/tmp/prompts.jsonl', - agentCommand: ['bunx', 'test-agent'], + schemaPath: './schemas/claude-headless.json', k: 5, outputPath: '/tmp/output.jsonl', cwd: '/tmp', timeout: 30000, progress: true, append: false, + debug: false, } expect(config.promptsPath).toBe('/tmp/prompts.jsonl') - 
expect(config.agentCommand).toEqual(['bunx', 'test-agent']) + expect(config.schemaPath).toBe('./schemas/claude-headless.json') expect(config.k).toBe(5) }) test('TrialsConfig allows minimal configuration', () => { const config: TrialsConfig = { promptsPath: '/tmp/prompts.jsonl', - agentCommand: ['echo', 'test'], + schemaPath: './test-schema.json', k: 3, } @@ -53,7 +54,7 @@ describe('trials CLI', () => { const stdout = await new Response(proc.stdout).text() await proc.exited - expect(stdout).toContain('Usage: acp-harness trials') + expect(stdout).toContain('Usage: agent-eval-harness trials') expect(stdout).toContain('prompts.jsonl') expect(stdout).toContain('-o, --output') expect(stdout).toContain('-k') @@ -61,6 +62,7 @@ describe('trials CLI', () => { expect(stdout).toContain('-t, --timeout') expect(stdout).toContain('--progress') expect(stdout).toContain('-g, --grader') + expect(stdout).toContain('-s, --schema') expect(stdout).toContain('pass@k') }) @@ -77,7 +79,7 @@ describe('trials CLI', () => { expect(stderr).toContain('prompts.jsonl path is required') }) - test('shows error for missing agent command', async () => { + test('shows error for missing schema argument', async () => { const proc = Bun.spawn(['bun', './bin/cli.ts', 'trials', '/tmp/prompts.jsonl'], { stdout: 'pipe', stderr: 'pipe', @@ -87,7 +89,7 @@ describe('trials CLI', () => { const exitCode = await proc.exited expect(exitCode).not.toBe(0) - expect(stderr).toContain('ACP agent command is required') + expect(stderr).toContain('--schema is required') }) }) @@ -105,7 +107,7 @@ describe('schemas CLI', () => { const stdout = await new Response(proc.stdout).text() await proc.exited - expect(stdout).toContain('Usage: acp-harness schemas') + expect(stdout).toContain('Usage: agent-eval-harness schemas') expect(stdout).toContain('-o, --output') expect(stdout).toContain('-j, --json') expect(stdout).toContain('-s, --split') diff --git a/src/trials.ts b/src/trials.ts index cbadfaa..e8e3aa7 100644 --- 
a/src/trials.ts +++ b/src/trials.ts @@ -13,11 +13,12 @@ import { appendFile } from 'node:fs/promises' import { parseArgs } from 'node:util' -import { createACPClient } from './acp-client.ts' -import { createPrompt } from './acp-helpers.ts' import { extractOutput, extractTrajectory, loadPrompts } from './capture.ts' import { DEFAULT_HARNESS_TIMEOUT, DEFAULT_TRIAL_COUNT } from './constants.ts' import { loadGrader } from './grader-loader.ts' +import { type HeadlessAdapterConfig, parseHeadlessConfig } from './headless.schemas.ts' +import type { ParsedUpdate } from './headless-output-parser.ts' +import { createSessionManager } from './headless-session-manager.ts' import type { Grader, TrialEntry, TrialResult } from './schemas.ts' // ============================================================================ @@ -77,15 +78,15 @@ export const calculatePassExpK = (passes: number, k: number): number => { export type TrialsConfig = { /** Path to prompts.jsonl file */ promptsPath: string - /** ACP agent command */ - agentCommand: string[] + /** Path to agent schema JSON file */ + schemaPath: string /** Number of trials per prompt */ k: number /** Output file path */ outputPath?: string /** Working directory for agent */ cwd?: string - /** Timeout per prompt in milliseconds */ + /** Timeout per prompt in milliseconds (overrides schema default) */ timeout?: number /** Show progress to stderr */ progress?: boolean @@ -93,6 +94,8 @@ export type TrialsConfig = { append?: boolean /** Optional grader function */ grader?: Grader + /** Enable debug mode */ + debug?: boolean } // ============================================================================ @@ -139,35 +142,56 @@ const logProgress = (message: string, showProgress: boolean): void => { export const runTrials = async (config: TrialsConfig): Promise => { const { promptsPath, - agentCommand, + schemaPath, k, outputPath, cwd, - timeout = DEFAULT_HARNESS_TIMEOUT, + timeout, progress = false, append = false, grader, + debug = 
false, } = config + // Load and validate schema + const schemaFile = Bun.file(schemaPath) + if (!(await schemaFile.exists())) { + throw new Error(`Schema file not found: ${schemaPath}`) + } + + let schema: HeadlessAdapterConfig + try { + const rawSchema = await schemaFile.json() + schema = parseHeadlessConfig(rawSchema) + } catch (error) { + throw new Error(`Invalid schema: ${error instanceof Error ? error.message : String(error)}`) + } + // Load prompts const prompts = await loadPrompts(promptsPath) // Resolve output path const resolvedOutputPath = outputPath ? resolvePath(outputPath) : undefined + // Determine effective timeout (CLI flag > schema default > harness default) + const schemaTimeout = 'timeout' in schema ? schema.timeout : undefined + const effectiveTimeout = timeout ?? schemaTimeout ?? DEFAULT_HARNESS_TIMEOUT + // Log progress info logProgress(`Loaded ${prompts.length} prompts from ${promptsPath}`, progress) logProgress(`Running ${k} trials per prompt`, progress) - logProgress(`Command: ${agentCommand.join(' ')}`, progress) + logProgress(`Schema: ${schema.name} (${schemaPath})`, progress) + logProgress(`Timeout: ${effectiveTimeout}ms`, progress) if (grader) { logProgress('Grader: enabled (will compute pass@k metrics)', progress) } - // Create ACP client - const client = createACPClient({ - command: agentCommand, - cwd, - timeout, + // Create session manager with schema + const sessions = createSessionManager({ + schema, + timeout: effectiveTimeout, + verbose: progress, + debug, }) // Clear output file if not appending @@ -175,117 +199,115 @@ export const runTrials = async (config: TrialsConfig): Promise => await Bun.write(resolvedOutputPath, '') } - // Session params - agents auto-discover MCP configs from cwd - const sessionParams = { - cwd: cwd ?? process.cwd(), - } - + const workingDir = cwd ?? 
process.cwd() const results: TrialResult[] = [] let isFirstOutput = true - try { - logProgress('Connecting to agent...', progress) - await client.connect() - logProgress('Connected!', progress) + // Run evaluations + for (let i = 0; i < prompts.length; i++) { + const promptCase = prompts[i] + if (!promptCase) continue - // Run evaluations - for (let i = 0; i < prompts.length; i++) { - const promptCase = prompts[i] - if (!promptCase) continue + logProgress(`[${i + 1}/${prompts.length}] ${promptCase.id}: Running ${k} trials...`, progress) - logProgress(`[${i + 1}/${prompts.length}] ${promptCase.id}: Running ${k} trials...`, progress) + const trialEntries: TrialEntry[] = [] - const trialEntries: TrialEntry[] = [] + for (let trialNum = 1; trialNum <= k; trialNum++) { + // Create fresh session for each trial + const session = await sessions.create(workingDir) + const startTime = Date.now() - for (let trialNum = 1; trialNum <= k; trialNum++) { - // Create fresh session for each trial - const session = await client.createSession(sessionParams) - const startTime = Date.now() + try { + // Handle string or array input + const inputs = Array.isArray(promptCase.input) ? promptCase.input : [promptCase.input] + const allUpdates: ParsedUpdate[] = [] + + // TODO: Per-prompt timeout from promptCase.timeout is documented but not yet implemented + + // Execute each turn sequentially + for (const turnInput of inputs) { + const turnResult = await sessions.prompt(session.id, turnInput) + allUpdates.push(...turnResult.updates) + } - try { - const inputText = Array.isArray(promptCase.input) ? 
promptCase.input.join('\n') : promptCase.input - const prompt = createPrompt(inputText) - const { updates } = await client.promptSync(session.id, prompt) + const endTime = Date.now() + const trajectory = extractTrajectory(allUpdates, startTime) + const output = extractOutput(trajectory) - const endTime = Date.now() - const trajectory = extractTrajectory(updates, startTime) - const output = extractOutput(trajectory) + const entry: TrialEntry = { + trialNum, + output, + trajectory, + duration: endTime - startTime, + } - const entry: TrialEntry = { - trialNum, + // Apply grader if provided + if (grader) { + const graderResult = await grader({ + input: promptCase.input, output, + hint: promptCase.hint, trajectory, - duration: endTime - startTime, - } - - // Apply grader if provided - if (grader) { - const graderResult = await grader({ - input: promptCase.input, - output, - hint: promptCase.hint, - trajectory, - }) - entry.pass = graderResult.pass - entry.score = graderResult.score - entry.reasoning = graderResult.reasoning - } - - trialEntries.push(entry) - logProgress( - ` Trial ${trialNum}/${k}: ${entry.pass !== undefined ? (entry.pass ? '✓' : '✗') : '?'}`, - progress, - ) - } catch (error) { - const endTime = Date.now() - const message = error instanceof Error ? error.message : String(error) - - trialEntries.push({ - trialNum, - output: '', - trajectory: [], - duration: endTime - startTime, - pass: false, - reasoning: `Error: ${message}`, }) - logProgress(` Trial ${trialNum}/${k}: ! (error)`, progress) + entry.pass = graderResult.pass + entry.score = graderResult.score + entry.reasoning = graderResult.reasoning } - } - // Build result - const result: TrialResult = { - id: promptCase.id, - input: promptCase.input, - ...(promptCase.hint && { hint: promptCase.hint }), - k, - trials: trialEntries, - } + trialEntries.push(entry) + logProgress( + ` Trial ${trialNum}/${k}: ${entry.pass !== undefined ? (entry.pass ? 
'✓' : '✗') : '?'}`, + progress, + ) - // Calculate metrics if grader was used - if (grader) { - const passes = trialEntries.filter((t) => t.pass).length - result.passRate = passes / k - result.passAtK = calculatePassAtK(passes, k) - result.passExpK = calculatePassExpK(passes, k) + // Clean up session + sessions.destroy(session.id) + } catch (error) { + const endTime = Date.now() + const message = error instanceof Error ? error.message : String(error) + + trialEntries.push({ + trialNum, + output: '', + trajectory: [], + duration: endTime - startTime, + pass: false, + reasoning: `Error: ${message}`, + }) + logProgress(` Trial ${trialNum}/${k}: ! (error)`, progress) } + } - results.push(result) + // Build result + const result: TrialResult = { + id: promptCase.id, + input: promptCase.input, + ...(promptCase.hint && { hint: promptCase.hint }), + k, + trials: trialEntries, + } - // Write result immediately - const formatted = JSON.stringify(result) - await writeOutput(formatted, resolvedOutputPath, !isFirstOutput) - isFirstOutput = false + // Calculate metrics if grader was used + if (grader) { + const passes = trialEntries.filter((t) => t.pass).length + result.passRate = passes / k + result.passAtK = calculatePassAtK(passes, k) + result.passExpK = calculatePassExpK(passes, k) + } - if (grader) { - logProgress( - ` → passRate=${(result.passRate ?? 0).toFixed(2)}, pass@${k}=${(result.passAtK ?? 0).toFixed(2)}`, - progress, - ) - } + results.push(result) + + // Write result immediately + const formatted = JSON.stringify(result) + await writeOutput(formatted, resolvedOutputPath, !isFirstOutput) + isFirstOutput = false + + if (grader) { + logProgress( + ` → passRate=${(result.passRate ?? 0).toFixed(2)}, pass@${k}=${(result.passAtK ?? 
0).toFixed(2)}`, + progress, + ) } - } finally { - logProgress('Disconnecting...', progress) - await client.disconnect() } logProgress('Done!', progress) @@ -305,13 +327,15 @@ export const trials = async (args: string[]): Promise => { const { values, positionals } = parseArgs({ args, options: { + schema: { type: 'string', short: 's' }, output: { type: 'string', short: 'o' }, k: { type: 'string', short: 'k', default: String(DEFAULT_TRIAL_COUNT) }, cwd: { type: 'string', short: 'c' }, - timeout: { type: 'string', short: 't', default: String(DEFAULT_HARNESS_TIMEOUT) }, + timeout: { type: 'string', short: 't' }, progress: { type: 'boolean', default: false }, append: { type: 'boolean', default: false }, grader: { type: 'string', short: 'g' }, + debug: { type: 'boolean', default: false }, help: { type: 'boolean', short: 'h' }, }, allowPositionals: true, @@ -320,20 +344,21 @@ export const trials = async (args: string[]): Promise => { if (values.help) { // biome-ignore lint/suspicious/noConsole: CLI help output console.log(` -Usage: acp-harness trials [args...] 
[options] +Usage: agent-eval-harness trials --schema [options] Arguments: prompts.jsonl Input file with evaluation prompts - command [args] ACP agent command to execute Options: + -s, --schema Path to agent schema JSON file (required) -o, --output Output file (default: stdout) -k Number of trials per prompt (default: ${DEFAULT_TRIAL_COUNT}) - -c, --cwd Working directory for agent (agents auto-discover MCP configs from here) - -t, --timeout Request timeout in ms (default: ${DEFAULT_HARNESS_TIMEOUT}) + -c, --cwd Working directory for agent + -t, --timeout Request timeout in ms (overrides schema default) --progress Show progress to stderr --append Append to output file -g, --grader Path to grader (.ts/.js module or executable script) + --debug Enable debug mode -h, --help Show this help message Output Format: @@ -346,13 +371,13 @@ Graders: Examples: # Capture only - acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 -o trials.jsonl + agent-eval-harness trials prompts.jsonl -s claude.json -k 5 -o trials.jsonl # With TypeScript grader - acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.ts -o trials.jsonl + agent-eval-harness trials prompts.jsonl -s claude.json -k 5 --grader ./grader.ts -o trials.jsonl # With Python grader - acp-harness trials prompts.jsonl bunx claude-code-acp -k 5 --grader ./grader.py -o trials.jsonl + agent-eval-harness trials prompts.jsonl -s claude.json -k 5 --grader ./grader.py -o trials.jsonl `) return } @@ -363,9 +388,9 @@ Examples: process.exit(1) } - const agentCommand = positionals.slice(1) - if (agentCommand.length === 0) { - console.error('Error: ACP agent command is required') + if (!values.schema) { + console.error('Error: --schema is required') + console.error('Example: agent-eval-harness trials prompts.jsonl --schema ./claude.json') process.exit(1) } @@ -382,13 +407,14 @@ Examples: await runTrials({ promptsPath, - agentCommand, + schemaPath: values.schema, k: Number.parseInt(values.k ?? 
String(DEFAULT_TRIAL_COUNT), 10), outputPath: values.output, cwd: values.cwd, - timeout: Number.parseInt(values.timeout ?? String(DEFAULT_HARNESS_TIMEOUT), 10), + timeout: values.timeout ? Number.parseInt(values.timeout, 10) : undefined, progress: values.progress ?? false, append: values.append ?? false, grader, + debug: values.debug ?? false, }) } From 4383e3f1763ef82c199f55b48e0c210f820b4864 Mon Sep 17 00:00:00 2001 From: Edward Irby Date: Wed, 21 Jan 2026 13:21:50 -0800 Subject: [PATCH 02/13] chore: complete ACP terminology cleanup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Rename asset files: Dockerfile.acp → Dockerfile.eval, docker-compose.acp.yml → docker-compose.eval.yml - Update README.md with new package name and CLI examples - Rename constants: ACP_METHODS → PROTOCOL_METHODS, ACP_PROTOCOL_VERSION → PROTOCOL_VERSION - Update CI workflow to use generic filter names - Update all skill documentation to remove ACP references - Update rules examples to use generic terms - Fix GitHub URLs in package.json Co-Authored-By: Claude Opus 4.5 --- .claude/ralph-loop.local.md | 11 +++ .../{Dockerfile.acp => Dockerfile.eval} | 0 ...ompose.acp.yml => docker-compose.eval.yml} | 0 .../references/docker-evals.md | 4 +- .../references/downstream.md | 2 +- .../references/output-formats.md | 2 +- .../references/schema-creation-guide.md | 6 +- .../references/troubleshooting-guide.md | 2 +- .github/CODEOWNERS | 2 +- .github/workflows/ci.yml | 10 +-- .plaited/rules/code-review.md | 4 +- .plaited/rules/module-organization.md | 16 ++-- .plaited/rules/testing.md | 4 +- README.md | 75 ++++++++----------- bin/cli.ts | 2 +- package.json | 6 +- src/balance.ts | 6 +- src/calibrate.ts | 6 +- src/constants.ts | 26 +++---- src/harness.ts | 2 +- src/headless-cli.ts | 16 ++-- src/headless-output-parser.ts | 6 +- src/headless-session-manager.ts | 4 +- src/headless.schemas.ts | 12 +-- src/headless.types.ts | 2 +- src/schemas.ts | 4 +- 
src/summarize.ts | 6 +- src/tests/capture-cli.spec.ts | 2 +- src/tests/constants.spec.ts | 50 ++++++------- src/tests/fixtures/calculator-mcp.ts | 2 +- src/tests/headless.spec.ts | 2 +- src/tests/schemas-cli.spec.ts | 2 +- src/validate-refs.ts | 4 +- 33 files changed, 150 insertions(+), 148 deletions(-) create mode 100644 .claude/ralph-loop.local.md rename .claude/skills/agent-eval-harness/assets/{Dockerfile.acp => Dockerfile.eval} (100%) rename .claude/skills/agent-eval-harness/assets/{docker-compose.acp.yml => docker-compose.eval.yml} (100%) diff --git a/.claude/ralph-loop.local.md b/.claude/ralph-loop.local.md new file mode 100644 index 0000000..ff9849a --- /dev/null +++ b/.claude/ralph-loop.local.md @@ -0,0 +1,11 @@ +--- +active: true +iteration: 1 +max_iterations: 0 +completion_promise: null +started_at: "2026-01-21T21:13:26Z" +--- + +rename asset files scan for acp mention and remove and update. Sacn @README.md and update. When done scab entire project + for thing to cleanup and + remove diff --git a/.claude/skills/agent-eval-harness/assets/Dockerfile.acp b/.claude/skills/agent-eval-harness/assets/Dockerfile.eval similarity index 100% rename from .claude/skills/agent-eval-harness/assets/Dockerfile.acp rename to .claude/skills/agent-eval-harness/assets/Dockerfile.eval diff --git a/.claude/skills/agent-eval-harness/assets/docker-compose.acp.yml b/.claude/skills/agent-eval-harness/assets/docker-compose.eval.yml similarity index 100% rename from .claude/skills/agent-eval-harness/assets/docker-compose.acp.yml rename to .claude/skills/agent-eval-harness/assets/docker-compose.eval.yml diff --git a/.claude/skills/agent-eval-harness/references/docker-evals.md b/.claude/skills/agent-eval-harness/references/docker-evals.md index 283473b..05e3978 100644 --- a/.claude/skills/agent-eval-harness/references/docker-evals.md +++ b/.claude/skills/agent-eval-harness/references/docker-evals.md @@ -1,6 +1,6 @@ # Running Evals in Docker -Docker provides a consistent, isolated 
environment for running ACP evaluations. This guide covers lessons learned from real debugging sessions. +Docker provides a consistent, isolated environment for running agent evaluations. This guide covers lessons learned from real debugging sessions. ## Why Docker? @@ -147,7 +147,7 @@ test-integration: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - - name: Run ACP integration tests + - name: Run integration tests env: ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }} diff --git a/.claude/skills/agent-eval-harness/references/downstream.md b/.claude/skills/agent-eval-harness/references/downstream.md index b9160dd..d3edace 100644 --- a/.claude/skills/agent-eval-harness/references/downstream.md +++ b/.claude/skills/agent-eval-harness/references/downstream.md @@ -372,7 +372,7 @@ jobs: - uses: actions/checkout@v4 - uses: oven-sh/setup-bun@v2 - - name: Install ACP adapter + - name: Install harness run: npm install -g @zed-industries/claude-code-acp - name: Install dependencies diff --git a/.claude/skills/agent-eval-harness/references/output-formats.md b/.claude/skills/agent-eval-harness/references/output-formats.md index a73b65e..daec2c9 100644 --- a/.claude/skills/agent-eval-harness/references/output-formats.md +++ b/.claude/skills/agent-eval-harness/references/output-formats.md @@ -35,7 +35,7 @@ type TrajectoryStep = | { type: 'message'; content: string; timestamp: number; stepId?: string } | { type: 'tool_call' - name: string // Tool title from ACP SDK + name: string // Tool title status: string // pending, in_progress, completed, failed input?: unknown // Raw input parameters output?: unknown // Raw output diff --git a/.claude/skills/headless-adapters/references/schema-creation-guide.md b/.claude/skills/headless-adapters/references/schema-creation-guide.md index d6004fb..4de92e1 100644 --- a/.claude/skills/headless-adapters/references/schema-creation-guide.md +++ 
b/.claude/skills/headless-adapters/references/schema-creation-guide.md @@ -4,7 +4,7 @@ Step-by-step workflow for creating headless adapter schemas for CLI coding agent ## Overview -The headless adapter transforms any CLI agent with JSON output into an ACP-compatible adapter. You just need a schema file describing how to interact with the CLI. +The headless adapter transforms any CLI agent with JSON output into a protocol-compatible adapter. You just need a schema file describing how to interact with the CLI. ## Workflow @@ -81,7 +81,7 @@ AGENT_API_KEY=... exec -o stream-json "Say hello" | jq -c '.' Analyze the output to create event mappings: -| JSON Event | ACP Event Type | Extract Fields | +| JSON Event | Event Type | Extract Fields | |------------|---------------|----------------| | `{"type": "message", ...}` | `message` | `$.content` | | `{"type": "tool_use", ...}` | `tool_call` | `$.name` (title), `"pending"` (status) | @@ -213,7 +213,7 @@ Debug mode shows: **Not yet compatible:** [Copilot CLI](https://docs.github.com/en/copilot/concepts/agents/about-copilot-cli) (no JSON output) -> **Note:** For detailed ACP protocol questions during schema creation, use the `agent-client-protocol-docs` MCP server. See SKILL.md for configuration. +> **Note:** For detailed protocol questions during schema creation, use the `agent-client-protocol-docs` MCP server. See SKILL.md for configuration. 
 ## Troubleshooting
diff --git a/.claude/skills/headless-adapters/references/troubleshooting-guide.md b/.claude/skills/headless-adapters/references/troubleshooting-guide.md
index 03d1c43..b943f43 100644
--- a/.claude/skills/headless-adapters/references/troubleshooting-guide.md
+++ b/.claude/skills/headless-adapters/references/troubleshooting-guide.md
@@ -463,7 +463,7 @@ Output events use a two-step process:
 
 This means:
 - Check if `$.type` equals `"message"`
-- If yes, emit an ACP `message` update
+- If yes, emit a session `message` update
 - Extract content from `$.text`
 
 ### Wildcard Matching
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index c7012ff..81b3564 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -1,4 +1,4 @@
-# Code owners for acp-harness repository
+# Code owners for agent-eval-harness repository
 # These users will be automatically requested for review when someone opens a pull request.
 # See https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners
 
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 2aeea54..ea55cde 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -16,14 +16,14 @@ jobs:
     permissions:
       pull-requests: read
     outputs:
-      acp: ${{ steps.filter.outputs.acp }}
+      src: ${{ steps.filter.outputs.src }}
     steps:
       - uses: actions/checkout@v4
       - uses: dorny/paths-filter@v3
         id: filter
         with:
           filters: |
-            acp:
+            src:
               - 'src/**'
 
   test-pr:
@@ -46,14 +46,14 @@ jobs:
   # - GEMINI_API_KEY: API key for Gemini CLI integration tests
   test-integration:
     needs: changes
-    if: ${{ needs.changes.outputs.acp == 'true' }}
+    if: ${{ needs.changes.outputs.src == 'true' }}
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
 
-      - name: Run ACP integration tests
+      - name: Run integration tests
         env:
           ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
           GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
-        run: docker compose -f docker-compose.test.yml run --rm acp-test
+        run: docker compose -f docker-compose.test.yml run --rm test
diff --git a/.plaited/rules/code-review.md b/.plaited/rules/code-review.md
index db316cc..138dd95 100644
--- a/.plaited/rules/code-review.md
+++ b/.plaited/rules/code-review.md
@@ -100,14 +100,14 @@ const createClient = ({
   command: string[]
   timeout: number
   cwd?: string
-}): ACPClient => { /* ... */ }
+}): SessionManager => { /* ... */ }
 
 // ❌ Avoid: Multiple positional parameters
 const createClient = (
   command: string[],
   timeout: number,
   cwd?: string
-): ACPClient => { /* ... */ }
+): SessionManager => { /* ... */ }
 ```
 
 **Exception - CLI Entry Points:** CLI functions take `args: string[]` because that's what the shell provides—parsing happens inside the function. This rule applies to internal APIs where callers pass typed values directly.
diff --git a/.plaited/rules/module-organization.md b/.plaited/rules/module-organization.md
index cf3dbb8..5d1c31e 100644
--- a/.plaited/rules/module-organization.md
+++ b/.plaited/rules/module-organization.md
@@ -10,11 +10,11 @@ Use named re-export files at the parent level, matching the folder name:
 
 ```
 src/
-├── acp/                 # Feature module
-│   ├── acp.types.ts
-│   ├── acp.schemas.ts
-│   └── acp.ts           # Main implementation
-├── acp.ts               # Re-exports public API from acp/
+├── capture/             # Feature module
+│   ├── capture.types.ts
+│   ├── capture.schemas.ts
+│   └── capture.ts       # Main implementation
+├── capture.ts           # Re-exports public API from capture/
 ├── utils/
 │   └── format.ts
 └── utils.ts             # Re-exports public API from utils/
@@ -26,9 +26,9 @@ When a package has one primary feature, expose that re-export file directly as m
 
 ```json
 {
-  "main": "src/acp.ts",
+  "main": "src/capture.ts",
   "exports": {
-    ".": "./src/acp.ts",
+    ".": "./src/capture.ts",
     "./utils": "./src/utils.ts"
   }
 }
@@ -43,7 +43,7 @@ Always include `.ts` extensions in imports. Bun runs TypeScript natively—no co
 ```typescript
 // ✅ Good
 import { Config } from './module.types.ts'
-import { createClient } from '../acp/acp.ts'
+import { createClient } from '../capture/capture.ts'
 
 // ❌ Avoid
 import { Config } from './module.types'
diff --git a/.plaited/rules/testing.md b/.plaited/rules/testing.md
index 3d4eead..a39fe3e 100644
--- a/.plaited/rules/testing.md
+++ b/.plaited/rules/testing.md
@@ -31,12 +31,12 @@ Use `test` instead of `it` in test files for consistency:
 
 ```typescript
 // ✅ Good
-test('should create ACP client correctly', () => {
+test('should create session manager correctly', () => {
   // ...
 })
 
 // ❌ Avoid
-it('should create ACP client correctly', () => {
+it('should create session manager correctly', () => {
   // ...
 })
 ```
diff --git a/README.md b/README.md
index d97dc0a..2808473 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,10 @@
-# @plaited/acp-harness
+# @plaited/agent-eval-harness
 
-[![npm version](https://img.shields.io/npm/v/@plaited/acp-harness.svg)](https://www.npmjs.com/package/@plaited/acp-harness)
-[![CI](https://github.com/plaited/acp-harness/actions/workflows/ci.yml/badge.svg)](https://github.com/plaited/acp-harness/actions/workflows/ci.yml)
+[![npm version](https://img.shields.io/npm/v/@plaited/agent-eval-harness.svg)](https://www.npmjs.com/package/@plaited/agent-eval-harness)
+[![CI](https://github.com/plaited/agent-eval-harness/actions/workflows/ci.yml/badge.svg)](https://github.com/plaited/agent-eval-harness/actions/workflows/ci.yml)
 [![License: ISC](https://img.shields.io/badge/License-ISC-blue.svg)](https://opensource.org/licenses/ISC)
 
-CLI tool for capturing agent trajectories from ACP-compatible agents. Execute prompts, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. Available as both a CLI tool and as installable skills for AI coding agents.
+CLI tool for capturing agent trajectories from headless CLI agents. Execute prompts, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. Available as both a CLI tool and as installable skills for AI coding agents.
 
 ## CLI Tool
 
@@ -13,59 +13,51 @@ Use these tools directly via the CLI without installation:
 ```bash
 # Using built-in headless adapter (recommended - no extra install needed)
 export ANTHROPIC_API_KEY=sk-...
-bunx @plaited/acp-harness capture prompts.jsonl \
-  bunx @plaited/acp-harness headless --schema ./schemas/claude-headless.json \
+bunx @plaited/agent-eval-harness capture prompts.jsonl \
+  --schema ./schemas/claude-headless.json \
   -o results.jsonl
-
-# Or with an external ACP adapter
-bunx @plaited/acp-harness capture prompts.jsonl bunx claude-code-acp -o results.jsonl
 ```
 
-**Prerequisite:** Set your API key. The `headless` command works with any CLI agent that supports JSON output - no adapter installation required:
+**Prerequisite:** Set your API key. The harness works with any CLI agent that supports JSON output - just provide a schema describing how to interact with it:
 
 ```bash
 export ANTHROPIC_API_KEY=sk-...   # For Claude
 export GEMINI_API_KEY=...         # For Gemini
 ```
 
-Pre-built schemas are available in `.claude/skills/acp-adapters/schemas/` for Claude and Gemini.
+Pre-built schemas are available in `.claude/skills/headless-adapters/schemas/` for Claude and Gemini.
 
 ### Commands
 
 | Command | Description |
 |---------|-------------|
-| `capture ` | Trajectory capture (full JSONL) |
-| `trials ` | Multi-run with pass@k metrics |
+| `capture --schema ` | Trajectory capture (full JSONL) |
+| `trials --schema ` | Multi-run with pass@k metrics |
 | `summarize ` | Derive compact views from results |
 | `calibrate ` | Sample failures for review |
 | `validate-refs ` | Check reference solutions |
 | `balance ` | Analyze test set coverage |
 | `schemas [name]` | Export JSON schemas |
 | `headless --schema ` | Schema-driven adapter for any CLI agent |
-| `adapter:check ` | Validate adapter ACP compliance |
 
 ### Examples
 
 ```bash
 # Capture trajectories using headless adapter (recommended)
-bunx @plaited/acp-harness capture prompts.jsonl \
-  bunx @plaited/acp-harness headless --schema ./schemas/claude-headless.json \
+bunx @plaited/agent-eval-harness capture prompts.jsonl \
+  --schema ./schemas/claude-headless.json \
   -o results.jsonl
 
-# Run trials for pass@k analysis
-bunx @plaited/acp-harness trials prompts.jsonl \
-  bunx @plaited/acp-harness headless --schema ./schemas/claude-headless.json \
-  -k 5 --grader ./grader.ts
+# Run trials for pass@k analysis with debug mode
+bunx @plaited/agent-eval-harness trials prompts.jsonl \
+  --schema ./schemas/claude-headless.json \
+  -k 5 --grader ./grader.ts --debug
 
 # Summarize results
-bunx @plaited/acp-harness summarize results.jsonl -o summary.jsonl
+bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl
 
 # Export schemas
-bunx @plaited/acp-harness schemas CaptureResult --json
-
-# Validate adapter compliance
-bunx @plaited/acp-harness adapter:check \
-  bunx @plaited/acp-harness headless --schema ./schemas/claude-headless.json
+bunx @plaited/agent-eval-harness schemas CaptureResult --json
 ```
 
 ## Skills for AI Agents
@@ -73,14 +65,14 @@ bunx @plaited/acp-harness adapter:check \
 **Install skills** for use with AI coding agents:
 
 ```bash
-curl -fsSL https://raw.githubusercontent.com/plaited/skills-installer/main/install.sh | bash -s -- --agent --project acp-harness
+curl -fsSL https://raw.githubusercontent.com/plaited/skills-installer/main/install.sh | bash -s -- --agent --project agent-eval-harness
 ```
 
 Replace `` with your agent: `claude`, `cursor`, `copilot`, `opencode`, `amp`, `goose`, `factory`
 
 ### Available Skills
 
-#### ACP Harness
+#### Agent Eval Harness
 
 CLI tool for capturing agent trajectories, optimized for TypeScript/JavaScript projects using Bun.
 
@@ -102,23 +94,20 @@ CLI tool for capturing agent trajectories, optimized for TypeScript/JavaScript p
 - Building regression test fixtures for agent behavior
 - Comparing agent responses across configurations
 
-#### ACP Adapters
+#### Headless Adapters
 
-Discover, create, and validate ACP adapters for agent integration.
+Schema-driven adapters for headless CLI agent integration.
 
 **Commands:**
 
 | Command | Description |
 |---------|-------------|
 | `headless` | Schema-driven adapter for any CLI agent |
-| `adapter:scaffold` | Generate new adapter project with handlers |
-| `adapter:check` | Validate ACP protocol compliance |
 
 **Use cases:**
 
 - Wrapping headless CLI agents with schema-driven adapter
 - Finding existing adapters for your agent
-- Building custom ACP adapters from scratch
-- Validating adapter implementations
+- Creating new schemas for CLI agents
 
 ## Input Format
@@ -134,6 +123,7 @@ Discover, create, and validate ACP adapters for agent integration.
 | `hint` | No | Grader context - what to look for |
 | `reference` | No | Reference solution (for validate-refs) |
 | `metadata` | No | Tags, category, difficulty for filtering |
+| `timeout` | No | Override default timeout for this prompt (ms) |
 
 ## Output Format
 
@@ -146,9 +136,10 @@ The harness outputs full trajectory JSONL (`CaptureResult` schema):
   "output": "Here's a button component...",
   "hint": "should contain