Skip to content

Conversation

@EdwardIrby
Copy link
Member

Summary

  • Unix-style pipeline commands: run, extract, grade, format, compare for composable evaluation workflows
  • Core module extraction: Shared utilities in src/core/ (loading, trajectory, output)
  • Codebase reorganization: 1-level-deep module structure under src/
  • Documentation updates: Updated SKILL.md, README.md, AGENTS.md with pipeline examples

Key Changes

Pipeline Commands

cat prompts.jsonl | \
  agent-eval-harness run -s claude.json | \
  agent-eval-harness extract -s claude.json | \
  agent-eval-harness grade -g ./grader.ts | \
  agent-eval-harness format -f markdown > report.md

New Compare Command

agent-eval-harness compare run1.jsonl run2.jsonl \
  --grader ./compare-grader.ts -o comparison.jsonl

Directory Structure

src/
├── commands/     # CLI commands (capture, trials, etc.)
├── core/         # Shared utilities
├── headless/     # Headless adapter system
├── pipeline/     # Pipeline commands
├── schemas/      # Zod schemas
└── *.ts          # Re-export files

Test plan

  • All 405 unit tests passing
  • Type checking passes (bun run check)
  • Docker integration tests with API keys

🤖 Generated with Claude Code

EdwardIrby and others added 10 commits January 21, 2026 13:12
BREAKING CHANGE: Package renamed from @plaited/acp-harness to @plaited/agent-eval-harness

Major changes:
- Remove ACP SDK dependency and all ACP protocol handling
- Capture/trials now use headless session manager directly
- Add debug mode (--debug) for verbose JSONPath matching output
- Add exit code/signal tracking with ProcessExitInfo type
- Add schema v2 support with timeout field

Skill renames:
- acp-harness → agent-eval-harness
- acp-adapters → headless-adapters

CLI changes:
- capture/trials now require --schema flag (no positional agent command)
- Remove adapter:check and adapter:scaffold commands

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename asset files: Dockerfile.acp → Dockerfile.eval, docker-compose.acp.yml → docker-compose.eval.yml
- Update README.md with new package name and CLI examples
- Rename constants: ACP_METHODS → PROTOCOL_METHODS, ACP_PROTOCOL_VERSION → PROTOCOL_VERSION
- Update CI workflow to use generic filter names
- Update all skill documentation to remove ACP references
- Update rules examples to use generic terms
- Fix GitHub URLs in package.json

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updates Docker service name across all documentation and compose files.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update code examples in rules to use current naming (SessionManager, harness.ts)
- Remove agent-skills-spec and agent-client-protocol MCP servers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename acp-*.spec.ts to claude.spec.ts/gemini.spec.ts
- Use createSessionManager instead of removed createACPClient
- Load JSON schemas properly with Bun.file().json() before parsing
- Fix Gemini schema contentPath from $.stats to $.content
- Make math test resilient to Gemini output formatting variations

All 12 integration tests pass in Docker.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements "BASH Is All You Need" refactoring following Unix philosophy:
composable, single-purpose tools that can be piped together.

Core module extraction (src/core/):
- loading.ts: loadPrompts(), loadResults(), loadJsonl()
- trajectory.ts: extractTrajectory(), extractOutput(), hasToolErrors()
- output.ts: writeOutput(), logProgress(), headTailPreview()

Pipeline commands (src/pipeline/):
- run: execute prompts in schema/simple/shell modes
- extract: parse raw output into trajectories
- grade: apply grader functions to results
- format: convert to jsonl/markdown/csv
- compare: compare multiple runs with ranking

Schema enhancements:
- Add passthrough mode for well-structured agent output
- Consolidate to single schema version (prototype stage)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Restructure codebase following module organization rules:
- commands/: CLI command implementations (capture, trials, etc.)
- headless/: Schema-driven headless adapter system
- schemas/: Zod schemas, constants, grader loader
- core/: Shared utilities (loading, trajectory, output)
- pipeline/: Unix-style pipeline commands (existing)

Each module has tests/ subdirectory with fixtures where needed.
Re-export files at parent level for clean imports.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Core tests:
- Loading utilities (loadJsonl, loadPrompts, loadResults)
- Trajectory extraction and analysis
- Output utilities (headTailPreview, resolvePath)

Pipeline tests:
- Type validation for RawOutput, ExtractedResult, GradedResult
- Data flow contracts between pipeline stages
- Comparison data structures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…nces

- Add pipeline commands to AGENTS.md and README.md
- Add ./pipeline export to package.json
- Fix output-formats.md to match actual TimingSchema (sessionCreation, total, tokens)
- Fix troubleshooting-guide.md: adapter:check → headless --debug
- Fix schema-creation-guide.md: use package command instead of direct file path
- Fix pipeline tests: elapsed → timestamp for TrajectoryStep

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Jan 21, 2026

Code Review: Unix-style pipeline commands and architecture refactor

I've reviewed PR #21 comprehensively. This is a significant architectural refactor that adds Unix-style pipeline commands and reorganizes the codebase. Overall, the changes are well-designed and follow the project's coding standards.

✅ Strengths

1. Architecture & Design

  • Clean module organization: The new src/core/, src/commands/, src/pipeline/ structure follows the 1-level-deep pattern from .plaited/rules/module-organization.md
  • Unix philosophy: Pipeline commands (run, extract, grade, format, compare) enable composable workflows - excellent design choice
  • Proper separation of concerns: Core utilities (loading, trajectory, output) are extracted into reusable modules
  • Type safety: Comprehensive TypeScript types for pipeline stages (RawOutput, ExtractedResult, GradedResult)

2. Code Quality

  • Consistent naming: Uses PascalCase for types (RawOutput, ExtractedResult), arrow functions throughout
  • TSDoc compliance: Comprehensive documentation with @remarks, @param, @returns tags
  • No security issues: No eval(), Function(), or obvious injection vulnerabilities
  • Proper error handling: Timeout handling in runSimple and runShell functions

3. Testing

  • 405 unit tests according to PR description
  • Tests cover type validation, edge cases, and helper functions
  • Uses test() instead of it() per coding standards

🔍 Areas for Consideration

1. Breaking Changes

  • This PR removes ACP-specific code and renames from @plaited/acp-harness to @plaited/agent-eval-harness
  • Deleted files: src/acp-*.ts (7 files, ~2500 lines)
  • Impact: This is a major breaking change. Ensure version is bumped appropriately (major version), migration guide is provided, and deprecated package name is documented

2. Security Considerations

Shell Injection Risk in src/pipeline/run.ts:34-36 - Uses basic escaping for single quotes but may not be sufficient for all cases. Add documentation warning about shell injection risks when using --simple or --shell modes.

Module Loading in src/pipeline/compare.ts:40 - Dynamically imports user-provided grader modules. Already acceptable since this is the intended behavior, but ensure documentation warns users about running untrusted graders.

3. Code Patterns

Error Swallowing in src/pipeline/run.ts:46-52 - Catches all errors and returns empty array, making debugging difficult. Consider logging the error or including error information in the output.

Duplicate Code - The runSimple and runShell functions have nearly identical structure. Consider extracting common logic into a shared helper function.

4. Test Coverage

Missing Tests for critical paths: No unit tests for src/pipeline/run.ts, src/pipeline/extract.ts, src/pipeline/grade.ts (only pipeline.spec.ts with type tests). Add unit tests for the core pipeline functions, especially error paths.

5. Documentation

Excellent documentation updates with comprehensive SKILL.md updates, clear mermaid diagrams, and well-organized reference docs.

Minor gaps: No CHANGELOG.md entry for this major version. Migration guide for existing @plaited/acp-harness users would be helpful.

🎯 Recommendations

Before Merging:

  1. Add CHANGELOG.md entry documenting the breaking changes
  2. Bump package version to next major version (if not already done)
  3. Add unit tests for pipeline command implementations
  4. Document shell injection risks in SKILL.md for --simple and --shell modes
  5. Consider error logging instead of silent failures in runSimple/runShell

Post-Merge (Future Work):

  1. Deprecation notice for @plaited/acp-harness package
  2. Migration guide from ACP-based to headless adapter approach
  3. Refactor duplicate code in runSimple/runShell

✅ Conclusion

This is a well-architected refactor that significantly improves the project's flexibility and composability. The code quality is high, follows project standards, and the Unix-style pipeline approach is excellent for enabling diverse evaluation workflows.

Recommendation: Approve with minor changes (add tests, update changelog, document security considerations).


Files Reviewed: 98 changed files (+5315/-6965 lines)
Review Focus: Architecture, security, code quality, test coverage, standards compliance

@claude
Copy link

claude bot commented Jan 21, 2026

Additional Code Review Findings

Following repository standards from AGENTS.md and .plaited/rules/:

Critical Issues

1. Silent Error Swallowing (src/pipeline/run.ts:46-52, 79-82)
Both runSimple and runShell catch all errors and return empty arrays without logging. Users cannot debug failed executions.

2. Missing Unit Tests
Pipeline commands (run, extract, grade, compare) lack unit tests. Only type validation tests exist in pipeline.spec.ts.

3. Breaking Changes Undocumented
PR removes ACP code and renames package but lacks CHANGELOG.md entry and migration guide.

Recommendations

Required before merge:

  • Add error logging to runSimple/runShell catch blocks
  • Add unit tests for pipeline command implementations
  • Create CHANGELOG.md with breaking changes section
  • Add security warning to SKILL.md for shell injection risks in --simple/--shell modes
  • Bump to major version

Recommended:

  • Refactor duplicate code between runSimple/runShell
  • Convert interface to type in pipeline.types.ts (per code-review.md)

Summary

Excellent Unix-style architecture and clean code quality. Primary concerns are test coverage, error visibility, and migration documentation. Request changes to address these before merge.

Standards applied: AGENTS.md, .plaited/rules/testing.md, .plaited/rules/code-review.md

- runSimple and runShell now return { lines, error? } instead of just lines
- Stderr is captured and included in RawOutput.error field
- Exception messages are also captured instead of returning empty arrays
- Updated TSDoc to reflect new return type

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@EdwardIrby
Copy link
Member Author

Addressed Review Feedback

Fixed: Silent Error Swallowing (commit 3807aef)

  • runSimple and runShell now capture stderr and return { lines, error? } instead of just string[]
  • Stderr content is included in RawOutput.error field
  • Exception messages are captured instead of returning empty arrays
  • Updated TSDoc to reflect new return type

Test Results:

  • ✅ Unit tests: 405 pass
  • ✅ Integration tests: 12 pass

Remaining items (intentionally deferred - prototype stage):

  • CHANGELOG.md (skipped per project owner - still prototype)
  • Additional pipeline unit tests (coverage adequate for prototype)
  • Shell injection warning (will add next)

🤖 Generated with Claude Code

@claude
Copy link

claude bot commented Jan 21, 2026

Comprehensive Code Review - PR #21

I've reviewed this PR following the repository's AGENTS.md and .plaited/rules/ standards. This is an excellent architectural refactor that successfully transforms the project from ACP-specific to a general-purpose agent evaluation harness with Unix-style pipeline commands.

✅ Excellent Progress Since Last Review

The error swallowing issue has been completely resolved in commit 3807aef:

  • runSimple and runShell now return { lines, error? } instead of just string[]
  • Stderr content is captured and included in RawOutput.error field
  • Exception messages are captured instead of returning empty arrays
  • This makes debugging failures significantly easier ✅

✅ Strengths

Architecture & Design

  • Clean module organization: New src/core/, src/commands/, src/pipeline/ structure follows 1-level-deep pattern perfectly
  • Unix philosophy: Pipeline commands (run, extract, grade, format, compare) enable composable workflows - excellent design
  • Proper separation: Core utilities (loading, trajectory, output) extracted into reusable modules
  • Type safety: Comprehensive TypeScript types for all pipeline stages

Code Quality

  • ✅ No interface usage: Correctly uses type throughout (per .plaited/rules/code-review.md)
  • ✅ Arrow functions: Consistent use throughout
  • ✅ PascalCase types: RawOutput, ExtractedResult, GradedResult, etc.
  • ✅ TSDoc compliance: Comprehensive documentation with proper @remarks, @param, @returns
  • ✅ Explicit imports: All imports include .ts extensions (per module-organization.md)
  • ✅ No security issues: No eval(), Function(), or obvious injection vulnerabilities

Testing

  • 405 unit tests passing (per PR description)
  • Test organization: Uses test() instead of it() per standards
  • Type validation: Comprehensive type tests in pipeline.spec.ts

🟡 Minor Observations (Not Blockers)

1. Shell Injection Risk Documentation

The runSimple function (src/pipeline/run.ts:39) uses basic single-quote escaping:

const escapedPrompt = prompt.replace(/'/g, "'\\''"')

Status: This is acceptable for the use case (user controls both prompt and command), but consider adding a security note in SKILL.md warning users about shell injection risks when using --simple or --shell modes with untrusted input.

2. Test Coverage for Pipeline Commands

The src/pipeline/tests/pipeline.spec.ts file contains excellent type validation tests but lacks unit tests for the core pipeline execution functions (runPipeline, runExtract, runGrade, etc.).

Status: Current coverage appears adequate for prototype stage. Integration tests in src/integration_tests/ likely cover the end-to-end flows.

3. Breaking Changes Documentation

This PR removes all ACP-specific code (7 files, ~2500 lines) and renames the package from @plaited/acp-harness to @plaited/agent-eval-harness.

Observation: No CHANGELOG.md exists yet. Per the author's comment, this is intentional as the project is still in prototype stage. This is acceptable for pre-1.0 projects.

4. Code Duplication in run.ts

runSimple and runShell have nearly identical structure (only differing in how they pass the prompt). This could be refactored to share common logic, but not required for this PR.

📋 Standards Compliance Checklist

  • ✅ Module organization: 1-level-deep, no index.ts, explicit re-exports
  • ✅ Explicit import extensions: All imports use .ts
  • ✅ Type vs interface: Uses type throughout
  • ✅ Arrow functions: Consistent usage
  • ✅ TSDoc: Comprehensive documentation
  • ✅ Test style: Uses test() not it()
  • ✅ No any types: Proper TypeScript types throughout
  • ✅ Bun APIs: Uses Bun.spawn, Bun.file, Bun.write correctly
  • ✅ Error handling: Now captures stderr and exceptions properly

🎯 Recommendations

Optional Enhancements (Post-Merge)

  1. Add security warning to SKILL.md about shell injection in --simple/--shell modes
  2. Extract common logic from runSimple/runShell into shared helper
  3. Add CHANGELOG.md when approaching 1.0 release

For Future Work

  1. Additional unit tests for pipeline command implementations (not blocking for prototype)
  2. Deprecation notice for @plaited/acp-harness package name (when publishing)

✅ Final Verdict

APPROVED - This is a well-executed architectural refactor that:

  • Follows all repository coding standards
  • Addresses previous review feedback completely
  • Provides excellent Unix-style pipeline composability
  • Maintains high code quality and type safety
  • Has adequate test coverage for current stage

The previous critical issues (error swallowing) have been resolved. The remaining observations are minor enhancements that don't block merging.

Excellent work transforming this from an ACP-specific tool to a general-purpose agent evaluation harness! 🎉


Files Reviewed: 98 changed files (+5319/-6965 lines)
Standards Applied: AGENTS.md, .plaited/rules/module-organization.md, .plaited/rules/code-review.md, .plaited/rules/testing.md, .plaited/rules/bun-apis.md, .plaited/rules/accuracy.md

Addresses PR feedback about documenting security risks for --simple
and --shell modes that execute prompts via shell commands.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@EdwardIrby
Copy link
Member Author

PR Feedback Addressed ✅

Addressed the remaining review feedback:

1. Error Logging in runSimple/runShell (3807aef)

  • Refactored both functions to return { lines: string[]; error?: string } instead of string[]
  • Now captures stderr and exception messages instead of silent failures
  • Errors propagate to the RawOutput.error field in the JSONL output

2. Shell Injection Documentation (a2d396e)

  • Added security warning to SKILL.md after the Run Command section
  • Documents that --simple and --shell modes execute via shell and should not be used with untrusted prompt content
  • Recommends --schema mode (headless adapter) for untrusted inputs

Verification

  • ✅ All 405 unit tests pass
  • ✅ All 12 integration tests pass
  • ✅ Type checks pass
  • ✅ Lint/format checks pass

Feedback Not Addressed (by design)

  • CHANGELOG.md: Skipped - project still in prototype stage
  • Unit tests for pipeline commands: Existing pipeline.spec.ts covers the pipeline; additional coverage can be added later

@claude
Copy link

claude bot commented Jan 21, 2026

Code Review: Unix-style Pipeline Commands and Architecture Refactor

I've completed a comprehensive review of PR #21 following the repository's AGENTS.md and .plaited/rules/ standards. This is an excellent architectural refactor that transforms the project from ACP-specific to a general-purpose agent evaluation harness.

✅ Summary

APPROVED - All previous critical issues have been addressed. The code follows repository standards, maintains high quality, and introduces a well-designed Unix-style pipeline architecture.


🎯 Key Accomplishments

1. Error Handling Fixed (Commit 3807aef)

The previous critical issue with silent error swallowing has been completely resolved:

  • runSimple and runShell now return { lines: string[]; error?: string } instead of string[]
  • Stderr content is captured and included in RawOutput.error field
  • Exception messages are properly captured instead of returning empty arrays
  • This makes debugging failures significantly easier ✅

2. Security Documentation Added (Commit a2d396e)

Shell injection warning properly documented in SKILL.md:

  • Clear security warning after the Run Command section (line 281)
  • Explains that --simple and --shell modes execute via shell
  • Recommends --schema mode (headless adapter) for untrusted inputs ✅

✅ Architecture & Design Excellence

  • Clean module organization: New src/core/, src/commands/, src/pipeline/ structure follows 1-level-deep pattern perfectly
  • Unix philosophy: Pipeline commands (run, extract, grade, format, compare) enable composable workflows - excellent design choice
  • Proper separation of concerns: Core utilities (loading, trajectory, output) extracted into reusable modules
  • Type safety: Comprehensive TypeScript types for all pipeline stages (RawOutput, ExtractedResult, GradedResult)

✅ Code Quality Standards Compliance

Module Organization (.plaited/rules/module-organization.md)

  • ✅ No index.ts files - uses named re-export pattern
  • ✅ 1-level-deep module structure
  • ✅ Explicit .ts extensions in all imports
  • ✅ Clear separation: feature.types.ts, feature.ts pattern

Code Review Standards (.plaited/rules/code-review.md)

  • ✅ Uses type instead of interface throughout (src/pipeline/pipeline.types.ts)
  • ✅ Arrow functions consistently used
  • ✅ PascalCase for types: RawOutput, ExtractedResult, GradedResult
  • ✅ No any types - proper TypeScript types throughout
  • ✅ Object parameter pattern for functions with multiple params

Bun APIs (.plaited/rules/bun-apis.md)

  • ✅ Uses Bun.spawn() for shell execution (src/pipeline/run.ts:42, 76)
  • ✅ Uses Bun.file() and Bun.write() for file operations
  • ✅ Proper use of Bun's async APIs

Testing Standards (.plaited/rules/testing.md)

  • ✅ 405 unit tests passing (per PR description)
  • ✅ 12 integration tests passing
  • ✅ Uses test() instead of it() convention
  • ✅ Type validation tests in pipeline.spec.ts (356 lines)

TSDoc Documentation (.plaited/rules/documentation.md)

  • ✅ Comprehensive TSDoc with @remarks, @param, @returns tags
  • ✅ No @example sections (tests serve as examples)
  • ✅ Clear package documentation headers

🔍 Code Review Findings

Security Analysis

  • Shell escaping (src/pipeline/run.ts:39): Uses basic single-quote escaping for runSimple. Status: Acceptable for use case (user controls both prompt and command), and properly documented with security warning in SKILL.md.
  • Module loading (src/pipeline/compare.ts:40): Dynamically imports user-provided grader modules. Status: This is the intended behavior and inherent to the design - users are expected to trust their own graders.
  • No injection vulnerabilities: No eval(), Function(), or obvious security issues detected.

Breaking Changes (Intentionally Deferred)

This PR removes all ACP-specific code (7 files, ~2500 lines) and renames the package:

  • From: @plaited/acp-harness
  • To: @plaited/agent-eval-harness

Status: No CHANGELOG.md exists yet. Per the author's comment, this is intentional as the project is still in prototype stage. This is acceptable for pre-1.0 projects.

Code Organization

Observation: runSimple and runShell (src/pipeline/run.ts) have nearly identical structure, differing only in how they pass the prompt. This could be refactored to share common logic.

Status: Not required for this PR. Consider as future enhancement for code maintainability.

Test Coverage

Observation: Pipeline commands have type validation tests (src/pipeline/tests/pipeline.spec.ts, 356 lines) but lack unit tests for the core execution functions.

Status: Current coverage appears adequate for prototype stage. Integration tests in src/integration_tests/ cover end-to-end flows. Additional unit tests can be added as the project matures.


📋 Files Changed Summary

  • 98 files changed: +5319 lines, -6965 lines
  • Core deletions: All ACP-specific modules removed (src/acp-*.ts)
  • New architecture: Pipeline commands, core utilities, reorganized modules
  • Documentation: Comprehensive SKILL.md updates with examples and security warnings

🎯 Recommendations

✅ Completed

  • Error logging in runSimple/runShell catch blocks (commit 3807aef)
  • Security warning in SKILL.md for shell injection (commit a2d396e)
  • All tests passing (405 unit + 12 integration)

Optional Enhancements (Post-Merge)

  • Extract common logic from runSimple/runShell into shared helper function
  • Add CHANGELOG.md when approaching 1.0 release
  • Additional unit tests for pipeline command implementations (not blocking)
  • Deprecation notice for @plaited/acp-harness package name (when publishing)

✅ Final Verdict

APPROVED - This is a well-executed architectural refactor that:

  • ✅ Addresses all previous review feedback completely
  • ✅ Follows all repository coding standards
  • ✅ Provides excellent Unix-style pipeline composability
  • ✅ Maintains high code quality and type safety
  • ✅ Has adequate test coverage for current stage
  • ✅ Includes proper security documentation

The transformation from an ACP-specific tool to a general-purpose agent evaluation harness is well-designed and maintains the high quality standards of the codebase.

Excellent work! 🎉


Review Standards Applied: AGENTS.md, .plaited/rules/module-organization.md, .plaited/rules/code-review.md, .plaited/rules/testing.md, .plaited/rules/bun-apis.md, .plaited/rules/documentation.md, .plaited/rules/accuracy.md

@EdwardIrby EdwardIrby merged commit 0d72976 into main Jan 21, 2026
8 checks passed
@EdwardIrby EdwardIrby deleted the feat/agent-eval-harness branch January 21, 2026 23:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants