-
Notifications
You must be signed in to change notification settings - Fork 0
feat: Unix-style pipeline commands and architecture refactor #21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
BREAKING CHANGE: Package renamed from @plaited/acp-harness to @plaited/agent-eval-harness Major changes: - Remove ACP SDK dependency and all ACP protocol handling - Capture/trials now use headless session manager directly - Add debug mode (--debug) for verbose JSONPath matching output - Add exit code/signal tracking with ProcessExitInfo type - Add schema v2 support with timeout field Skill renames: - acp-harness → agent-eval-harness - acp-adapters → headless-adapters CLI changes: - capture/trials now require --schema flag (no positional agent command) - Remove adapter:check and adapter:scaffold commands Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename asset files: Dockerfile.acp → Dockerfile.eval, docker-compose.acp.yml → docker-compose.eval.yml - Update README.md with new package name and CLI examples - Rename constants: ACP_METHODS → PROTOCOL_METHODS, ACP_PROTOCOL_VERSION → PROTOCOL_VERSION - Update CI workflow to use generic filter names - Update all skill documentation to remove ACP references - Update rules examples to use generic terms - Fix GitHub URLs in package.json Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updates Docker service name across all documentation and compose files. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update code examples in rules to use current naming (SessionManager, harness.ts) - Remove agent-skills-spec and agent-client-protocol MCP servers Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename acp-*.spec.ts to claude.spec.ts/gemini.spec.ts - Use createSessionManager instead of removed createACPClient - Load JSON schemas properly with Bun.file().json() before parsing - Fix Gemini schema contentPath from $.stats to $.content - Make math test resilient to Gemini output formatting variations All 12 integration tests pass in Docker. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements "BASH Is All You Need" refactoring following Unix philosophy: composable, single-purpose tools that can be piped together. Core module extraction (src/core/): - loading.ts: loadPrompts(), loadResults(), loadJsonl() - trajectory.ts: extractTrajectory(), extractOutput(), hasToolErrors() - output.ts: writeOutput(), logProgress(), headTailPreview() Pipeline commands (src/pipeline/): - run: execute prompts in schema/simple/shell modes - extract: parse raw output into trajectories - grade: apply grader functions to results - format: convert to jsonl/markdown/csv - compare: compare multiple runs with ranking Schema enhancements: - Add passthrough mode for well-structured agent output - Consolidate to single schema version (prototype stage) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Restructure codebase following module organization rules: - commands/: CLI command implementations (capture, trials, etc.) - headless/: Schema-driven headless adapter system - schemas/: Zod schemas, constants, grader loader - core/: Shared utilities (loading, trajectory, output) - pipeline/: Unix-style pipeline commands (existing) Each module has tests/ subdirectory with fixtures where needed. Re-export files at parent level for clean imports. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Core tests: - Loading utilities (loadJsonl, loadPrompts, loadResults) - Trajectory extraction and analysis - Output utilities (headTailPreview, resolvePath) Pipeline tests: - Type validation for RawOutput, ExtractedResult, GradedResult - Data flow contracts between pipeline stages - Comparison data structures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…nces - Add pipeline commands to AGENTS.md and README.md - Add ./pipeline export to package.json - Fix output-formats.md to match actual TimingSchema (sessionCreation, total, tokens) - Fix troubleshooting-guide.md: adapter:check → headless --debug - Fix schema-creation-guide.md: use package command instead of direct file path - Fix pipeline tests: elapsed → timestamp for TrajectoryStep Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Code Review: Unix-style pipeline commands and architecture refactorI've reviewed PR #21 comprehensively. This is a significant architectural refactor that adds Unix-style pipeline commands and reorganizes the codebase. Overall, the changes are well-designed and follow the project's coding standards. ✅ Strengths1. Architecture & Design
2. Code Quality
3. Testing
🔍 Areas for Consideration1. Breaking Changes
2. Security Considerations Shell Injection Risk in src/pipeline/run.ts:34-36 - Uses basic escaping for single quotes but may not be sufficient for all cases. Add documentation warning about shell injection risks when using --simple or --shell modes. Module Loading in src/pipeline/compare.ts:40 - Dynamically imports user-provided grader modules. Already acceptable since this is the intended behavior, but ensure documentation warns users about running untrusted graders. 3. Code Patterns Error Swallowing in src/pipeline/run.ts:46-52 - Catches all errors and returns empty array, making debugging difficult. Consider logging the error or including error information in the output. Duplicate Code - The runSimple and runShell functions have nearly identical structure. Consider extracting common logic into a shared helper function. 4. Test Coverage Missing Tests for critical paths: No unit tests for src/pipeline/run.ts, src/pipeline/extract.ts, src/pipeline/grade.ts (only pipeline.spec.ts with type tests). Add unit tests for the core pipeline functions, especially error paths. 5. Documentation Excellent documentation updates with comprehensive SKILL.md updates, clear mermaid diagrams, and well-organized reference docs. Minor gaps: No CHANGELOG.md entry for this major version. Migration guide for existing @plaited/acp-harness users would be helpful. 🎯 RecommendationsBefore Merging:
Post-Merge (Future Work):
✅ ConclusionThis is a well-architected refactor that significantly improves the project's flexibility and composability. The code quality is high, follows project standards, and the Unix-style pipeline approach is excellent for enabling diverse evaluation workflows. Recommendation: Approve with minor changes (add tests, update changelog, document security considerations). Files Reviewed: 98 changed files (+5315/-6965 lines) |
Additional Code Review FindingsFollowing repository standards from AGENTS.md and .plaited/rules/: Critical Issues1. Silent Error Swallowing (src/pipeline/run.ts:46-52, 79-82) 2. Missing Unit Tests 3. Breaking Changes Undocumented RecommendationsRequired before merge:
Recommended:
SummaryExcellent Unix-style architecture and clean code quality. Primary concerns are test coverage, error visibility, and migration documentation. Request changes to address these before merge. Standards applied: AGENTS.md, .plaited/rules/testing.md, .plaited/rules/code-review.md |
- runSimple and runShell now return { lines, error? } instead of just lines
- Stderr is captured and included in RawOutput.error field
- Exception messages are also captured instead of returning empty arrays
- Updated TSDoc to reflect new return type
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Addressed Review FeedbackFixed: Silent Error Swallowing (commit 3807aef)
Test Results:
Remaining items (intentionally deferred - prototype stage):
🤖 Generated with Claude Code |
Comprehensive Code Review - PR #21I've reviewed this PR following the repository's AGENTS.md and .plaited/rules/ standards. This is an excellent architectural refactor that successfully transforms the project from ACP-specific to a general-purpose agent evaluation harness with Unix-style pipeline commands. ✅ Excellent Progress Since Last ReviewThe error swallowing issue has been completely resolved in commit 3807aef:
✅ StrengthsArchitecture & Design
Code Quality
Testing
🟡 Minor Observations (Not Blockers)1. Shell Injection Risk DocumentationThe const escapedPrompt = prompt.replace(/'/g, "'\\''"')Status: This is acceptable for the use case (user controls both prompt and command), but consider adding a security note in SKILL.md warning users about shell injection risks when using 2. Test Coverage for Pipeline CommandsThe Status: Current coverage appears adequate for prototype stage. Integration tests in 3. Breaking Changes DocumentationThis PR removes all ACP-specific code (7 files, ~2500 lines) and renames the package from Observation: No CHANGELOG.md exists yet. Per the author's comment, this is intentional as the project is still in prototype stage. This is acceptable for pre-1.0 projects. 4. Code Duplication in run.ts
📋 Standards Compliance Checklist
🎯 RecommendationsOptional Enhancements (Post-Merge)
For Future Work
✅ Final VerdictAPPROVED - This is a well-executed architectural refactor that:
The previous critical issues (error swallowing) have been resolved. The remaining observations are minor enhancements that don't block merging. Excellent work transforming this from an ACP-specific tool to a general-purpose agent evaluation harness! 🎉 Files Reviewed: 98 changed files (+5319/-6965 lines) |
Addresses PR feedback about documenting security risks for --simple and --shell modes that execute prompts via shell commands. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PR Feedback Addressed ✅Addressed the remaining review feedback: 1. Error Logging in
|
Code Review: Unix-style Pipeline Commands and Architecture RefactorI've completed a comprehensive review of PR #21 following the repository's AGENTS.md and .plaited/rules/ standards. This is an excellent architectural refactor that transforms the project from ACP-specific to a general-purpose agent evaluation harness. ✅ SummaryAPPROVED - All previous critical issues have been addressed. The code follows repository standards, maintains high quality, and introduces a well-designed Unix-style pipeline architecture. 🎯 Key Accomplishments1. Error Handling Fixed (Commit 3807aef)The previous critical issue with silent error swallowing has been completely resolved:
2. Security Documentation Added (Commit a2d396e)Shell injection warning properly documented in SKILL.md:
✅ Architecture & Design Excellence
✅ Code Quality Standards ComplianceModule Organization (.plaited/rules/module-organization.md)
Code Review Standards (.plaited/rules/code-review.md)
Bun APIs (.plaited/rules/bun-apis.md)
Testing Standards (.plaited/rules/testing.md)
TSDoc Documentation (.plaited/rules/documentation.md)
🔍 Code Review FindingsSecurity Analysis
Breaking Changes (Intentionally Deferred)This PR removes all ACP-specific code (7 files, ~2500 lines) and renames the package:
Status: No CHANGELOG.md exists yet. Per the author's comment, this is intentional as the project is still in prototype stage. This is acceptable for pre-1.0 projects. Code OrganizationObservation: Status: Not required for this PR. Consider as future enhancement for code maintainability. Test CoverageObservation: Pipeline commands have type validation tests (src/pipeline/tests/pipeline.spec.ts, 356 lines) but lack unit tests for the core execution functions. Status: Current coverage appears adequate for prototype stage. Integration tests in 📋 Files Changed Summary
🎯 Recommendations✅ Completed
Optional Enhancements (Post-Merge)
✅ Final VerdictAPPROVED - This is a well-executed architectural refactor that:
The transformation from an ACP-specific tool to a general-purpose agent evaluation harness is well-designed and maintains the high quality standards of the codebase. Excellent work! 🎉 Review Standards Applied: AGENTS.md, .plaited/rules/module-organization.md, .plaited/rules/code-review.md, .plaited/rules/testing.md, .plaited/rules/bun-apis.md, .plaited/rules/documentation.md, .plaited/rules/accuracy.md |
Summary
run,extract,grade,format,comparefor composable evaluation workflowssrc/core/(loading, trajectory, output)src/Key Changes
Pipeline Commands
New Compare Command
Directory Structure
Test plan
bun run check)🤖 Generated with Claude Code