Releases: sjnims/cc-plugin-eval
v0.4.0
Summary
This release adds forceful query termination, comprehensive cost tracking across all pipeline stages, and several bug fixes.
What's Changed
Added
- Query Termination: Query.close() method for forceful query termination (#340)
- Comprehensive Cost Tracking: Stage-level cost breakdown in EvaluationOutput (#326)
- Total Cost Aggregation: All stage costs aggregated into total_cost_usd metric (#325)
- Evaluation Stage Costs: LLM costs tracked for both sync and batch modes (#324)
- Generation Stage Costs: LLM costs tracked during scenario generation (#323)
- Cost Utility: calculateCostFromUsage utility for SDK message responses (#322)
- CLI Documentation: Added CLI reference and improved help output (#315)
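As a rough illustration of the cost aggregation described in the list above, the sketch below sums per-stage costs into a single total_cost_usd value. Only total_cost_usd and the four stage names come from these release notes; the StageCosts type, the aggregateCosts function, and the stage_costs field are hypothetical and may not match the project's actual EvaluationOutput shape.

```typescript
// Stage names follow the pipeline described in the v0.1.0 notes.
const STAGES = ["analysis", "generation", "execution", "evaluation"] as const;
type Stage = (typeof STAGES)[number];

// Hypothetical shape: one USD cost per pipeline stage.
type StageCosts = Record<Stage, number>;

// Hypothetical helper: keeps the per-stage breakdown and adds the aggregate.
function aggregateCosts(costs: StageCosts): {
  stage_costs: StageCosts;
  total_cost_usd: number;
} {
  let total = 0;
  for (const stage of STAGES) {
    total += costs[stage];
  }
  return { stage_costs: costs, total_cost_usd: total };
}
```

Keeping the breakdown next to the total means a single stage's spend can still be inspected after aggregation.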
Changed
- Updated Anthropic tooling versions
- Removed redundant top-level cost fields from EvaluationOutput (#328)
- Upgraded Claude Code Action workflows to Opus 4.5
Fixed
- YAML null values normalized to undefined before config validation (#339)
- Execution cost estimation formula corrected (#332)
- Plugin load costs included in evaluation metrics total
- Plugin load API costs tracked in execution metrics (#331)
- All stage costs tracked in E2E tests
- Config files aligned with schema model defaults
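The YAML null fix (#339) listed above might look something like the following sketch. An empty YAML key such as `timeout:` parses to null, and validators that treat only undefined as "missing" (as Zod's .optional() and .default() do) would reject it. The normalizeNulls helper is hypothetical and is not taken from the project's source.

```typescript
// Hypothetical helper: recursively replaces null with undefined so that
// empty YAML keys behave like omitted keys during validation.
function normalizeNulls(value: unknown): unknown {
  if (value === null) return undefined;
  if (Array.isArray(value)) {
    return value.map((item) => normalizeNulls(item));
  }
  if (typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = normalizeNulls(inner);
    }
    return out;
  }
  // Primitives (string, number, boolean, undefined) pass through unchanged.
  return value;
}
```

Whether array elements should also be normalized is a design choice; this sketch normalizes them for consistency.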
Full Changelog: v0.3.0...v0.4.0
v0.3.0
Summary
A significant release with 128 commits since v0.2.0, featuring major performance improvements, new features, and security hardening.
Highlights
- Session Batching as Default: ~80% faster startup by batching scenarios by component
- Anthropic Batches API: Parallel LLM judge calls in Stage 4 evaluation
- Parallel LLM Generation: Stage 2 scenario generation now parallelizes LLM calls
- Prompt Caching: Reduced token costs via Anthropic prompt caching
- E2E Test Performance: 93% faster (20min → 87s)
- Security: ReDoS vulnerability fixes and path boundary validation
- SDK Resilience: Retry-after-ms, AbortController, configurable timeouts
What's Changed
Added
- Session Batching: Default execution strategy now batches scenarios by component, reducing subprocess overhead by ~80%
- Anthropic Batches API: Stage 4 evaluation uses batch API for parallel LLM judge calls
- Parallel LLM Generation: Stage 2 scenario generation now parallelizes LLM calls
- Prompt Caching: Anthropic prompt caching for repeated system prompts reduces token costs
- Per-Model Cost Tracking: Detailed cost breakdown by model with thinking token limits
- SDK Resilience: Retry-after-ms support, AbortController, configurable timeouts, typed error handling
- SubagentStart/SubagentStop Hooks: Improved agent detection accuracy
- PostToolUse Hook Capture: Accurate tool success detection
- Zod Runtime Validation: LLM judge responses validated with Zod schemas
- Defense-in-Depth ReDoS Protection: Additional regex safeguards
Changed
- Upgraded Claude Agent SDK from 0.1.76 to 0.2.9
- Updated model pricing and default model selections
- Major refactoring: reduced cognitive complexity, eliminated DRY violations, broke circular imports
Fixed
- ReDoS vulnerabilities in regex patterns
- Rate limiter race condition under parallel execution
- Plugin load error and SDK timeout warnings in E2E tests
- System prompt inclusion in batch evaluation requests
Performance
- E2E test suite 93% faster (20min → 87s)
- Regex caching and rate limiting granularity optimizations
- Hook callback allocation optimized in batched execution mode
Security
- ReDoS vulnerabilities patched in custom sanitization patterns
- Defense-in-depth ReDoS protection layer added
- Path boundary validation prevents directory traversal attacks
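For readers unfamiliar with path boundary validation, here is a minimal sketch of the general technique using Node's path module. The isWithinBoundary function is hypothetical and is not the project's implementation. The check is purely lexical and does not resolve symlinks, which the v0.1.0 notes list as a separate step.

```typescript
import * as path from "node:path";

// Hypothetical helper: returns true only when `candidate` resolves to `root`
// itself or to a location inside it. Lexical check only; symlinks are not resolved.
function isWithinBoundary(root: string, candidate: string): boolean {
  const resolvedRoot = path.resolve(root);
  const resolved = path.resolve(resolvedRoot, candidate);
  const rel = path.relative(resolvedRoot, resolved);
  if (rel === "") return true; // candidate is the root itself
  // A relative path that climbs out of the root starts with a ".." segment.
  const escapes = rel === ".." || rel.startsWith(".." + path.sep);
  // On Windows, a different drive yields an absolute "relative" path.
  return !escapes && !path.isAbsolute(rel);
}
```

Comparing via path.relative rather than a plain string prefix avoids accepting sibling directories such as `/plugins/demo-evil` when the root is `/plugins/demo`.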
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Summary
Major new evaluation capabilities for hooks and MCP servers, plus E2E test infrastructure.
What's Changed
Added
- MCP server evaluation with tool detection via mcp__<server>__<tool> pattern (#63)
- Hooks evaluation with SDKHookResponseMessage event detection (#58, #49)
- E2E integration tests with real Claude Agent SDK (#68)
- ReDoS protection for custom sanitization patterns (#66)
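The mcp__<server>__<tool> naming pattern mentioned above can be illustrated with a small parser. This is a sketch under assumptions: parseMcpToolName and McpToolRef are hypothetical names, and the split at the first double underscore assumes server names contain no `__`. It uses plain string operations rather than a regular expression.

```typescript
// Hypothetical result type for a parsed MCP tool name.
interface McpToolRef {
  server: string;
  tool: string;
}

// Hypothetical helper: splits "mcp__<server>__<tool>" into its parts,
// returning null for names that do not follow the pattern.
function parseMcpToolName(name: string): McpToolRef | null {
  const prefix = "mcp__";
  if (!name.startsWith(prefix)) return null;
  const rest = name.slice(prefix.length);
  // Assumption: the server name ends at the first "__".
  const sep = rest.indexOf("__");
  if (sep <= 0 || sep + 2 >= rest.length) return null; // empty server or tool
  return { server: rest.slice(0, sep), tool: rest.slice(sep + 2) };
}

// parseMcpToolName("mcp__github__create_issue")
//   returns { server: "github", tool: "create_issue" }
// parseMcpToolName("Read") returns null
```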
Changed
- Modernized CI workflows with updated action versions (#64, #65)
- Updated dependencies: zod 4.3.5, glob 13.0.0 (#54, #55)
- Improved README and CLAUDE.md documentation (#69)
Fixed
- CI not failing on codecov errors for Dependabot PRs
- CLI --version now reads from package.json instead of hardcoded value
Full Changelog: v0.1.0...v0.2.0
v0.1.0
Summary
First release of cc-plugin-eval, a 4-stage evaluation framework for testing Claude Code plugin component triggering.
What's Changed
Added
- Initial 4-stage evaluation pipeline (Analysis → Generation → Execution → Evaluation)
- Support for skills, agents, and commands evaluation
- Programmatic detection via tool capture parsing
- LLM judge for quality assessment with multi-sampling
- Resume capability with state checkpointing
- Cost estimation before execution (dry-run mode)
- Multiple output formats (JSON, YAML, JUnit XML, TAP)
- Semantic variation testing for trigger robustness
- Rate limiter for API call protection (#32)
- Symlink resolution for plugin path validation (#33)
- PII filtering for verbose transcript logging (#34)
- Custom sanitization regex pattern validation (#46)
- Comprehensive test suite with 943 tests and 93%+ coverage
Changed
- Tuning configuration extracted from hardcoded values (#26)
- Renamed seed.yaml to config.yaml for clarity (#25)
Fixed
- Correct Anthropic structured output API usage in LLM judge (#9)
- Variance propagation from runJudgment to metrics (#30)
- Centralized logger and pricing utilities (#43)
Full Changelog: https://github.com/sjnims/cc-plugin-eval/commits/v0.1.0