Releases: sjnims/cc-plugin-eval

v0.4.0

22 Jan 10:14
5c2eead

Summary

This release adds forceful query termination, comprehensive cost tracking across all pipeline stages, and several bug fixes.

What's Changed

Added

  • Query Termination: Query.close() method for forceful query termination (#340)
  • Comprehensive Cost Tracking: Stage-level cost breakdown in EvaluationOutput (#326)
  • Total Cost Aggregation: All stage costs aggregated into total_cost_usd metric (#325)
  • Evaluation Stage Costs: LLM costs tracked for both sync and batch modes (#324)
  • Generation Stage Costs: LLM costs tracked during scenario generation (#323)
  • Cost Utility: calculateCostFromUsage utility for SDK message responses (#322)
  • CLI Documentation: Added CLI reference and improved help output (#315)
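
The cost utility above (#322) can be sketched roughly as follows. This is a hypothetical TypeScript illustration, not the project's actual code: the real signature, the SDK usage field names (`input_tokens`, `cache_read_input_tokens`), and the pricing structure are assumptions here.

```typescript
// Hypothetical shapes -- the real SDK usage object and pricing table may differ.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
}

interface ModelPricing {
  inputPerMTok: number;     // USD per million input tokens
  outputPerMTok: number;    // USD per million output tokens
  cacheReadPerMTok: number; // USD per million cache-read tokens
}

// Convert a message's token usage into a USD cost figure.
function calculateCostFromUsage(usage: Usage, pricing: ModelPricing): number {
  const input = (usage.input_tokens / 1_000_000) * pricing.inputPerMTok;
  const output = (usage.output_tokens / 1_000_000) * pricing.outputPerMTok;
  const cacheRead =
    ((usage.cache_read_input_tokens ?? 0) / 1_000_000) * pricing.cacheReadPerMTok;
  return input + output + cacheRead;
}
```

Per-stage figures computed this way are what the release then aggregates into `total_cost_usd` (#325).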

Changed

  • Updated Anthropic tooling versions
  • Removed redundant top-level cost fields from EvaluationOutput (#328)
  • Upgraded Claude Code Action workflows to Opus 4.5

Fixed

  • YAML null values normalized to undefined before config validation (#339)
  • Execution cost estimation formula corrected (#332)
  • Plugin load costs included in evaluation metrics total
  • Plugin load API costs tracked in execution metrics (#331)
  • All stage costs tracked in E2E tests
  • Config files aligned with schema model defaults
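
The YAML fix (#339) addresses a common mismatch: YAML parses an empty value (`key:`) as `null`, while schema validators typically treat optional fields as `undefined`. A minimal sketch of that kind of normalization (illustrative, not the project's actual code):

```typescript
// Recursively convert nulls to undefined so optional-field validation
// treats "key:" in YAML the same as an omitted key.
function normalizeNulls(value: unknown): unknown {
  if (value === null) return undefined;
  if (Array.isArray(value)) return value.map(normalizeNulls);
  if (typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [
        k,
        normalizeNulls(v),
      ])
    );
  }
  return value;
}
```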

Full Changelog: v0.3.0...v0.4.0

v0.3.0

19 Jan 22:11
135b9e4

Summary

A significant release with 128 commits since v0.2.0, featuring major performance improvements, new features, and security hardening.

Highlights

  • Session Batching as Default: ~80% faster startup by batching scenarios by component
  • Anthropic Batches API: Parallel LLM judge calls in Stage 4 evaluation
  • Parallel LLM Generation: Stage 2 scenario generation now parallelizes LLM calls
  • Prompt Caching: Reduced token costs via Anthropic prompt caching
  • E2E Test Performance: 93% faster (20min → 87s)
  • Security: ReDoS vulnerability fixes and path boundary validation
  • SDK Resilience: Retry-after-ms, AbortController, configurable timeouts
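
Session batching's speedup comes from grouping scenarios that target the same component so they share one subprocess session rather than each paying startup cost. A simplified sketch of the grouping step (the `Scenario` shape and field names are illustrative):

```typescript
// Illustrative scenario shape -- the real type has more fields.
interface Scenario {
  component: string;
  prompt: string;
}

// Group scenarios by target component so each batch can run
// in a single session instead of one subprocess per scenario.
function batchByComponent(scenarios: Scenario[]): Map<string, Scenario[]> {
  const batches = new Map<string, Scenario[]>();
  for (const s of scenarios) {
    const existing = batches.get(s.component);
    if (existing) existing.push(s);
    else batches.set(s.component, [s]);
  }
  return batches;
}
```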

What's Changed

Added

  • Session Batching: Default execution strategy now batches scenarios by component, reducing subprocess overhead by ~80%
  • Anthropic Batches API: Stage 4 evaluation uses batch API for parallel LLM judge calls
  • Parallel LLM Generation: Stage 2 scenario generation now parallelizes LLM calls
  • Prompt Caching: Anthropic prompt caching for repeated system prompts reduces token costs
  • Per-Model Cost Tracking: Detailed cost breakdown by model with thinking token limits
  • SDK Resilience: Retry-after-ms support, AbortController, configurable timeouts, typed error handling
  • SubagentStart/SubagentStop Hooks: Improved agent detection accuracy
  • PostToolUse Hook Capture: Accurate tool success detection
  • Zod Runtime Validation: LLM judge responses validated with Zod schemas
  • Defense-in-Depth ReDoS Protection: Additional regex safeguards
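
The Zod validation item means judge output is checked at runtime rather than trusted from its declared type. The hand-rolled check below illustrates the kind of constraints such a schema encodes; the response shape and the 0-1 score range are assumptions, not the project's actual schema:

```typescript
// Hypothetical judge response shape -- the release validates this with Zod.
interface JudgeResponse {
  score: number;
  reasoning: string;
}

// Reject malformed LLM output instead of letting it flow into metrics.
function parseJudgeResponse(raw: string): JudgeResponse {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("judge response is not an object");
  }
  const obj = parsed as Record<string, unknown>;
  if (typeof obj.score !== "number" || obj.score < 0 || obj.score > 1) {
    throw new Error("judge score missing or out of range");
  }
  if (typeof obj.reasoning !== "string") {
    throw new Error("judge reasoning missing");
  }
  return { score: obj.score, reasoning: obj.reasoning };
}
```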

Changed

  • Upgraded Claude Agent SDK from 0.1.76 to 0.2.9
  • Updated model pricing and default model selections
  • Major refactoring: reduced cognitive complexity, eliminated DRY violations, broke circular imports

Fixed

  • ReDoS vulnerabilities in regex patterns
  • Rate limiter race condition under parallel execution
  • Plugin load error and SDK timeout warnings in E2E tests
  • System prompt inclusion in batch evaluation requests

Performance

  • E2E test suite 93% faster (20min → 87s)
  • Regex caching and rate limiting granularity optimizations
  • Hook callback allocation optimized in batched execution mode

Security

  • ReDoS vulnerabilities patched in custom sanitization patterns
  • Defense-in-depth ReDoS protection layer added
  • Path boundary validation prevents directory traversal attacks
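
Path boundary validation of this kind typically resolves a candidate path against a root directory and rejects anything that escapes it. A minimal sketch of the approach (an assumption about the technique, not the project's actual code; symlink resolution is a separate concern):

```typescript
import * as path from "node:path";

// True only if `candidate`, resolved against `root`, stays inside `root`.
// Blocks "../" traversal; does not follow symlinks.
function isWithinRoot(root: string, candidate: string): boolean {
  const resolvedRoot = path.resolve(root);
  const resolved = path.resolve(resolvedRoot, candidate);
  const rel = path.relative(resolvedRoot, resolved);
  return rel === "" || (!rel.startsWith("..") && !path.isAbsolute(rel));
}
```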

Full Changelog: v0.2.0...v0.3.0

v0.2.0

11 Jan 02:31

730572c

Summary

Major new evaluation capabilities for hooks and MCP servers, plus E2E test infrastructure.

What's Changed

Added

  • MCP server evaluation with tool detection via mcp__<server>__<tool> pattern (#63)
  • Hooks evaluation with SDKHookResponseMessage event detection (#58, #49)
  • E2E integration tests with real Claude Agent SDK (#68)
  • ReDoS protection for custom sanitization patterns (#66)
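
The mcp__&lt;server&gt;__&lt;tool&gt; naming convention makes MCP tool calls detectable from captured tool names alone. A sketch of that parsing (a simplification: a server name that itself contains a double underscore would be ambiguous under this lazy match):

```typescript
// Split a tool name like "mcp__filesystem__read_file" into its
// server and tool parts; return null for non-MCP tool names.
function parseMcpToolName(
  name: string
): { server: string; tool: string } | null {
  const match = /^mcp__(.+?)__(.+)$/.exec(name);
  if (!match) return null;
  return { server: match[1], tool: match[2] };
}
```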

Changed

  • Modernized CI workflows with updated action versions (#64, #65)
  • Updated dependencies: zod 4.3.5, glob 13.0.0 (#54, #55)
  • Improved README and CLAUDE.md documentation (#69)

Fixed

  • CI not failing on codecov errors for Dependabot PRs
  • CLI --version now reads from package.json instead of a hardcoded value

Full Changelog: v0.1.0...v0.2.0

v0.1.0

03 Jan 04:02
a5b69b8

Summary

First release of cc-plugin-eval: a 4-stage evaluation framework for testing whether Claude Code plugin components trigger as intended.

What's Changed

Added

  • Initial 4-stage evaluation pipeline (Analysis → Generation → Execution → Evaluation)
  • Support for skills, agents, and commands evaluation
  • Programmatic detection via tool capture parsing
  • LLM judge for quality assessment with multi-sampling
  • Resume capability with state checkpointing
  • Cost estimation before execution (dry-run mode)
  • Multiple output formats (JSON, YAML, JUnit XML, TAP)
  • Semantic variation testing for trigger robustness
  • Rate limiter for API call protection (#32)
  • Symlink resolution for plugin path validation (#33)
  • PII filtering for verbose transcript logging (#34)
  • Custom sanitization regex pattern validation (#46)
  • Comprehensive test suite with 943 tests and 93%+ coverage
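
The rate limiter (#32) throttles outgoing API calls. A common design for this is a token bucket, sketched below as an assumption about the approach rather than the actual implementation (timestamps are injected to keep the sketch testable):

```typescript
// Token-bucket limiter: up to `capacity` calls at once, refilled
// continuously at `refillPerSecond`. Timestamps are in milliseconds.
class RateLimiter {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,
    private refillPerSecond: number,
    now: number = Date.now()
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  // Returns true and consumes a token if a call is allowed now.
  tryAcquire(now: number = Date.now()): boolean {
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```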

Changed

  • Tuning configuration extracted from hardcoded values (#26)
  • Renamed seed.yaml to config.yaml for clarity (#25)

Fixed

  • Correct Anthropic structured output API usage in LLM judge (#9)
  • Variance propagation from runJudgment to metrics (#30)
  • Centralized logger and pricing utilities (#43)

Full Changelog: https://github.com/sjnims/cc-plugin-eval/commits/v0.1.0