Releases: sjnims/cc-plugin-eval

v0.4.0

22 Jan 10:14
5c2eead

Summary

This release adds forceful query termination, comprehensive cost tracking across all pipeline stages, and several bug fixes.

What's Changed

Added

  • Query Termination: Query.close() method for forceful query termination (#340)
  • Comprehensive Cost Tracking: Stage-level cost breakdown in EvaluationOutput (#326)
  • Total Cost Aggregation: All stage costs aggregated into total_cost_usd metric (#325)
  • Evaluation Stage Costs: LLM costs tracked for both sync and batch modes (#324)
  • Generation Stage Costs: LLM costs tracked during scenario generation (#323)
  • Cost Utility: calculateCostFromUsage utility for SDK message responses (#322)
  • CLI Documentation: Added CLI reference and improved help output (#315)
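
The cost utility above (#322) can be sketched roughly as follows. This is a hypothetical TypeScript illustration, not the project's actual code: the real signature, the SDK usage field names (`input_tokens`, `cache_read_input_tokens`), and the pricing structure are assumptions here.

```typescript
// Hypothetical shapes -- the real SDK usage object and pricing table may differ.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
}

interface ModelPricing {
  inputPerMTok: number;     // USD per million input tokens
  outputPerMTok: number;    // USD per million output tokens
  cacheReadPerMTok: number; // USD per million cache-read tokens
}

// Convert a message's token usage into a USD cost figure.
function calculateCostFromUsage(usage: Usage, pricing: ModelPricing): number {
  const input = (usage.input_tokens / 1_000_000) * pricing.inputPerMTok;
  const output = (usage.output_tokens / 1_000_000) * pricing.outputPerMTok;
  const cacheRead =
    ((usage.cache_read_input_tokens ?? 0) / 1_000_000) * pricing.cacheReadPerMTok;
  return input + output + cacheRead;
}
```

Per-stage figures computed this way are what the release then aggregates into `total_cost_usd` (#325).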

Changed

  • Updated Anthropic tooling versions
  • Removed redundant top-level cost fields from EvaluationOutput (#328)
  • Upgraded Claude Code Action workflows to Opus 4.5

Fixed

  • YAML null values normalized to undefined before config validation (#339)
  • Execution cost estimation formula corrected (#332)
  • Plugin load costs included in evaluation metrics total
  • Plugin load API costs tracked in execution metrics (#331)
  • All stage costs tracked in E2E tests
  • Config files aligned with schema model defaults
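
The YAML fix (#339) addresses a common mismatch: YAML parses an empty value (`key:`) as `null`, while schema validators typically treat optional fields as `undefined`. A minimal sketch of that kind of normalization (illustrative, not the project's actual code):

```typescript
// Recursively convert nulls to undefined so optional-field validation
// treats "key:" in YAML the same as an omitted key.
function normalizeNulls(value: unknown): unknown {
  if (value === null) return undefined;
  if (Array.isArray(value)) return value.map(normalizeNulls);
  if (typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [
        k,
        normalizeNulls(v),
      ])
    );
  }
  return value;
}
```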

Full Changelog: v0.3.0...v0.4.0

v0.3.0

19 Jan 22:11
135b9e4

Summary

A significant release with 128 commits since v0.2.0, featuring major performance improvements, new features, and security hardening.

Highlights

  • Session Batching as Default: ~80% faster startup by batching scenarios by component
  • Anthropic Batches API: Parallel LLM judge calls in Stage 4 evaluation
  • Parallel LLM Generation: Stage 2 scenario generation now parallelizes LLM calls
  • Prompt Caching: Reduced token costs via Anthropic prompt caching
  • E2E Test Performance: 93% faster (20min → 87s)
  • Security: ReDoS vulnerability fixes and path boundary validation
  • SDK Resilience: Retry-after-ms, AbortController, configurable timeouts
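
Session batching's speedup comes from grouping scenarios that target the same component so they share one subprocess session rather than each paying startup cost. A simplified sketch of the grouping step (the `Scenario` shape and field names are illustrative):

```typescript
// Illustrative scenario shape -- the real type has more fields.
interface Scenario {
  component: string;
  prompt: string;
}

// Group scenarios by target component so each batch can run
// in a single session instead of one subprocess per scenario.
function batchByComponent(scenarios: Scenario[]): Map<string, Scenario[]> {
  const batches = new Map<string, Scenario[]>();
  for (const s of scenarios) {
    const existing = batches.get(s.component);
    if (existing) existing.push(s);
    else batches.set(s.component, [s]);
  }
  return batches;
}
```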

What's Changed

Added

  • Session Batching: Default execution strategy now batches scenarios by component, reducing subprocess overhead by ~80%
  • Anthropic Batches API: Stage 4 evaluation uses batch API for parallel LLM judge calls
  • Parallel LLM Generation: Stage 2 scenario generation now parallelizes LLM calls
  • Prompt Caching: Anthropic prompt caching for repeated system prompts reduces token costs
  • Per-Model Cost Tracking: Detailed cost breakdown by model with thinking token limits
  • SDK Resilience: Retry-after-ms support, AbortController, configurable timeouts, typed error handling
  • SubagentStart/SubagentStop Hooks: Improved agent detection accuracy
  • PostToolUse Hook Capture: Accurate tool success detection
  • Zod Runtime Validation: LLM judge responses validated with Zod schemas
  • Defense-in-Depth ReDoS Protection: Additional regex safeguards
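
The Zod validation item means judge output is checked at runtime rather than trusted from its declared type. The hand-rolled check below illustrates the kind of constraints such a schema encodes; the response shape and the 0-1 score range are assumptions, not the project's actual schema:

```typescript
// Hypothetical judge response shape -- the release validates this with Zod.
interface JudgeResponse {
  score: number;
  reasoning: string;
}

// Reject malformed LLM output instead of letting it flow into metrics.
function parseJudgeResponse(raw: string): JudgeResponse {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("judge response is not an object");
  }
  const obj = parsed as Record<string, unknown>;
  if (typeof obj.score !== "number" || obj.score < 0 || obj.score > 1) {
    throw new Error("judge score missing or out of range");
  }
  if (typeof obj.reasoning !== "string") {
    throw new Error("judge reasoning missing");
  }
  return { score: obj.score, reasoning: obj.reasoning };
}
```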

Changed

  • Upgraded Claude Agent SDK from 0.1.76 to 0.2.9
  • Updated model pricing and default model selections
  • Major refactoring: reduced cognitive complexity, eliminated DRY violations, broke circular imports

Fixed

  • ReDoS vulnerabilities in regex patterns
  • Rate limiter race condition under parallel execution
  • Plugin load error and SDK timeout warnings in E2E tests
  • System prompt inclusion in batch evaluation requests

Performance

  • E2E test suite 93% faster (20min → 87s)
  • Regex caching and rate limiting granularity optimizations
  • Hook callback allocation optimized in batched execution mode

Security

  • ReDoS vulnerabilities patched in custom sanitization patterns
  • Defense-in-depth ReDoS protection layer added
  • Path boundary validation prevents directory traversal attacks
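
Path boundary validation of this kind typically resolves a candidate path against a root directory and rejects anything that escapes it. A minimal sketch of the approach (an assumption about the technique, not the project's actual code; symlink resolution is a separate concern):

```typescript
import * as path from "node:path";

// True only if `candidate`, resolved against `root`, stays inside `root`.
// Blocks "../" traversal; does not follow symlinks.
function isWithinRoot(root: string, candidate: string): boolean {
  const resolvedRoot = path.resolve(root);
  const resolved = path.resolve(resolvedRoot, candidate);
  const rel = path.relative(resolvedRoot, resolved);
  return rel === "" || (!rel.startsWith("..") && !path.isAbsolute(rel));
}
```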

Full Changelog: v0.2.0...v0.3.0

v0.2.0

11 Jan 02:31

730572c

Summary

Major new evaluation capabilities for hooks and MCP servers, plus E2E test infrastructure.

What's Changed

Added

  • MCP server evaluation with tool detection via mcp__<server>__<tool> pattern (#63)
  • Hooks evaluation with SDKHookResponseMessage event detection (#58, #49)
  • E2E integration tests with real Claude Agent SDK (#68)
  • ReDoS protection for custom sanitization patterns (#66)
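
The mcp__&lt;server&gt;__&lt;tool&gt; naming convention makes MCP tool calls detectable from captured tool names alone. A sketch of that parsing (a simplification: a server name that itself contains a double underscore would be ambiguous under this lazy match):

```typescript
// Split a tool name like "mcp__filesystem__read_file" into its
// server and tool parts; return null for non-MCP tool names.
function parseMcpToolName(
  name: string
): { server: string; tool: string } | null {
  const match = /^mcp__(.+?)__(.+)$/.exec(name);
  if (!match) return null;
  return { server: match[1], tool: match[2] };
}
```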

Changed

  • Modernized CI workflows with updated action versions (#64, #65)
  • Updated dependencies: zod 4.3.5, glob 13.0.0 (#54, #55)
  • Improved README and CLAUDE.md documentation (#69)

Fixed

  • CI not failing on codecov errors for Dependabot PRs
  • CLI --version now reads from package.json instead of a hardcoded value

Full Changelog: v0.1.0...v0.2.0

v0.1.0

03 Jan 04:02
a5b69b8

Summary

First release of cc-plugin-eval: a 4-stage evaluation framework for testing whether Claude Code plugin components trigger as intended.

What's Changed

Added

  • Initial 4-stage evaluation pipeline (Analysis → Generation → Execution → Evaluation)
  • Support for skills, agents, and commands evaluation
  • Programmatic detection via tool capture parsing
  • LLM judge for quality assessment with multi-sampling
  • Resume capability with state checkpointing
  • Cost estimation before execution (dry-run mode)
  • Multiple output formats (JSON, YAML, JUnit XML, TAP)
  • Semantic variation testing for trigger robustness
  • Rate limiter for API call protection (#32)
  • Symlink resolution for plugin path validation (#33)
  • PII filtering for verbose transcript logging (#34)
  • Custom sanitization regex pattern validation (#46)
  • Comprehensive test suite with 943 tests and 93%+ coverage
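
The rate limiter (#32) throttles outgoing API calls. A common design for this is a token bucket, sketched below as an assumption about the approach rather than the actual implementation (timestamps are injected to keep the sketch testable):

```typescript
// Token-bucket limiter: up to `capacity` calls at once, refilled
// continuously at `refillPerSecond`. Timestamps are in milliseconds.
class RateLimiter {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,
    private refillPerSecond: number,
    now: number = Date.now()
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  // Returns true and consumes a token if a call is allowed now.
  tryAcquire(now: number = Date.now()): boolean {
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```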

Changed

  • Tuning configuration extracted from hardcoded values (#26)
  • Renamed seed.yaml to config.yaml for clarity (#25)

Fixed

  • Correct Anthropic structured output API usage in LLM judge (#9)
  • Variance propagation from runJudgment to metrics (#30)
  • Centralized logger and pricing utilities (#43)

Full Changelog: https://github.com/sjnims/cc-plugin-eval/commits/v0.1.0