A framework-agnostic metric for measuring AI code generation quality.
AI coding tools write code and tests together. The tests always pass, but they only validate what the AI thought to check, not what the specification actually requires. There's no independent quality signal.
Shadow Score = (sealed_failures / sealed_total) × 100
Generate acceptance tests from the spec before code exists. Hide them from the AI. Build the code. Then run both test suites. The delta is your Shadow Score: an adversarial, quantitative measure of implementation quality.
| Score | Level | Meaning |
|---|---|---|
| 0% | ✅ Perfect | Implementation tests covered everything |
| 1–15% | 🟢 Minor | Small blind spots (edge cases) |
| 16–30% | 🟡 Moderate | Meaningful gaps in coverage |
| 31–50% | 🟠 Significant | Major gaps; review approach |
| >50% | 🔴 Critical | Fundamental quality issues |
Shadow Score measures what your AI missed when it couldn't see the tests. The name sticks because it maps to something developers already know: shadow testing, running hidden checks alongside production to catch what the main path misses. That's exactly what sealed-envelope testing does. The AI builds code. Shadow tests, written before the code existed and hidden from the builder, judge it. Whatever fails is what the AI couldn't see.
0% = no shadows. The implementation covered everything the spec required.
60% = flying blind. The AI missed more than half the acceptance criteria it never knew about.
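The formula and the level table above can be sketched in a few lines of Python. The function and threshold names here are illustrative, not mandated by the spec:

```python
def shadow_score(sealed_failed: int, sealed_total: int) -> float:
    """Shadow Score = (sealed_failures / sealed_total) * 100."""
    if sealed_total == 0:
        raise ValueError("cannot score an empty sealed suite")
    return sealed_failed / sealed_total * 100

def score_level(score: float) -> str:
    """Map a score to the level names from the table above."""
    if score == 0:
        return "perfect"
    if score <= 15:
        return "minor"
    if score <= 30:
        return "moderate"
    if score <= 50:
        return "significant"
    return "critical"

print(score_level(shadow_score(2, 18)))  # 2 of 18 sealed tests failed -> minor
```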
```shell
# Python
python validators/shadow-score.py --sealed results-sealed.json --open results-open.json

# Go
cd validators && go build -o shadow-score-go . && cd ..
./validators/shadow-score-go --sealed results-sealed.json --open results-open.json

# Shell (minimal deps)
./validators/shadow-score.sh results-sealed.txt results-open.txt
```

- Write sealed tests from your spec (requirements → test cases, before code exists)
- Build the code: the implementer never sees the sealed tests
- Run both suites: sealed tests and the implementer's own tests
- Compute: `failed_sealed / total_sealed × 100`
- Report: use the JSON schema or markdown format
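The compute-and-report steps can be sketched end to end. This is a minimal illustration, not the reference validator; the field names follow the example report in this document:

```python
import json

def build_report(total: int, failed: int, failures: list) -> dict:
    """Compute the Shadow Score and wrap it in the report structure."""
    score = round(failed / total * 100, 1)  # failed_sealed / total_sealed * 100
    thresholds = [(0, "perfect"), (15, "minor"), (30, "moderate"), (50, "significant")]
    level = next((name for bound, name in thresholds if score <= bound), "critical")
    return {
        "shadow_score_spec_version": "1.0.0",
        "report": {"shadow_score": score, "level": level},
        "sealed_tests": {"total": total, "passed": total - failed, "failed": failed},
        "failures": failures,
    }

print(json.dumps(build_report(18, 2, []), indent=2))
```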
That's it. Framework, language, and tooling don't matter: Shadow Score works anywhere.
Shadow Score is computed using the Sealed-Envelope Protocol, a 4-phase testing methodology:
```
SPEC ──► SEAL GENERATION ──► IMPLEMENTATION ──► VALIDATION ──► HARDENING
         (tests from spec,   (code + own       (run both      (fix from
          hidden from         tests, never      suites,        failure msgs
          implementer)        sees sealed)      compute gap)   only, no
                                                               test code)
```
The critical rule: the implementer never sees the sealed tests. During hardening, they receive only failure messages (test name, expected, actual), never the test source code. This forces root-cause fixes, not test-targeting hacks.
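The hardening hand-off can be enforced mechanically by redacting failure records before they reach the implementer. A minimal sketch; the raw record layout here is an assumption, since the protocol only specifies which fields may be disclosed:

```python
# Fields the implementer is allowed to see during hardening.
ALLOWED_FIELDS = ("test_name", "expected", "actual", "message")

def redact_failure(raw: dict) -> dict:
    """Strip a failure record down to the disclosable fields only."""
    return {k: raw[k] for k in ALLOWED_FIELDS if k in raw}

raw = {
    "test_name": "test_rejects_gpl_dependency",
    "expected": "exit code 2",
    "actual": "exit code 0",
    "message": "GPL dependency not blocked",
    "source": "def test_rejects_gpl_dependency(): ...",  # must never leak
    "file": "sealed/test_licenses.py",                   # must never leak
}
safe = redact_failure(raw)
assert "source" not in safe and "file" not in safe
```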
Full protocol details: SPEC.md Β§4
| Level | What's Required | Use Case |
|---|---|---|
| L1 – Shadow Score | Compute + report Shadow Score | Retrofitting onto existing test suites |
| L2 – Sealed Envelope | L1 + test isolation + tamper hash | AI agent pipelines |
| L3 – Full Protocol | L2 + hardening loop + velocity tracking | Production autonomous builds |
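The L2 tamper hash can be as simple as a digest over the sealed test files, recorded at seal time and re-checked at validation. A sketch under the assumption that SHA-256 is acceptable; the spec does not mandate a particular digest:

```python
import hashlib
from pathlib import Path

def seal_hash(paths: list) -> str:
    """SHA-256 over the sorted contents of the sealed test files."""
    digest = hashlib.sha256()
    for path in sorted(str(p) for p in paths):
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()
```

Record the hex digest when the envelope is sealed; a mismatch at validation time means the sealed suite was altered and the run's Shadow Score cannot be trusted.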
The reference Level 3 implementation is Dark Factory, an autonomous agentic build system for the GitHub Copilot CLI with sealed-envelope testing.
| Example | Shadow Score | What Happened |
|---|---|---|
| 01 – Perfect Score | 0% ✅ | All sealed tests passed |
| 02 – Minor Gaps | 11.1% 🟢 | 2 edge cases missed |
| 03 – Critical Gaps | 60% 🔴 | Only happy path tested |
Shadow Reports can be produced in JSON (machine-readable) or Markdown (human-readable).
JSON schema: validators/shadow-report-schema.json
```json
{
  "shadow_score_spec_version": "1.0.0",
  "report": {
    "shadow_score": 11.1,
    "level": "minor"
  },
  "sealed_tests": { "total": 18, "passed": 16, "failed": 2 },
  "failures": [
    {
      "test_name": "test_rejects_gpl_dependency",
      "category": "security",
      "expected": "exit code 2",
      "actual": "exit code 0",
      "message": "GPL dependency not blocked"
    }
  ]
}
```

Full schema: SPEC.md §5
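A quick structural sanity check on a report can be done in plain Python; for real validation, use the JSON Schema at validators/shadow-report-schema.json. The required-field list below is inferred from the example above, not taken from the schema itself:

```python
# Top-level fields and types inferred from the example report.
REQUIRED = {
    "shadow_score_spec_version": str,
    "report": dict,
    "sealed_tests": dict,
    "failures": list,
}

def check_report(report: dict) -> list:
    """Return a list of problems; an empty list means the report looks sound."""
    problems = [f"missing or mistyped field: {name}"
                for name, typ in REQUIRED.items()
                if not isinstance(report.get(name), typ)]
    counts = report.get("sealed_tests")
    if isinstance(counts, dict):
        if counts.get("passed", 0) + counts.get("failed", 0) != counts.get("total", 0):
            problems.append("sealed_tests: passed + failed != total")
    return problems
```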
| Project | Conformance | Description |
|---|---|---|
| Dark Factory | Level 3 | Reference implementation: autonomous agentic build system |
Using Shadow Score? Open a PR to add your project.
Covers: definitions, formula, sealed-envelope protocol, reporting format, conformance levels, worked examples, and FAQ.
Shadow Score is an open specification. Contributions welcome:
- Spec changes: Open an issue to discuss before submitting a PR
- New validators: PRs for additional language validators (Go, Rust, TypeScript) are welcome
- Adopter listings: Add your project to the Adopters table
MIT © 2026 DUBSOpenHub