Deepen SDLC review quality with analysis and suggestions fields by james-in-a-box[bot] · Pull Request #899 · jwbron/egg

james-in-a-box · 2026-02-24T08:39:20Z

Summary

Deepen SDLC pipeline review quality to match PR reviewer thoroughness.

SDLC reviewers were producing shallow ~2 sentence verdicts with empty feedback even when approving, while PR reviewers on the same code produced detailed file-by-file analysis, found specific issues, and drove follow-up fixes. The root cause was threefold: the verdict schema had "feedback": "empty if approved", the SDLC prompts lacked the review conventions and procedural rigor of the PR prompts, and there was no way to surface non-blocking observations.

This PR addresses all three gaps:

Expanded verdict schema: Added analysis (always populated) and suggestions (non-blocking, even when approving) fields to ReviewVerdict. Introduced AggregatedReviewResult NamedTuple that separates blocking feedback from advisory content.
Enhanced review prompts: Added review conventions section (comprehensive, specific, direct, suggest fixes, provide context) to all SDLC reviewers. Code reviewers get "last line of defense" framing. Each reviewer type gets analysis format guidance (file-by-file for code, criterion-by-criterion for contract, section-by-section for plan/refine). Draft-based reviewers get expanded procedural steps requiring cross-referencing and evidence citing.
Advisory content logging: Aggregation logic now collects analysis and suggestions from ALL verdicts (including approved) and logs them for observability, while only passing blocking feedback to revision cycles.

Backward compatible — old verdict JSON without the new fields parses correctly via Pydantic defaults.

Closes #897

Test plan

Run pytest orchestrator/tests/test_pipeline_prompts.py -v — verify new TestBuildReviewPrompt class passes (22 tests covering verdict format, conventions, preambles, delta review, field guidelines)
Run pytest orchestrator/tests/test_tier3_execute.py -v — verify backward compat and new-field parsing tests pass
Run pytest tests/workflows/test_multi_reviewer.py -v — verify aggregation tests including advisory content collection
Run full pytest orchestrator/tests/ tests/workflows/ to confirm no regressions
Verify that the sdlc-pipeline.md docs accurately reflect the new schema and conventions

Authored-by: egg

Container 9b1c05f58615245f7dd6774518c44fb90d3f811b8a1c539ada5df9eaba8d091f exited with uncommitted changes. This commit preserves the agent's work-in-progress. Authored-by: egg

Container 004a65f8ea27287f969ec30e7caf40ed62e070578cedaeb6461e6275eec2f497 exited with uncommitted changes. This commit preserves the agent's work-in-progress. Authored-by: egg

…deeper SDLC reviews

When the docker SDK is not installed, other test files inject MagicMock into sys.modules["docker"], causing docker_client.py's `from docker.errors import NotFound, DockerException` to bind MagicMock objects instead of real exception classes. This breaks `except NotFound` clauses at runtime (TypeError: catching classes that do not inherit from BaseException). The conftest now creates proper exception class stubs matching the docker SDK hierarchy and registers them in sys.modules before test collection, so that both the source module and tests use real exception classes. Fixes 2 failing tests: - TestDockerClientConnection.test_is_connected_false - TestContainerOperations.test_start_container_not_found

egg-reviewer

No agent-mode design concerns.

The structured verdict JSON (with new analysis, suggestions, feedback fields) has a legitimate downstream consumer — the _aggregate_review_verdicts() function programmatically separates blocking vs. advisory content. This is machine-to-machine parsing, which is the appropriate use case for structured output per the design guidelines.

The review conventions added to prompts describe quality attributes ("be comprehensive", "be specific", "suggest fixes") and analysis format expectations ("file-by-file", "criterion-by-criterion") rather than rigid step-by-step procedures. The expanded procedural steps for draft-based reviewers are methodological guidance that helps the agent understand what thorough review looks like for each reviewer type — orientation, not constraint.

The agent-design reviewer preamble change (allowing brief approvals when no concerns exist) is a positive alignment — it avoids demanding unnecessary ceremony from the agent.

— Authored by egg

egg-reviewer

Review: Deepen SDLC review quality with analysis and suggestions fields

Blocking Issue

1. 198 .hypothesis/constants/ files committed (must fix)

This PR introduces 198 new files under .hypothesis/constants/. These are Hypothesis testing framework cache/database files — machine-specific, non-deterministic, and never previously tracked in this repo. No merged PR in the repo history has ever included .hypothesis/ files.

These must be removed from the PR and .hypothesis/ should be added to .gitignore. They inflate the diff from ~30 substantive changed lines to ~3000 additions and will create merge noise going forward.

Fix:

git rm -r --cached .hypothesis/
echo '.hypothesis/' >> .gitignore
git add .gitignore
git commit -m "Remove .hypothesis cache files and add to .gitignore"

Non-blocking observations

The core code changes (model expansion, prompt enhancement, aggregation refactor) are well-structured. Specific observations:

2. _aggregate_review_verdicts — advisory collection is outside the needs_revision guard (correct but worth noting)

In the aggregation function (pipelines.py:1812-1824), the advisory content collection block runs for ALL verdicts (not just needs_revision), which is the stated intent. Both call sites (lines ~3541 and ~5498) correctly use named field access (agg_result.verdict, agg_result.blocking_feedback, agg_result.advisory_content). The NamedTuple design is clean and the return type change from tuple[str, str] to AggregatedReviewResult is handled correctly at all call sites.

3. Advisory content logging truncates at 500 chars

Both call sites log advisory_preview=agg_result.advisory_content[:500]. With multiple reviewers each providing 200-500 words of analysis, the full advisory content could easily exceed this. Consider whether the truncation is too aggressive for observability purposes, or whether a separate debug-level log with the full content would be useful. Not blocking — the current approach is reasonable for structured log storage.

4. AggregatedReviewResult as NamedTuple — good backward compatibility

The choice to use NamedTuple rather than a plain dataclass means the result can still be accessed positionally (e.g., result[0]). The test at test_multi_reviewer.py line 634 explicitly verifies positional access. This is a good defensive design choice given the function signature change.

5. Test coverage is thorough

The TestBuildReviewPrompt class (22 tests) and TestProductionAggregateReviewVerdicts class (14 tests) cover the key scenarios: backward compatibility with old JSON, new field parsing, empty/None verdicts, feedback fallback to summary, delta review, draft vs non-draft paths, and reviewer-type-specific prompt content. The backward compat tests for ReviewVerdict construction with only verdict field (minimal construction) are particularly valuable.

6. conftest.py docker mock improvement is good

The previous pass on ImportError for docker left exception classes undefined, which could cause except NotFound to fail. The new mock with real exception hierarchy (_DockerException → _APIError → _NotFound) is correct. One minor observation: sys.modules.setdefault is used, which means if a test module already injected a different docker mock, it won't be overwritten. This is the right behavior for conftest.

7. Documentation is consistent with code changes

docs/guides/sdlc-pipeline.md accurately reflects the new verdict schema fields, field guidelines, and review conventions. The orchestrator/README.md update to the directory listing is minimal and correct.

— Authored by egg

james-in-a-box

Contract Verification: All 14 Acceptance Criteria Verified

All tasks and acceptance criteria from the architect plan are met. The implementation follows Approach A (Expand Verdict Schema + Align Prompts) as recommended.

Criterion-by-Criterion Verification

#	Criterion	Status	Evidence
AC-1	ReviewVerdict has `analysis` (str, default='') and `suggestions` (str, default='') fields	Verified	`orchestrator/models.py:102-109` — Both fields with `Field(default="")` and descriptive annotations
AC-2	AggregatedReviewResult NamedTuple with verdict, blocking_feedback, advisory_content	Verified	`orchestrator/models.py:114-126` — NamedTuple with all three fields and docstring
AC-3	`_aggregate_review_verdicts()` returns AggregatedReviewResult collecting from ALL verdicts	Verified	`pipelines.py:1784-1832` — Advisory sections (lines 1815-1824) collected outside the needs_revision block, covering all verdicts
AC-4	Advisory content NOT injected into retry prompt prior_feedback	Verified	Both callers (lines 3566-3571, 5542-5547) use only `agg_result.blocking_feedback`; advisory is logged only
AC-5	Verdict format includes analysis/suggestions with always-populate instructions	Verified	`pipelines.py:1617-1630` — JSON template includes both fields; field guidelines say "Always provide detailed analysis regardless of verdict"
AC-6	'empty if approved' language removed	Verified	Grep confirms "empty if approved" not present. New feedback guidance: "Reserved for blocking issues only — Leave empty when approving" (semantically correct)
AC-7	Review conventions (5 quality standards) in SDLC prompts	Verified	`pipelines.py:1558-1587` — "## Review Conventions" with: comprehensive, specific, direct, suggest fixes, provide context
AC-8	Code reviewer 'critical infrastructure' framing	Verified	`pipelines.py:1560-1564` — "critical part of the engineering infrastructure — the last line of defense before code reaches production"
AC-9	Scope preambles with analysis depth per reviewer type	Verified	Code: "file-by-file" (L1025), Contract: "criterion-by-criterion" (L1034), Refine: "section-by-section" (L1043), Plan: "section-by-section" (L1052)
AC-10	Agent-design exempt from detailed analysis	Verified	`pipelines.py:1013-1015` — "a brief approval is acceptable — you do not need to produce a lengthy analysis when there are no concerns"
AC-11	Draft-based reviewer steps expanded from 4 to 6-7	Verified	`pipelines.py:1538-1546` — 7 steps: read thoroughly, cross-reference, cite evidence, evaluate completeness, assess quality, write verdict, commit
AC-12	Backward compatibility for old verdict JSON	Verified	Pydantic defaults handle missing fields. Test `test_backward_compat_old_json_without_new_fields` confirms old JSON parses with empty defaults
AC-13	All existing tests pass	Verified	6767 passed, 85 skipped, 4 warnings. Lint (ruff, mypy) clean. No regressions
AC-14	New tests cover aggregation + backward compat	Verified	`test_multi_reviewer.py`: advisory collection, AggregatedReviewResult fields, backward compat. `test_tier3_execute.py`: new-field parsing, old-format compat. `test_pipeline_prompts.py`: 22 tests for verdict format, conventions, preambles, field guidelines

Task Verification Summary

All 10 implementation tasks across 4 phases are complete:

Phase 1 (Model): ReviewVerdict expanded with analysis/suggestions fields (Task 1)
Phase 2 (Aggregation): AggregatedReviewResult NamedTuple created, aggregation updated, both callers destructure correctly (Tasks 2-3)
Phase 3 (Prompts): Verdict format updated, review conventions added, scope preambles enriched, draft reviewer steps expanded (Tasks 4-7)
Phase 4 (Tests): Multi-reviewer aggregation tests, tier-3 execution tests, and prompt builder tests all updated (Tasks 8-10)

Architecture Decision Compliance

All 6 architecture decisions from the plan are followed:

AD-1: Free-form markdown for analysis field ✓
AD-2: Agent-design reviewer exempt from detailed analysis ✓
AD-3: NamedTuple return type (not expanded tuple) ✓
AD-4: Conventions inlined in prompt builder (not loaded from action/) ✓
AD-5: Advisory content not surfaced to retry prompt ✓
AD-6: No migration of existing verdict files ✓

Contract Integrity Note

The contract file at .egg-state/contracts/897.json has an empty acceptance_criteria array — the criteria were defined in the architect output but not propagated to the contract. This is a pipeline tooling gap (not a code issue). Verification was performed against the architect output's 14 acceptance criteria, which is the authoritative source.

Verdict

Approve — All acceptance criteria met. Implementation is clean, backward compatible, and well-tested.

— Authored by egg

These are Hypothesis testing framework cache/database files that are machine-specific and non-deterministic. They should not be tracked.

egg-reviewer

No agent-mode design concerns (re-review).

The only change since my last review (e8b3c89) is commit b5f3493 which removes the accidentally committed .hypothesis/ cache files and adds .hypothesis/ to .gitignore. This addresses the blocking issue from the general code review — no agent-mode design implications.

My previous agent-mode design approval stands unchanged.

— Authored by egg

egg-reviewer

Re-review: Deepen SDLC review quality with analysis and suggestions fields

Previous blocking issue: resolved

The .hypothesis/constants/ cache files (198 files + 2 binary unicode data files) have been removed from the PR via git rm -r --cached, and .hypothesis/ has been added to .gitignore in the correct location (under the testing/coverage section). The fix matches the recommendation from the previous review exactly.

Delta summary

Single commit (b5f3493e): removes .hypothesis cache files and adds .hypothesis/ to .gitignore. No changes to the substantive code (models, pipelines, tests, docs) since the previous review.

Substantive code

The code changes were reviewed at e8b3c89 and had no blocking issues. The non-blocking observations from that review (advisory truncation at 500 chars, sys.modules.setdefault behavior in conftest) remain noted but are not blocking.

No new issues found.

— Authored by egg

james-in-a-box

Contract Re-verification: Incremental Review

Verdict: Approve (no contract violations)

Delta Since Last Review

One commit since e8b3c89: b5f3493 Remove .hypothesis cache files and add to .gitignore

This commit is housekeeping only — it removes Hypothesis testing framework cache files (machine-specific, non-deterministic) and adds .hypothesis/ to .gitignore. No contract-related source files, test files, or documentation were modified.

Contract Compliance Status

All 13 tasks across 3 phases remain fully implemented and verified:

Phase 1 — Model and Aggregation (5/5 tasks):

TASK-1-1: analysis and suggestions fields on ReviewVerdict — verified in orchestrator/models.py:97-111
TASK-1-2: AggregatedReviewResult NamedTuple — verified in orchestrator/models.py:114-126
TASK-1-3: _aggregate_review_verdicts() returns AggregatedReviewResult, collects advisory from all verdicts — verified in pipelines.py:1784-1832
TASK-1-4: Tier 3 caller uses named fields, only blocking_feedback in retry — verified
TASK-1-5: Main loop caller uses named fields, only blocking_feedback in retry — verified

Phase 2 — Prompt Builder Alignment (4/4 tasks):

TASK-2-1: Verdict JSON template has all 7 fields, "empty if approved" removed — verified (string does not appear in pipelines.py)
TASK-2-2: Review conventions section with 5 quality standards, "critical infrastructure" framing — verified
TASK-2-3: Per-type scope preambles (file-by-file, criterion-by-criterion, etc.) — verified
TASK-2-4: Expanded draft-reviewer procedural steps (7 steps) — verified

Phase 3 — Tests (4/4 tasks):

TASK-3-1: Backward-compat test in test_tier3_execute.py — verified (test_backward_compat_old_json_without_new_fields)
TASK-3-2: TestBuildReviewPrompt class in test_pipeline_prompts.py with 20+ test methods — verified
TASK-3-3: TestProductionAggregateReviewVerdicts in test_multi_reviewer.py with 15 test methods — verified
TASK-3-4: Full test suite passed (6767 passed, per prior review) — no regressions from delta commit

Regression Check

The .hypothesis cleanup commit does not touch any of the following contract-scoped files:

orchestrator/models.py
orchestrator/routes/pipelines.py
orchestrator/tests/test_tier3_execute.py
orchestrator/tests/test_pipeline_prompts.py
tests/workflows/test_multi_reviewer.py
docs/guides/sdlc-pipeline.md

No regressions. No contract violations. All previously verified criteria still hold.

Note on Contract Structure

The formal contract file (897.json) has empty acceptance_criteria and phases arrays — the structured tasks from the plan's yaml-tasks block were not parsed into the contract during pipeline setup. This is a known pipeline infrastructure gap (not a PR issue). Verification was performed against the plan-defined tasks.

— Authored by egg

james-in-a-box · 2026-02-24T09:00:51Z

egg is addressing review feedback...

james-in-a-box · 2026-02-24T09:03:10Z

egg feedback addressed. View run logs

8 previous review(s) hidden.

egg-orchestrator added 15 commits February 24, 2026 07:18

Initialize SDLC contract for issue #897

4958dce

Add analysis for issue #897: SDLC reviewer shallow analysis

524ec12

Add agent-design review verdict for issue #897 refine phase

c2b9f19

Add refine review verdict for issue #897 refine phase

789b546

Authored-by: egg

Persist statefiles after refine phase

075b816

Add architect analysis for issue #897: deepen SDLC review quality

d89c8cb

WIP: auto-commit uncommitted work (architect) [issue-897]

e600e2a

Container 9b1c05f58615245f7dd6774518c44fb90d3f811b8a1c539ada5df9eaba8d091f exited with uncommitted changes. This commit preserves the agent's work-in-progress. Authored-by: egg

Add implementation plan for issue #897: deepen SDLC review quality

35eb1dd

Add gap tests for SDLC review depth improvements

6d3beea

WIP: auto-commit uncommitted work (tester) [issue-897]

fa657dd

Container 004a65f8ea27287f969ec30e7caf40ed62e070578cedaeb6461e6275eec2f497 exited with uncommitted changes. This commit preserves the agent's work-in-progress. Authored-by: egg

docs: Update review verdict schema, aggregation, and conventions for …

020c122

…deeper SDLC reviews

Fix merge conflicts and lint issues in test files

224a06c

Persist statefiles after implement phase

2172dda

Fix minor formatting in pipelines.py

e8b3c89

egg-reviewer bot approved these changes Feb 24, 2026

View reviewed changes