Skip to content

Comments

Deepen SDLC review quality with analysis and suggestions fields#899

Merged
jwbron merged 16 commits intomainfrom
egg/issue-897/work
Feb 24, 2026
Merged

Deepen SDLC review quality with analysis and suggestions fields#899
jwbron merged 16 commits intomainfrom
egg/issue-897/work

Conversation

@james-in-a-box
Copy link
Contributor

Summary

Deepen SDLC pipeline review quality to match PR reviewer thoroughness.

SDLC reviewers were producing shallow ~2 sentence verdicts with empty feedback even when approving, while PR reviewers on the same code produced detailed file-by-file analysis, found specific issues, and drove follow-up fixes. The root cause was threefold: the verdict schema had "feedback": "empty if approved", the SDLC prompts lacked the review conventions and procedural rigor of the PR prompts, and there was no way to surface non-blocking observations.

This PR addresses all three gaps:

  • Expanded verdict schema: Added analysis (always populated) and suggestions (non-blocking, even when approving) fields to ReviewVerdict. Introduced AggregatedReviewResult NamedTuple that separates blocking feedback from advisory content.
  • Enhanced review prompts: Added review conventions section (comprehensive, specific, direct, suggest fixes, provide context) to all SDLC reviewers. Code reviewers get "last line of defense" framing. Each reviewer type gets analysis format guidance (file-by-file for code, criterion-by-criterion for contract, section-by-section for plan/refine). Draft-based reviewers get expanded procedural steps requiring cross-referencing and evidence citing.
  • Advisory content logging: Aggregation logic now collects analysis and suggestions from ALL verdicts (including approved) and logs them for observability, while only passing blocking feedback to revision cycles.

Backward compatible — old verdict JSON without the new fields parses correctly via Pydantic defaults.

Closes #897

Test plan

  • Run pytest orchestrator/tests/test_pipeline_prompts.py -v — verify new TestBuildReviewPrompt class passes (22 tests covering verdict format, conventions, preambles, delta review, field guidelines)
  • Run pytest orchestrator/tests/test_tier3_execute.py -v — verify backward compat and new-field parsing tests pass
  • Run pytest tests/workflows/test_multi_reviewer.py -v — verify aggregation tests including advisory content collection
  • Run full pytest orchestrator/tests/ tests/workflows/ to confirm no regressions
  • Verify that the sdlc-pipeline.md docs accurately reflect the new schema and conventions

Authored-by: egg

egg-orchestrator added 15 commits February 24, 2026 07:18
Container 9b1c05f58615245f7dd6774518c44fb90d3f811b8a1c539ada5df9eaba8d091f exited with uncommitted changes.
This commit preserves the agent's work-in-progress.

Authored-by: egg
Container 004a65f8ea27287f969ec30e7caf40ed62e070578cedaeb6461e6275eec2f497 exited with uncommitted changes.
This commit preserves the agent's work-in-progress.

Authored-by: egg
When the docker SDK is not installed, other test files inject MagicMock
into sys.modules["docker"], causing docker_client.py's
`from docker.errors import NotFound, DockerException` to bind MagicMock
objects instead of real exception classes. This breaks `except NotFound`
clauses at runtime (TypeError: catching classes that do not inherit from
BaseException).

The conftest now creates proper exception class stubs matching the docker
SDK hierarchy and registers them in sys.modules before test collection,
so that both the source module and tests use real exception classes.

Fixes 2 failing tests:
- TestDockerClientConnection.test_is_connected_false
- TestContainerOperations.test_start_container_not_found
Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No agent-mode design concerns.

The structured verdict JSON (with new analysis, suggestions, feedback fields) has a legitimate downstream consumer — the _aggregate_review_verdicts() function programmatically separates blocking vs. advisory content. This is machine-to-machine parsing, which is the appropriate use case for structured output per the design guidelines.

The review conventions added to prompts describe quality attributes ("be comprehensive", "be specific", "suggest fixes") and analysis format expectations ("file-by-file", "criterion-by-criterion") rather than rigid step-by-step procedures. The expanded procedural steps for draft-based reviewers are methodological guidance that helps the agent understand what thorough review looks like for each reviewer type — orientation, not constraint.

The agent-design reviewer preamble change (allowing brief approvals when no concerns exist) is a positive alignment — it avoids demanding unnecessary ceremony from the agent.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Deepen SDLC review quality with analysis and suggestions fields

Blocking Issue

1. 198 .hypothesis/constants/ files committed (must fix)

This PR introduces 198 new files under .hypothesis/constants/. These are Hypothesis testing framework cache/database files — machine-specific, non-deterministic, and never previously tracked in this repo. No merged PR in the repo history has ever included .hypothesis/ files.

These must be removed from the PR and .hypothesis/ should be added to .gitignore. They inflate the diff from ~30 substantive changed lines to ~3000 additions and will create merge noise going forward.

Fix:

git rm -r --cached .hypothesis/
echo '.hypothesis/' >> .gitignore
git add .gitignore
git commit -m "Remove .hypothesis cache files and add to .gitignore"

Non-blocking observations

The core code changes (model expansion, prompt enhancement, aggregation refactor) are well-structured. Specific observations:

2. _aggregate_review_verdicts — advisory collection is outside the needs_revision guard (correct but worth noting)

In the aggregation function (pipelines.py:1812-1824), the advisory content collection block runs for ALL verdicts (not just needs_revision), which is the stated intent. Both call sites (lines ~3541 and ~5498) correctly use named field access (agg_result.verdict, agg_result.blocking_feedback, agg_result.advisory_content). The NamedTuple design is clean and the return type change from tuple[str, str] to AggregatedReviewResult is handled correctly at all call sites.

3. Advisory content logging truncates at 500 chars

Both call sites log advisory_preview=agg_result.advisory_content[:500]. With multiple reviewers each providing 200-500 words of analysis, the full advisory content could easily exceed this. Consider whether the truncation is too aggressive for observability purposes, or whether a separate debug-level log with the full content would be useful. Not blocking — the current approach is reasonable for structured log storage.

4. AggregatedReviewResult as NamedTuple — good backward compatibility

The choice to use NamedTuple rather than a plain dataclass means the result can still be accessed positionally (e.g., result[0]). The test at test_multi_reviewer.py line 634 explicitly verifies positional access. This is a good defensive design choice given the function signature change.

5. Test coverage is thorough

The TestBuildReviewPrompt class (22 tests) and TestProductionAggregateReviewVerdicts class (14 tests) cover the key scenarios: backward compatibility with old JSON, new field parsing, empty/None verdicts, feedback fallback to summary, delta review, draft vs non-draft paths, and reviewer-type-specific prompt content. The backward compat tests for ReviewVerdict construction with only verdict field (minimal construction) are particularly valuable.

6. conftest.py docker mock improvement is good

The previous pass on ImportError for docker left exception classes undefined, which could cause except NotFound to fail. The new mock with real exception hierarchy (_DockerException_APIError_NotFound) is correct. One minor observation: sys.modules.setdefault is used, which means if a test module already injected a different docker mock, it won't be overwritten. This is the right behavior for conftest.

7. Documentation is consistent with code changes

docs/guides/sdlc-pipeline.md accurately reflects the new verdict schema fields, field guidelines, and review conventions. The orchestrator/README.md update to the directory listing is minimal and correct.

— Authored by egg

Copy link
Contributor Author

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contract Verification: All 14 Acceptance Criteria Verified

All tasks and acceptance criteria from the architect plan are met. The implementation follows Approach A (Expand Verdict Schema + Align Prompts) as recommended.

Criterion-by-Criterion Verification

# Criterion Status Evidence
AC-1 ReviewVerdict has analysis (str, default='') and suggestions (str, default='') fields Verified orchestrator/models.py:102-109 — Both fields with Field(default="") and descriptive annotations
AC-2 AggregatedReviewResult NamedTuple with verdict, blocking_feedback, advisory_content Verified orchestrator/models.py:114-126 — NamedTuple with all three fields and docstring
AC-3 _aggregate_review_verdicts() returns AggregatedReviewResult collecting from ALL verdicts Verified pipelines.py:1784-1832 — Advisory sections (lines 1815-1824) collected outside the needs_revision block, covering all verdicts
AC-4 Advisory content NOT injected into retry prompt prior_feedback Verified Both callers (lines 3566-3571, 5542-5547) use only agg_result.blocking_feedback; advisory is logged only
AC-5 Verdict format includes analysis/suggestions with always-populate instructions Verified pipelines.py:1617-1630 — JSON template includes both fields; field guidelines say "Always provide detailed analysis regardless of verdict"
AC-6 'empty if approved' language removed Verified Grep confirms "empty if approved" not present. New feedback guidance: "Reserved for blocking issues only — Leave empty when approving" (semantically correct)
AC-7 Review conventions (5 quality standards) in SDLC prompts Verified pipelines.py:1558-1587 — "## Review Conventions" with: comprehensive, specific, direct, suggest fixes, provide context
AC-8 Code reviewer 'critical infrastructure' framing Verified pipelines.py:1560-1564 — "critical part of the engineering infrastructure — the last line of defense before code reaches production"
AC-9 Scope preambles with analysis depth per reviewer type Verified Code: "file-by-file" (L1025), Contract: "criterion-by-criterion" (L1034), Refine: "section-by-section" (L1043), Plan: "section-by-section" (L1052)
AC-10 Agent-design exempt from detailed analysis Verified pipelines.py:1013-1015 — "a brief approval is acceptable — you do not need to produce a lengthy analysis when there are no concerns"
AC-11 Draft-based reviewer steps expanded from 4 to 6-7 Verified pipelines.py:1538-1546 — 7 steps: read thoroughly, cross-reference, cite evidence, evaluate completeness, assess quality, write verdict, commit
AC-12 Backward compatibility for old verdict JSON Verified Pydantic defaults handle missing fields. Test test_backward_compat_old_json_without_new_fields confirms old JSON parses with empty defaults
AC-13 All existing tests pass Verified 6767 passed, 85 skipped, 4 warnings. Lint (ruff, mypy) clean. No regressions
AC-14 New tests cover aggregation + backward compat Verified test_multi_reviewer.py: advisory collection, AggregatedReviewResult fields, backward compat. test_tier3_execute.py: new-field parsing, old-format compat. test_pipeline_prompts.py: 22 tests for verdict format, conventions, preambles, field guidelines

Task Verification Summary

All 10 implementation tasks across 4 phases are complete:

  • Phase 1 (Model): ReviewVerdict expanded with analysis/suggestions fields (Task 1)
  • Phase 2 (Aggregation): AggregatedReviewResult NamedTuple created, aggregation updated, both callers destructure correctly (Tasks 2-3)
  • Phase 3 (Prompts): Verdict format updated, review conventions added, scope preambles enriched, draft reviewer steps expanded (Tasks 4-7)
  • Phase 4 (Tests): Multi-reviewer aggregation tests, tier-3 execution tests, and prompt builder tests all updated (Tasks 8-10)

Architecture Decision Compliance

All 6 architecture decisions from the plan are followed:

  • AD-1: Free-form markdown for analysis field ✓
  • AD-2: Agent-design reviewer exempt from detailed analysis ✓
  • AD-3: NamedTuple return type (not expanded tuple) ✓
  • AD-4: Conventions inlined in prompt builder (not loaded from action/) ✓
  • AD-5: Advisory content not surfaced to retry prompt ✓
  • AD-6: No migration of existing verdict files ✓

Contract Integrity Note

The contract file at .egg-state/contracts/897.json has an empty acceptance_criteria array — the criteria were defined in the architect output but not propagated to the contract. This is a pipeline tooling gap (not a code issue). Verification was performed against the architect output's 14 acceptance criteria, which is the authoritative source.

Verdict

Approve — All acceptance criteria met. Implementation is clean, backward compatible, and well-tested.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

These are Hypothesis testing framework cache/database files that are
machine-specific and non-deterministic. They should not be tracked.
@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No agent-mode design concerns (re-review).

The only change since my last review (e8b3c89) is commit b5f3493 which removes the accidentally committed .hypothesis/ cache files and adds .hypothesis/ to .gitignore. This addresses the blocking issue from the general code review — no agent-mode design implications.

My previous agent-mode design approval stands unchanged.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Contributor

@egg-reviewer egg-reviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: Deepen SDLC review quality with analysis and suggestions fields

Previous blocking issue: resolved

The .hypothesis/constants/ cache files (198 files + 2 binary unicode data files) have been removed from the PR via git rm -r --cached, and .hypothesis/ has been added to .gitignore in the correct location (under the testing/coverage section). The fix matches the recommendation from the previous review exactly.

Delta summary

Single commit (b5f3493e): removes .hypothesis cache files and adds .hypothesis/ to .gitignore. No changes to the substantive code (models, pipelines, tests, docs) since the previous review.

Substantive code

The code changes were reviewed at e8b3c89 and had no blocking issues. The non-blocking observations from that review (advisory truncation at 500 chars, sys.modules.setdefault behavior in conftest) remain noted but are not blocking.

No new issues found.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Contributor Author

@james-in-a-box james-in-a-box bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contract Re-verification: Incremental Review

Verdict: Approve (no contract violations)

Delta Since Last Review

One commit since e8b3c89: b5f3493 Remove .hypothesis cache files and add to .gitignore

This commit is housekeeping only — it removes Hypothesis testing framework cache files (machine-specific, non-deterministic) and adds .hypothesis/ to .gitignore. No contract-related source files, test files, or documentation were modified.

Contract Compliance Status

All 13 tasks across 3 phases remain fully implemented and verified:

Phase 1 — Model and Aggregation (5/5 tasks):

  • TASK-1-1: analysis and suggestions fields on ReviewVerdict — verified in orchestrator/models.py:97-111
  • TASK-1-2: AggregatedReviewResult NamedTuple — verified in orchestrator/models.py:114-126
  • TASK-1-3: _aggregate_review_verdicts() returns AggregatedReviewResult, collects advisory from all verdicts — verified in pipelines.py:1784-1832
  • TASK-1-4: Tier 3 caller uses named fields, only blocking_feedback in retry — verified
  • TASK-1-5: Main loop caller uses named fields, only blocking_feedback in retry — verified

Phase 2 — Prompt Builder Alignment (4/4 tasks):

  • TASK-2-1: Verdict JSON template has all 7 fields, "empty if approved" removed — verified (string does not appear in pipelines.py)
  • TASK-2-2: Review conventions section with 5 quality standards, "critical infrastructure" framing — verified
  • TASK-2-3: Per-type scope preambles (file-by-file, criterion-by-criterion, etc.) — verified
  • TASK-2-4: Expanded draft-reviewer procedural steps (7 steps) — verified

Phase 3 — Tests (4/4 tasks):

  • TASK-3-1: Backward-compat test in test_tier3_execute.py — verified (test_backward_compat_old_json_without_new_fields)
  • TASK-3-2: TestBuildReviewPrompt class in test_pipeline_prompts.py with 20+ test methods — verified
  • TASK-3-3: TestProductionAggregateReviewVerdicts in test_multi_reviewer.py with 15 test methods — verified
  • TASK-3-4: Full test suite passed (6767 passed, per prior review) — no regressions from delta commit

Regression Check

The .hypothesis cleanup commit does not touch any of the following contract-scoped files:

  • orchestrator/models.py
  • orchestrator/routes/pipelines.py
  • orchestrator/tests/test_tier3_execute.py
  • orchestrator/tests/test_pipeline_prompts.py
  • tests/workflows/test_multi_reviewer.py
  • docs/guides/sdlc-pipeline.md

No regressions. No contract violations. All previously verified criteria still hold.

Note on Contract Structure

The formal contract file (897.json) has empty acceptance_criteria and phases arrays — the structured tasks from the plan's yaml-tasks block were not parsed into the contract during pipeline setup. This is a known pipeline infrastructure gap (not a PR issue). Verification was performed against the plan-defined tasks.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box
Copy link
Contributor Author

egg is addressing review feedback...

@james-in-a-box
Copy link
Contributor Author

egg feedback addressed. View run logs

8 previous review(s) hidden.

@jwbron jwbron merged commit bbd0272 into main Feb 24, 2026
36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SDLC pipeline reviewers produce shallow analysis compared to PR reviewers

1 participant