From d1585c823f14429dd84ad4346c4b831716fb53e7 Mon Sep 17 00:00:00 2001
From: Nick Gomez
Date: Fri, 20 Feb 2026 15:39:38 -0800
Subject: [PATCH] Add QA subprocess loop and script-side template assembly for iteration scripts

Restructures /qa from in-process execution to a subprocess iteration loop (matching /implement's architecture) and moves template assembly from the orchestrator into the shell scripts themselves.

Co-Authored-By: Claude Opus 4.6
---
 plugins/eng/skills/implement/SKILL.md         |  48 +--
 .../eng/skills/implement/scripts/implement.sh | 172 +++++++-
 .../templates/implement-prompt.template.md    |  21 +-
 plugins/eng/skills/qa/SKILL.md                | 268 ++++++-------
 .../skills/qa/references/testing-guidance.md  | 170 ++++++++
 plugins/eng/skills/qa/scripts/qa.sh           | 376 ++++++++++++++++++
 plugins/eng/skills/qa/scripts/validate-qa.ts  | 299 ++++++++++++++
 .../skills/qa/templates/qa-prompt.template.md | 261 ++++++++++++
 plugins/eng/skills/ship/SKILL.md              |  17 +-
 9 files changed, 1429 insertions(+), 203 deletions(-)
 create mode 100644 plugins/eng/skills/qa/references/testing-guidance.md
 create mode 100755 plugins/eng/skills/qa/scripts/qa.sh
 create mode 100644 plugins/eng/skills/qa/scripts/validate-qa.ts
 create mode 100644 plugins/eng/skills/qa/templates/qa-prompt.template.md

diff --git a/plugins/eng/skills/implement/SKILL.md b/plugins/eng/skills/implement/SKILL.md
index 5662640d..9385cce1 100644
--- a/plugins/eng/skills/implement/SKILL.md
+++ b/plugins/eng/skills/implement/SKILL.md
@@ -428,53 +428,37 @@ Put spec.json at `tmp/ship/spec.json`. Create `tmp/ship/` if it doesn't exist (`
 If on `main` or `master`, warn before proceeding — the implement skill should normally run on a feature branch. If no branching model exists (e.g., container environment with no PR workflow), proceed with caution and ensure commits are isolated.
-### Craft the implementation prompt
+### Template assembly (handled by implement.sh)
 
-**Load:** `templates/implement-prompt.template.md`
+The `implement.sh` script handles all template assembly automatically. It reads `templates/implement-prompt.template.md`, selects the correct variant based on `--spec-path`, extracts `implementationContext` from spec.json, computes a fresh git diff each iteration, and fills all `{{PLACEHOLDERS}}`.
 
-The template contains two complete prompt variants with `{{PLACEHOLDER}}` syntax. Choose ONE variant and fill all placeholders.
+**Your job in Phase 2:** Ensure spec.json has a rich `implementationContext` field — this becomes the iteration agent's codebase context. Include specific patterns, shared vocabulary, abstractions in the area being modified, and repo conventions from CLAUDE.md (testing patterns, file locations, formatting). See Phase 1's implementationContext guidance for details.
 
-**Choose variant:**
-- **Variant A** — when a SPEC.md path is available (directly provided or from Phase 1)
-- **Variant B** — when only spec.json is available (no SPEC.md)
+**You do NOT need to:** read the template, select a variant, fill placeholders, or save a prompt file. The script does all of this.
 
-**Conditionality lives HERE (in Phase 2 construction), NOT in the iteration prompt.** The iteration agent sees a single, unconditional workflow — never both variants, never conditional "if spec is available" logic.
+### Copy artifacts to execution location
 
-**Fill `{{CODEBASE_CONTEXT}}`:** Include the specific patterns, shared vocabulary, and abstractions in the area being modified — more actionable than generic CLAUDE.md guidance. Examples: "The API follows RESTful patterns under /api/tasks/", "Auth uses tenant-scoped middleware in auth.ts", "Data access uses the repository pattern in data-access/". Also include repo conventions from CLAUDE.md (testing patterns, file locations, formatting) that the iteration agent needs.
-
-**Fill quality gate commands:** Use the commands from Inputs (defaults: `pnpm typecheck`, `pnpm lint`, `pnpm test --run`) — override with `--typecheck-cmd`, `--lint-cmd`, `--test-cmd` if provided.
-
-**Fill `{{SPEC_PATH}}`** (Variant A only): Use a path relative to the working directory (e.g., `.claude/specs/my-feature/SPEC.md`). Relative paths work across execution contexts (host, Docker, worktree). Do NOT use absolute paths — they break when the prompt is executed in a different environment. Do NOT embed spec content in the prompt — the iteration agent reads it via the Read tool each iteration.
-
-### Save the prompt
-
-Save the crafted implementation prompt to `tmp/ship/implement-prompt.md`. This file is consumed by Phase 3 (`scripts/implement.sh`) for automated execution, or by the user for manual iteration (`claude -p`).
-
-### Copy implement.sh to execution location
-
-Copy the skill's canonical `scripts/implement.sh` to `tmp/ship/implement.sh` in the working directory and make it executable:
+Copy the skill's canonical `scripts/implement.sh` and template to `tmp/ship/` in the working directory and make the script executable:
 
 ```bash
 cp /scripts/implement.sh tmp/ship/implement.sh
+cp /templates/implement-prompt.template.md tmp/ship/implement-prompt.template.md
 chmod +x tmp/ship/implement.sh
 ```
 
-This places the iteration loop script alongside the implementation prompt (`tmp/ship/implement-prompt.md`) as a paired execution artifact. The copy enables:
+This places the iteration loop script and template as paired execution artifacts. The copy enables:
 
 - **Manual execution** — users can run `tmp/ship/implement.sh --force` directly without knowing the skill's install path
 - **Docker execution** — containers access `tmp/ship/implement.sh` via bind mount (see `references/execution.md`)
 
-Phase 3 on host uses the skill's own `scripts/implement.sh` directly (validate-spec.ts is available next to it). The `tmp/ship/implement.sh` copy is for Docker, manual, and external execution contexts.
+Phase 3 on host uses the skill's own `scripts/implement.sh` directly (template is auto-detected next to it). The `tmp/ship/` copies are for Docker, manual, and external execution contexts.
 
 ### Phase 2 checklist
 
 - [ ] spec.json validated against SPEC.md (if available)
 - [ ] spec.json schema validated (via `scripts/validate-spec.ts` if bun available)
 - [ ] Stories correctly scoped, ordered, and with verifiable criteria
-- [ ] Implementation prompt crafted from template with correct variant (A or B)
-- [ ] All `{{PLACEHOLDERS}}` filled (spec path, quality gates, codebase context)
-- [ ] Completion signal present in saved prompt (included in template — verify not accidentally removed)
-- [ ] Prompt saved to `tmp/ship/implement-prompt.md`
-- [ ] `implement.sh` copied to `tmp/ship/implement.sh` and made executable
+- [ ] `implementationContext` is rich and actionable (architecture, constraints, key decisions, codebase patterns)
+- [ ] `implement.sh` and template copied to `tmp/ship/` and made executable
 
 ---
 
@@ -566,10 +550,12 @@ Count the incomplete stories in spec.json and select parameters:
 
 Run in background to avoid the Bash tool's 600-second timeout.
 
+Pass `--spec-path` when a SPEC.md is available (selects Variant A of the prompt template). Pass quality gate commands if overridden from defaults.
+
 **Host execution (default):**
 
 ```
-Bash(command: "/scripts/implement.sh --max-iterations --max-turns --force",
+Bash(command: "/scripts/implement.sh --max-iterations --max-turns --force --spec-path --typecheck-cmd '' --lint-cmd '' --test-cmd ''",
      run_in_background: true, description: "Implement execution run 1")
 ```
 
@@ -577,12 +563,14 @@ Bash(command: "/scripts/implement.sh --max-iterations --max-t
 
 **Docker execution (when `--docker` was passed — compose file resolved in Step 3):**
 
 ```
-Bash(command: "docker compose -f exec sandbox tmp/ship/implement.sh --max-iterations --max-turns --force",
+Bash(command: "docker compose -f exec sandbox tmp/ship/implement.sh --max-iterations --max-turns --force --template tmp/ship/implement-prompt.template.md --spec-path --typecheck-cmd '' --lint-cmd '' --test-cmd ''",
      run_in_background: true, description: "Implement Docker execution run 1")
 ```
 
-Always pass `--force` — background execution has no TTY for interactive prompts.
+Always pass `--force` — background execution has no TTY for interactive prompts. In Docker, pass `--template tmp/ship/implement-prompt.template.md` since the skill directory isn't available inside the container.
+
+Omit `--spec-path` when no SPEC.md is available (selects Variant B). Omit quality gate args to use defaults (`pnpm typecheck`, `pnpm lint`, `pnpm test --run`).
 
 Poll for completion using `TaskOutput(block: false)` at intervals. While waiting, do lightweight work (re-read spec, review task list) but do NOT make code changes that could conflict.
 
diff --git a/plugins/eng/skills/implement/scripts/implement.sh b/plugins/eng/skills/implement/scripts/implement.sh
index d2c78084..9ee1645d 100755
--- a/plugins/eng/skills/implement/scripts/implement.sh
+++ b/plugins/eng/skills/implement/scripts/implement.sh
@@ -4,6 +4,14 @@ set -e
 
 # Implement Iteration Loop
 # Spawns independent Claude Code processes to implement user stories from spec.json.
 # Each iteration is a fresh process with no memory — state persists via files and git.
+#
+# This script also assembles the iteration prompt from the template by injecting:
+# - {{SPEC_PATH}} from --spec-path argument
+# - {{TYPECHECK_CMD}} from --typecheck-cmd argument (default: pnpm typecheck)
+# - {{LINT_CMD}} from --lint-cmd argument (default: pnpm lint)
+# - {{TEST_CMD}} from --test-cmd argument (default: pnpm test --run)
+# - {{CODEBASE_CONTEXT}} from spec.json implementationContext
+# - {{DIFF}} cleaned git diff (full or stat tree depending on size)
 
 # --- Ship directory (configurable via CLAUDE_SHIP_DIR env var) ---
 SHIP_DIR="${CLAUDE_SHIP_DIR:-tmp/ship}"
@@ -11,16 +19,30 @@ SHIP_DIR="${CLAUDE_SHIP_DIR:-tmp/ship}"
 # --- Defaults ---
 MAX_ITERATIONS=10
 MAX_TURNS=75
-PROMPT_FILE="$SHIP_DIR/implement-prompt.md"
+TEMPLATE_FILE=""
 SPEC_FILE="$SHIP_DIR/spec.json"
+PROMPT_FILE="$SHIP_DIR/implement-prompt.md"
 PROGRESS_FILE="$SHIP_DIR/progress.txt"
+DIFF_FILE="$SHIP_DIR/implement-diff.txt"
+SPEC_PATH=""
+TYPECHECK_CMD="pnpm typecheck"
+LINT_CMD="pnpm lint"
+TEST_CMD="pnpm test --run"
 FORCE=false
 PROTECTED_BRANCHES="main master"
 
+# ~5-10K tokens ≈ 30K characters
+DIFF_MAX_CHARS=30000
+
 # --- Script paths ---
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 VALIDATE_SCRIPT="$SCRIPT_DIR/validate-spec.ts"
 
+# --- Noise filters (same as ship-stop-hook.sh) ---
+FILTER_LOCK='(pnpm-lock\.yaml|package-lock\.json|yarn\.lock|\.lock$)'
+FILTER_SHIP='^.. tmp/ship/'
+FILTER_BUILD='(dist|\.next|build|\.turbo|node_modules)/'
+
 # --- Colors ---
 RED='\033[0;31m'
 GREEN='\033[0;32m'
@@ -35,8 +57,12 @@ usage() {
   echo "Options:"
   echo "  --max-iterations N   Max iteration loops (default: 10)"
   echo "  --max-turns N        Max agentic turns per iteration (default: 75)"
-  echo "  --prompt FILE        Prompt file path (default: tmp/ship/implement-prompt.md)"
-  echo "  --spec FILE          Spec JSON file path (default: tmp/ship/spec.json)"
+  echo "  --template FILE      Prompt template file (auto-detected from skill if omitted)"
+  echo "  --spec FILE          Spec JSON file path (default: $SHIP_DIR/spec.json)"
+  echo "  --spec-path PATH     Path to SPEC.md (for template variant A)"
+  echo "  --typecheck-cmd CMD  Typecheck command (default: pnpm typecheck)"
+  echo "  --lint-cmd CMD       Lint command (default: pnpm lint)"
+  echo "  --test-cmd CMD       Test command (default: pnpm test --run)"
   echo "  --force              Skip uncommitted changes prompt"
   echo "  --create-branch, -b  Create/checkout branch from spec.json branchName"
   echo "  -h, --help           Show this help"
@@ -49,8 +75,12 @@ while [[ $# -gt 0 ]]; do
   case $1 in
     --max-iterations) MAX_ITERATIONS="$2"; shift 2 ;;
     --max-turns) MAX_TURNS="$2"; shift 2 ;;
-    --prompt) PROMPT_FILE="$2"; shift 2 ;;
+    --template) TEMPLATE_FILE="$2"; shift 2 ;;
     --spec) SPEC_FILE="$2"; shift 2 ;;
+    --spec-path) SPEC_PATH="$2"; shift 2 ;;
+    --typecheck-cmd) TYPECHECK_CMD="$2"; shift 2 ;;
+    --lint-cmd) LINT_CMD="$2"; shift 2 ;;
+    --test-cmd) TEST_CMD="$2"; shift 2 ;;
     --force) FORCE=true; shift ;;
     --create-branch|-b) CREATE_BRANCH=true; shift ;;
     -h|--help) usage ;;
@@ -58,6 +88,11 @@ while [[ $# -gt 0 ]]; do
   esac
 done
 
+# --- Auto-detect template if not provided ---
+if [[ -z "$TEMPLATE_FILE" ]]; then
+  TEMPLATE_FILE="$SCRIPT_DIR/../templates/implement-prompt.template.md"
+fi
+
 echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
 echo -e "${BLUE} Implement Iteration Loop${NC}"
 echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
@@ -116,14 +151,13 @@ fi
 echo -e "${YELLOW}Branch:${NC} $CURRENT_BRANCH"
 echo -e "${YELLOW}Max iterations:${NC} $MAX_ITERATIONS"
 echo -e "${YELLOW}Max turns:${NC} $MAX_TURNS"
-echo -e "${YELLOW}Prompt:${NC} $PROMPT_FILE"
+echo -e "${YELLOW}Template:${NC} $TEMPLATE_FILE"
 echo -e "${YELLOW}Spec JSON:${NC} $SPEC_FILE"
 echo ""
 
 # --- Check required files ---
-if [[ ! -f "$PROMPT_FILE" ]]; then
-  echo -e "${RED}Error: Prompt file not found: $PROMPT_FILE${NC}"
-  echo "Run /implement to generate the implementation prompt first."
+if [[ ! -f "$TEMPLATE_FILE" ]]; then
+  echo -e "${RED}Error: Template file not found: $TEMPLATE_FILE${NC}"
   exit 1
 fi
@@ -160,6 +194,121 @@ if [[ ! -f "$PROGRESS_FILE" ]]; then
   echo "" >> "$PROGRESS_FILE"
 fi
 
+# --- Compute cleaned diff (full or stat tree based on size) ---
+compute_diff() {
+  local MERGE_BASE
+  MERGE_BASE=$(git merge-base main HEAD 2>/dev/null || echo "HEAD~10")
+
+  # Full cleaned diff
+  local FULL_DIFF
+  FULL_DIFF=$(git diff "$MERGE_BASE"...HEAD \
+    -- ':!*.lock' ':!*lock.json' ':!*lock.yaml' \
+    ':!tmp/' ':!dist/' ':!build/' ':!.next/' ':!.turbo/' ':!node_modules/' \
+    2>/dev/null || echo "")
+
+  local DIFF_SIZE=${#FULL_DIFF}
+
+  if [[ "$DIFF_SIZE" -le "$DIFF_MAX_CHARS" && "$DIFF_SIZE" -gt 0 ]]; then
+    # Small enough — use full diff
+    echo "# Git diff (full — $(echo "$FULL_DIFF" | wc -l | tr -d ' ') lines)"
+    echo ""
+    echo "$FULL_DIFF"
+  elif [[ "$DIFF_SIZE" -gt "$DIFF_MAX_CHARS" ]]; then
+    # Too large — fall back to stat tree
+    echo "# Git diff (stat tree — full diff too large at ~$((DIFF_SIZE / 4)) tokens)"
+    echo ""
+    echo "Full diff exceeds token budget. Showing file-level summary."
+    echo "Read specific files with the Read tool for details."
+    echo ""
+    git diff --stat "$MERGE_BASE"...HEAD \
+      -- ':!*.lock' ':!*lock.json' ':!*lock.yaml' \
+      ':!tmp/' ':!dist/' ':!build/' ':!.next/' ':!.turbo/' ':!node_modules/' \
+      2>/dev/null || echo "(no stat available)"
+  else
+    echo "# Git diff"
+    echo ""
+    echo "(no changes detected against main)"
+  fi
+}
+
+# --- Assemble prompt from template ---
+assemble_prompt() {
+  # Extract implementationContext from spec.json
+  local IMPL_CONTEXT
+  if command -v jq &> /dev/null; then
+    IMPL_CONTEXT=$(jq -r '.implementationContext // ""' "$SPEC_FILE")
+  else
+    IMPL_CONTEXT=""
+  fi
+
+  # Compute fresh diff
+  local DIFF
+  DIFF=$(compute_diff)
+
+  # Write diff file for reference
+  echo "$DIFF" > "$DIFF_FILE"
+
+  # Write content to temp files for safe Python substitution
+  local TEMPLATE_TMP IMPL_TMP DIFF_TMP
+  TEMPLATE_TMP=$(mktemp)
+  IMPL_TMP=$(mktemp)
+  DIFF_TMP=$(mktemp)
+  cat "$TEMPLATE_FILE" > "$TEMPLATE_TMP"
+  echo "$IMPL_CONTEXT" > "$IMPL_TMP"
+  echo "$DIFF" > "$DIFF_TMP"
+
+  # Use python3 for variant selection + multi-line placeholder substitution
+  python3 -c "
+import sys
+
+with open('$TEMPLATE_TMP', 'r') as f:
+    template = f.read()
+with open('$IMPL_TMP', 'r') as f:
+    impl_context = f.read().strip()
+with open('$DIFF_TMP', 'r') as f:
+    diff = f.read().strip()
+
+spec_path = '$SPEC_PATH'
+
+# --- Variant selection ---
+# Select Variant A (with SPEC.md) or Variant B (without) based on --spec-path
+if spec_path:
+    marker_start = '## Variant A'
+    marker_end = '*End of Variant A*'
+else:
+    marker_start = '## Variant B'
+    marker_end = '*End of Variant B*'
+
+try:
+    start_idx = template.index(marker_start)
+    end_idx = template.index(marker_end)
+    section = template[start_idx:end_idx]
+    # Extract prompt content: between first --- and last --- in the section
+    first_hr = section.index('\n---\n') + 5
+    content = section[first_hr:].rstrip()
+    last_hr = content.rfind('\n---')
+    if last_hr > 0:
+        content = content[:last_hr].strip()
+except (ValueError, IndexError):
+    # Fallback: use full template if markers not found
+    content = template
+
+# --- Placeholder substitution ---
+content = content.replace('{{SPEC_PATH}}', spec_path)
+content = content.replace('{{TYPECHECK_CMD}}', '$TYPECHECK_CMD')
+content = content.replace('{{LINT_CMD}}', '$LINT_CMD')
+content = content.replace('{{TEST_CMD}}', '$TEST_CMD')
+content = content.replace('{{CODEBASE_CONTEXT}}', impl_context)
+content = content.replace('{{DIFF}}', diff)
+
+with open('$PROMPT_FILE', 'w') as f:
+    f.write(content)
+" 2>/dev/null
+
+  # Cleanup temp files
+  rm -f "$TEMPLATE_TMP" "$IMPL_TMP" "$DIFF_TMP"
+}
+
 # --- Main iteration loop ---
 for ((i=1; i<=MAX_ITERATIONS; i++)); do
   echo ""
@@ -171,6 +320,10 @@ for ((i=1; i<=MAX_ITERATIONS; i++)); do
   echo "## Iteration $i - $(date)" >> "$PROGRESS_FILE"
   echo "" >> "$PROGRESS_FILE"
 
+  # Assemble prompt fresh each iteration (diff changes after commits)
+  echo -e "${YELLOW}Assembling prompt...${NC}"
+  assemble_prompt
+
   OUTPUT_FILE=$(mktemp)
 
   # Spawn a fresh Claude Code process for this iteration.
@@ -179,12 +332,13 @@ for ((i=1; i<=MAX_ITERATIONS; i++)); do
   # --dangerously-skip-permissions: no TTY for confirmation in -p mode.
   # --max-turns: prevent runaway iterations.
   # --output-format json: structured output for completion detection.
+  # < /dev/null: prevent stdin hang in nested subprocess invocations.
   env -u CLAUDECODE -u CLAUDE_CODE_ENTRYPOINT claude \
     -p "$(cat "$PROMPT_FILE")" \
     --dangerously-skip-permissions \
     --max-turns "$MAX_TURNS" \
     --output-format json \
-    2>&1 | tee "$OUTPUT_FILE" || true
+    < /dev/null 2>&1 | tee "$OUTPUT_FILE" || true
 
   # --- Post-iteration status ---
   TOTAL_STORIES=0
diff --git a/plugins/eng/skills/implement/templates/implement-prompt.template.md b/plugins/eng/skills/implement/templates/implement-prompt.template.md
index 688399b6..7d839ffe 100644
--- a/plugins/eng/skills/implement/templates/implement-prompt.template.md
+++ b/plugins/eng/skills/implement/templates/implement-prompt.template.md
@@ -6,7 +6,7 @@ Impact: Without this template, the iteration prompt must be reconstructed from m
 
 # Implementation Prompt Template
 
-This file contains two complete prompt variants for the iteration agents. The Phase 2 agent selects **ONE** variant, fills all `{{PLACEHOLDERS}}`, and saves the result to `tmp/ship/implement-prompt.md`.
+This file contains two complete prompt variants for the iteration agents. The `implement.sh` script selects the correct variant based on `--spec-path`, fills all `{{PLACEHOLDERS}}`, and saves the result to `tmp/ship/implement-prompt.md`.
 
 **Do NOT include both variants in the saved prompt.** The iteration agent sees a single, unconditional workflow — never both variants, never conditional "if spec is available" logic.
 
@@ -14,11 +14,12 @@ This file contains two complete prompt variants for the iteration agents. The Ph
 | Placeholder | Source | Used in |
 |---|---|---|
-| `{{SPEC_PATH}}` | Path to SPEC.md relative to working directory (e.g., `.claude/specs/my-feature/SPEC.md`) | Variant A only |
-| `{{TYPECHECK_CMD}}` | `--typecheck-cmd` input or default `pnpm typecheck` | Both |
-| `{{LINT_CMD}}` | `--lint-cmd` input or default `pnpm lint` | Both |
-| `{{TEST_CMD}}` | `--test-cmd` input or default `pnpm test --run` | Both |
-| `{{CODEBASE_CONTEXT}}` | Key patterns, shared vocabulary, and abstractions from the target codebase area — see SKILL.md Phase 2 for guidance on what to include | Both |
+| `{{SPEC_PATH}}` | Path to SPEC.md — from `--spec-path` argument, injected by implement.sh | Variant A only |
+| `{{TYPECHECK_CMD}}` | `--typecheck-cmd` argument or default `pnpm typecheck` — injected by implement.sh | Both |
+| `{{LINT_CMD}}` | `--lint-cmd` argument or default `pnpm lint` — injected by implement.sh | Both |
+| `{{TEST_CMD}}` | `--test-cmd` argument or default `pnpm test --run` — injected by implement.sh | Both |
+| `{{CODEBASE_CONTEXT}}` | From spec.json `implementationContext` — injected by implement.sh | Both |
+| `{{DIFF}}` | Cleaned git diff (full if small, stat tree if large) — injected by implement.sh each iteration | Both |
 
 ---
 
@@ -69,6 +70,10 @@ Implement the story. **One story per iteration** — keep changes focused.
 
 {{CODEBASE_CONTEXT}}
 
+**Changes under test:**
+
+{{DIFF}}
+
 ### 5. Verify quality
 
 Run these commands — ALL must pass:
@@ -169,6 +174,10 @@ Implement the story. **One story per iteration** — keep changes focused.
 
 {{CODEBASE_CONTEXT}}
 
+**Changes under test:**
+
+{{DIFF}}
+
 ### 4. Verify quality
 
 Run these commands — ALL must pass:
diff --git a/plugins/eng/skills/qa/SKILL.md b/plugins/eng/skills/qa/SKILL.md
index b4b60c03..62b8735c 100644
--- a/plugins/eng/skills/qa/SKILL.md
+++ b/plugins/eng/skills/qa/SKILL.md
@@ -10,36 +10,19 @@ You are a QA engineer. Your job is to verify that a feature works the way a real
 A feature can pass every unit test and still have a broken layout, a confusing flow, an API that returns the wrong status code, or an interaction that doesn't feel right. Your job is to find those problems before anyone else does.
 
-**Posture: exhaust your tools.** Do not stop at the first level of verification that seems sufficient. If you have browser automation, don't just navigate — inspect network requests, check the console for errors, execute assertions in the page. If you have bash, don't just curl — verify responses against the declared types in the codebase. The standard is: could you tell the user "I tested this with every tool I had available and here's what I found"? If not, you haven't tested enough.
+**Posture: exhaust your tools.** Do not stop at the first level of verification that seems sufficient. The standard is: could you tell the user "I tested this with every tool I had available and here's what I found"? If not, you haven't tested enough.
 
 **Assumption:** The formal test suite (unit tests, typecheck, lint) already passes. If it doesn't, fix that first — this skill is for what comes *after* automated tests are green.
 
 ---
 
-## Workflow
+## Orchestrator Workflow
 
-### Step 1: Detect available tools
+You are the orchestrator. Your job is to: (1) understand what to test, (2) generate a structured test plan as `qa.json`, (3) craft the iteration prompt, (4) run the subprocess loop, and (5) report results.
 
-Probe what testing tools are available. This determines your testing surface area.
+### Phase 1: Gather context and detect tools
 
-| Capability | How to detect | Use for | If unavailable |
-|---|---|---|---|
-| **Shell / CLI** | Always available | API calls (`curl`), CLI verification, data validation, database state checks, process behavior, file/log inspection | — |
-| **Browser automation** | Check if browser interaction tools are accessible | UI testing, form flows, visual verification, full user journey walkthrough, error state rendering, layout audit | Substitute with shell-based API/endpoint testing. Document: "UI not visually verified." |
-| **Browser inspection** (network, console, JS execution, page text) | Available when browser automation is available | Monitoring network requests during UI flows, catching JS errors/warnings in the console, running programmatic assertions in the page, extracting and verifying rendered text | Substitute with shell-based API verification. Document the gap. |
-| **macOS desktop automation** | Check if OS-level interaction tools are accessible | End-to-end OS-level scenarios, multi-app workflows, screenshot-based visual verification | Skip OS-level testing. Document the gap. |
-
-Record what's available. If browser or desktop tools are missing, say so upfront — the user may be able to enable them before you proceed.
-
-**Probe aggressively.** Don't stop at "browser automation is available." Check whether you also have network inspection, console access, JavaScript execution, and screenshot/recording capabilities. Each expands your testing surface area. The more tools you have, the more you should use.
-
-**Cross-skill integration:** When browser automation is available, `Load /browser skill` for structured testing primitives. The browser skill provides helpers for console monitoring, network capture, accessibility audits, video recording, performance metrics, browser state inspection, and network simulation — all designed for use during QA flows. These helpers turn "check the console for errors" into reliable, automatable verification with structured output. Reference `/browser` SKILL.md for the full helper table and usage patterns.
-
-**Get the system running.** Check `AGENTS.md`, `CLAUDE.md`, or similar repo configuration files for build, run, and setup instructions. If the software can be started locally, start it — you cannot test user-facing behavior against a system that isn't running. If the system depends on external services, databases, or environment variables, check what's available and what you can reach. Document anything you cannot start.
-
-### Step 2: Gather context — what are you testing?
-
-Determine what to test from whatever input is available. Check these sources in order; use the first that gives you enough to derive test scenarios:
+**Gather context** — Determine what to test from whatever input is available:
 
@@ -48,166 +31,149 @@ Determine what to test from whatever input is available. Check these sources in
 | Input | How to use it |
 |---|---|
 | **SPEC.md path provided** | |
 | **Feature description provided** | Use it as-is. Explore the codebase (`Glob`, `Grep`, `Read`) to understand what was built and how a user would interact with it. |
 | **"Test what changed"** (or no input) | Run `git diff main...HEAD --stat` to see what files changed. Read the changed files. Infer the feature surface area and user-facing impact. |
 
-**Output of this step:** A mental model of what was built, who uses it, and how they interact with it.
+**Detect available tools** — Probe once for what testing capabilities are available. Record the results — iteration agents will use this directly without re-probing.
-### Step 3: Derive the test plan
+| Tool | How to detect | Value for `availableTools` |
+|---|---|---|
+| Shell / CLI | Always available | `"cli"` |
+| Browser automation | Check if browser interaction tools are accessible (Claude in Chrome, Playwright) | `"browser"` |
+| Browser inspection | Available when browser automation is available (network, console, JS execution) | `"browser-inspection"` |
+| macOS desktop automation | Check if OS-level interaction tools are accessible (Peekaboo MCP) | `"macos"` |
+
+**Determine system startup** — Check `AGENTS.md`, `CLAUDE.md`, or similar repo configuration files for build, run, and setup instructions. If the software can be started locally, document the startup command. This becomes `testContext` in qa.json.
+
+**Output:** A mental model of what was built + the detected `availableTools` list + system startup instructions (`testContext`).
+
+### Phase 2: Generate qa.json
+
+**Load:** `references/testing-guidance.md` — use the scenario categories and formalization gate to derive well-structured scenarios.
+
+From the context gathered in Phase 1, derive concrete test scenarios. For each candidate, apply the **formalization gate**:
+
+> **"Could this be a formal test?"** If yes with easy-to-medium effort — stop. Write that test instead (or flag it). Only proceed with scenarios that genuinely resist automation.
+
+Write each scenario to `qa.json` with:
+- `id`: Sequential `QA-001`, `QA-002`, etc.
+- `title`: Short, descriptive name
+- `description`: What to test and how
+- `category`: One of: `visual`, `e2e-flow`, `error-state`, `edge-case`, `integration`, `api-contract`, `usability`, `failure-mode`
+- `tools`: Which tools are required (must be a subset of `availableTools`)
+- `successCriteria`: Exact pass/fail checklist — each criterion is a discrete verification point
+- `priority`: Sequential 1..N (test order — happy path first, then error states, then edge cases)
+- `passes`: `false` (initial)
+- `result`: `""` (initial)
+- `notes`: `""` (initial)
+
+**qa.json schema:**
+
+```json
+{
+  "project": "project name",
+  "branchName": "current git branch",
+  "description": "what is being tested",
+  "testContext": "how to start the system, URLs, environment setup",
+  "availableTools": ["cli", "browser", "browser-inspection"],
+  "scenarios": [
+    {
+      "id": "QA-001",
+      "title": "Login flow — happy path",
+      "description": "Navigate to login page, enter valid credentials, verify redirect to dashboard",
+      "category": "e2e-flow",
+      "tools": ["browser"],
+      "successCriteria": [
+        "Login page loads without JS errors",
+        "Valid credentials result in redirect to /dashboard",
+        "Dashboard shows user's name in header"
+      ],
+      "priority": 1,
+      "passes": false,
+      "result": "",
+      "notes": ""
+    }
+  ]
+}
+```
 
-From the context gathered in Step 2, identify concrete scenarios that require manual verification. For each candidate scenario, apply the **formalization gate**:
+**Validate:** Run `bun scripts/validate-qa.ts tmp/ship/qa.json` to verify schema integrity. Fix any errors before proceeding.
 
-> **"Could this be a formal test?"** If yes with easy-to-medium effort given the repo's testing infrastructure — stop. Write that test instead (or flag it to the user). Only proceed with scenarios that genuinely resist automation.
+### Phase 3: Template assembly (handled by qa.sh)
 
-**Scenarios that belong in the QA plan:**
+The `qa.sh` script handles all template assembly automatically. It reads `templates/qa-prompt.template.md`, selects the correct variant based on `--spec-path`, extracts `testContext` and `codebaseContext` from qa.json, injects `references/testing-guidance.md`, computes a fresh git diff each iteration, and fills all `{{PLACEHOLDERS}}`.
 
-| Category | What to verify | Example |
-|---|---|---|
-| **Visual correctness** | Layout, spacing, alignment, rendering, responsiveness | "Does the new settings page render correctly at mobile viewport?" |
-| **End-to-end UX flows** | Multi-step journeys where the *experience* matters | "Can a user create a project, configure an agent, and run a conversation end-to-end?" |
-| **Subjective usability** | Does the flow make sense? Labels clear? Error messages helpful? | "When auth fails, does the error message tell the user what to do next?" |
-| **Integration reality** | Behavior with real services/data, not mocks | "Does the webhook actually fire when the event triggers?" |
-| **Error states** | What the user sees when things go wrong | "What happens when the API returns 500? Does the UI show a useful error or a blank page?" |
-| **Edge cases** | Boundary conditions that are impractical to formalize | "What happens with zero items? With 10,000 items? With special characters in the name?" |
-| **Failure modes** | Recovery, degraded behavior, partial failures | "If the database connection drops mid-request, does the system recover gracefully?" |
-| **Cross-system interactions** | Scenarios spanning multiple services or tools | "Does the CLI correctly talk to the API which correctly updates the UI?" |
+**Your job in Phase 2:** Ensure qa.json has rich `codebaseContext` and `testContext` fields. The codebase context becomes the iteration agent's reference for architecture, key patterns, and conventions. The test context tells the agent how to start and access the system under test.
 
-Write each scenario as a discrete test case:
-1. **What you will do** (the action)
-2. **What "pass" looks like** (expected outcome)
-3. **Why it's not a formal test** (justification)
+**You do NOT need to:** read the template, select a variant, fill placeholders, or save a prompt file. The script does all of this.
 
-Create these as task list items to track execution progress.
+### Phase 4: Execute the loop
 
-### Step 4: Persist the QA checklist
+Run the iteration loop. Pass `--spec-path` when a SPEC.md is available (selects Variant A of the prompt template):
 
-If a PR exists, write the QA checklist to the `## Test plan` section of the PR body. **Always update via `gh pr edit --body` — never post QA results as PR comments.**
+```bash
+bash scripts/qa.sh --spec tmp/ship/qa.json --spec-path --force
+```
 
-**Update mechanism:**
+Omit `--spec-path` when no SPEC.md is available (selects Variant B).
 
-1. Read the current PR body: `gh pr view --json body -q '.body'`
-2. If a `## Test plan` section already exists, replace its content with the updated checklist.
-3. If no such section exists, append it to the end of the body.
-4. Write the updated body back: `gh pr edit --body ""`
+The loop spawns fresh Claude Code subprocesses per iteration. Each subprocess:
+1. Reads qa.json for the next incomplete scenario
+2. Executes the test using available tools
+3. Records results back to qa.json
+4. Logs progress to qa-progress.txt
+5. Signals completion when all scenarios pass
 
-Section format:
+Monitor the loop output. If it exits with max iterations reached, check qa.json for incomplete scenarios and qa-progress.txt for blockers.
 
-```md
-## Test plan
+### Phase 5: Report
 
-_Manual QA scenarios that resist automation. Updated as tests complete._
+Read `tmp/ship/qa.json` for final status. Compute:
+- Total scenarios
+- Pass count (result = "pass")
+- Fixed count (result = "fail-fixed")
+- Blocked count (result = "blocked")
+- Skipped count (result = "skipped")
 
-- [ ] **: ** — · _Why not a test: _
-```
+**If a PR exists:** Update the `## Test plan` section of the PR body.
-If no PR exists, maintain the checklist as task list items only. - -### Step 5: Execute — test like a human would - -Work through each scenario. Use the strongest tool available for each. - -**Testing priority: emulate real users first.** Prefer tools that replicate how a user actually interacts with the system. Browser automation over API calls. SDK/client library calls over raw HTTP. Real user journeys over isolated endpoint checks. Fall back to lower-fidelity tools (curl, direct database queries) for parts of the system that are not user-facing or when higher-fidelity tools are unavailable. For parts of the system touched by the changes but not visible to the customer — use server-side observability (logs, telemetry, database state) to verify correctness beneath the surface. - -**Unblock yourself with ad-hoc scripts.** Do not wait for formal test infrastructure, published packages, or CI pipelines. If you need to verify something, write a quick script and run it. Put all throwaway artifacts — scripts, fixtures, test data, temporary configs — in a `tmp/` directory at the repo root (typically gitignored). These are disposable; they don't need to be production-quality. Specific patterns: -- **Quick verification scripts:** Write a script that imports a module, calls a function, and asserts the output. Run it. Delete it when done (or leave it in `tmp/`). -- **Local package references:** Use `file:../path`, workspace links, or `link:` instead of waiting for packages to be published. Test the code as it exists on disk. -- **Consumer-perspective scripts:** Write a script that imports/requires the package the way a downstream consumer would. Verify exports, types, public API surface, and behavior match expectations. -- **REPL exploration:** Use a REPL (node, python, etc.) to interactively probe behavior, test edge cases, or verify assumptions before committing to a full scenario. 
-- **Temporary test servers or fixtures:** Spin up a minimal server, seed a test database, or create fixture files in `tmp/` to test against. Tear them down when done. -- **Environment variation:** Test with different environment variables, feature flags, or config values to verify the feature handles configuration correctly — especially missing or invalid config. - -**With browser automation:** -- Navigate to the feature. Click through it. Fill forms. Submit them. -- Walk the full user journey end-to-end — don't just verify individual pages. -- Audit visual layout — does it look right? Is anything misaligned, clipped, or missing? -- Test error states — submit invalid data, disconnect, trigger edge cases. -- Test at different viewport sizes if the feature is responsive. -- Test keyboard navigation and focus management. -- Record a GIF of multi-step flows when it helps demonstrate the result. - -**With browser inspection (use alongside browser automation — not instead of):** -- **Console monitoring (non-negotiable — do this on every flow):** Start capture BEFORE navigating (`startConsoleCapture`), then check for errors after each major action (`getConsoleErrors`). A page that looks correct but throws JS errors is not correct. Filter logs for specific patterns (`getConsoleLogs` with string/RegExp/function filter) when diagnosing issues. -- **Network request verification:** Start capture BEFORE navigating (`startNetworkCapture` with URL filter like `'/api/'`). After the flow, check for failed requests (`getFailedRequests` — catches 4xx, 5xx, and connection failures). Verify: correct endpoints called, status codes expected, no silent failures. For specific API calls, use `waitForApiResponse` to assert status and inspect response body/JSON. -- **Browser state verification:** After mutations, verify state was persisted correctly. Check `getLocalStorage`, `getSessionStorage`, `getCookies` to confirm the UI action actually wrote expected data. 
Use `clearAllStorage` between test scenarios for clean-state testing. -- **In-page assertions:** Execute JavaScript in the page to verify DOM state, computed styles, data attributes, or application state that isn't visible on screen. Use `getElementBounds` for layout verification (visibility, viewport presence, computed styles). Use this when visual inspection alone can't confirm correctness (e.g., "is this element actually hidden via CSS, or just scrolled off-screen?"). -- **Rendered text verification:** Extract page text to verify content rendering — especially dynamic content, interpolated values, and conditional text. - -**With browser-based quality signals (when /browser primitives are available):** -- **Accessibility audit:** Run `runAccessibilityAudit` on each major page/view. Report WCAG violations by impact level (critical > serious > moderate). Test keyboard focus order with `checkFocusOrder` — verify tab navigation follows logical reading order, especially on new or changed UI. -- **Performance baseline:** After page load, capture `capturePerformanceMetrics` to check for obvious regressions — TTFB, FCP, LCP, CLS. You're not doing formal perf testing; you're catching "this page takes 8 seconds to load" or "layout shifts when the hero image loads." -- **Video recording:** For complex multi-step flows, record with `createVideoContext`. Attach recordings to QA results as evidence. Especially useful for flows that involve timing, animations, or state transitions that are hard to capture in a screenshot. -- **Responsive verification:** Run `captureResponsiveScreenshots` to sweep standard breakpoints (mobile/tablet/desktop/wide). Compare screenshots for layout breakage, clipping, or missing elements across viewports. -- **Degraded conditions:** Test with `simulateSlowNetwork` (e.g., 500ms latency) and `blockResources` (block images/fonts) to verify graceful degradation. Test `simulateOffline` if the feature has offline handling. 
These helpers compose with `page.route()` mocks via `route.fallback()`. -- **Dialog handling:** Use `handleDialogs` before navigating to auto-accept/dismiss alerts, confirms, and prompts — then inspect `captured.dialogs` to verify the right dialogs fired. Use `dismissOverlays` to auto-dismiss cookie banners and consent popups that block interaction during test flows. -- **Page structure discovery:** Use `getPageStructure` to get the accessibility tree with suggested selectors. Useful for verifying ARIA roles, element discoverability, and building selectors for unfamiliar pages. Pass `{ interactiveOnly: true }` to focus on actionable elements. -- **Tracing:** Use `startTracing`/`stopTracing` to capture a full Playwright trace (.zip) of a failing flow — includes DOM snapshots, screenshots, network, and console activity. View with `npx playwright show-trace`. -- **PDF & download verification:** Use `generatePdf` to verify PDF export features. Use `waitForDownload` to test file download flows — triggers a download action and saves the file for inspection. - -**With macOS desktop automation:** -- Test OS-level interactions when relevant — file dialogs, clipboard, multi-app workflows. -- Take screenshots for visual verification. - -**With shell / CLI (always available):** -- `curl` API endpoints. Verify status codes, response shapes, error responses. -- **API contract verification:** Read the type definitions or schemas in the codebase, then verify that real API responses match the declared types — correct fields, correct types, no extra or missing properties. This catches drift between types and runtime behavior. -- Test CLI commands with valid and invalid input. -- Verify file outputs, logs, process behavior. -- Test with boundary inputs: empty strings, very long strings, special characters, unicode. -- Test concurrent operations if relevant: can two requests race? 
- -**Data integrity verification (after any mutation):** -- Before the mutation: record the relevant state (database row, file contents, API response). -- Perform the mutation via the UI or API. -- After the mutation: verify the state changed correctly — right values written, no unintended side effects on related data, timestamps/audit fields updated. -- This catches mutations that appear to succeed (200 OK, UI updates) but write wrong values, miss fields, or corrupt related state. - -**Server-side observability (when available):** -Changes touch more of the system than what's visible to the user. After exercising user-facing flows, check server-side signals for problems that wouldn't surface in the browser or API response. -- **Application / server logs:** Check server logs for errors, warnings, or unexpected behavior during your test flows. Tail logs while running browser or API tests. -- **Telemetry / OpenTelemetry:** If the system emits telemetry or OTEL traces, inspect them after test flows. Verify: traces are emitted for the expected operations, spans have correct attributes, no error spans where success is expected. -- **Database state:** Query the database directly to verify mutations wrote correct values — especially when the API or UI reports success but the actual persistence could differ. -- **Background jobs / queues:** If the feature triggers async work (queues, cron, webhooks), verify the jobs were enqueued and completed correctly. - -**General testing approach:** -1. Start from a clean state (no cached data, fresh session). -2. Walk the happy path first — end-to-end as the spec describes. -3. Then break it — try every failure mode you identified. -4. Then stress it — boundary conditions, unexpected inputs, concurrent access. -5. Then look at it — visual correctness, usability, "does this feel right?" 
-
-### Step 6: Record results
-
-After each scenario (or batch of related scenarios), update the `## Test plan` section in the PR body using the same read → modify → write mechanism from Step 4. **The checklist in the PR body is the single source of truth — do not post results as PR comments.**
-
-| Result | How to record |
-|---|---|
-| **Pass** | Check the box: `- [x]` |
-| **Fail → fixed** | Check the box, append: `— Fixed: <description>` |
-| **Fail → blocked** | Leave unchecked, append: `— BLOCKED: <reason>` |
-| **Skipped (tool limitation)** | Leave unchecked, append: `— Skipped: <reason>` |
+1. Read the current PR body: `gh pr view --json body -q '.body'`
+2. If a `## Test plan` section exists, replace its content. Otherwise append it.
+3. Write the updated body: `gh pr edit --body "<updated body>"`
 
-**When you find a bug:**
+Section format:
 
-First, assess: do you see the root cause, or just the symptom?
+```md
+## Test plan
 
-- **Root cause is obvious** (wrong variable, missing class, off-by-one visible in the code) — fix it directly. Write a test if possible, verify, document.
-- **Root cause is unclear** (unexpected behavior, cause not visible from the symptom) — load `/debug` for systematic root cause investigation before attempting a fix. QA resumes after the fix is verified.
+_QA scenarios — verified via subprocess iteration loop._
 
-### Step 7: Report
+- [x] **e2e-flow: Login flow — happy path** — pass
+- [x] **error-state: Invalid credentials** — fail-fixed: missing error message, added in [qa-fix] commit
+- [ ] **visual: Mobile responsive layout** — blocked: browser automation unavailable
+- [ ] **edge-case: Empty project name** — skipped: requires macOS automation
 
-**If a PR exists:** The `## Test plan` section in the PR body is your primary report. Ensure it's up-to-date with all results (pass/fail/fixed/blocked/skipped). Do not add a separate PR comment — the PR body section is the report.
+**Summary:** N/M scenarios passing. K bugs found and fixed. J gaps documented.
+```
 
 **If no PR exists:** Report directly to the user with:
-
 - Total scenarios tested vs. passed vs. failed vs. skipped
 - Bugs found and fixed (with brief description of each)
 - Gaps — what could NOT be tested due to tool limitations or environment constraints
 - Judgment call — your honest assessment: is this feature ready for human review?
 
-The skill's job is to fix what it can, document what it found, and hand back a clear picture. Unresolvable issues and gaps are documented, not silently swallowed — but they do not block forward progress. The invoker (user or /ship) decides what to do about remaining items.
+---
+
+## State files
+
+| File | What it holds | Created | Updated | Read by |
+|---|---|---|---|---|
+| `tmp/ship/qa.json` | Test scenarios — success criteria, tools, priority, pass/fail status, results | Phase 2 (orchestrator) | Each iteration (sets `passes`, `result`, `notes`) | qa.sh, iterations, orchestrator |
+| `tmp/ship/qa-progress.txt` | Iteration log — what was tested, findings, learnings, blockers | Phase 4 start (qa.sh) | Each iteration (append) | Iterations, orchestrator |
+| `tmp/ship/qa-prompt.md` | Assembled iteration prompt (from template) | Phase 4 (qa.sh) | Each iteration (reassembled with a fresh diff) | Each iteration subprocess |
 
 ---
 
 ## Calibrating depth to risk
 
-Not every feature needs deep QA. Match effort to risk:
+Not every feature needs deep QA. Match scenario count and depth to risk:
 
 | What changed | Testing depth |
 |---|---|
@@ -217,10 +183,6 @@ Not every feature needs deep QA. Match effort to risk:
 | Glue code, config, pass-through | Light — verify it connects correctly. Don't over-test plumbing. |
 | Performance-sensitive paths | Targeted — benchmark the specific path if tools allow |
 
-**Over-testing looks like:** Manually verifying things already covered by passing unit tests. Clicking through UIs that haven't changed. Testing framework behavior instead of feature behavior.
- -**Under-testing looks like:** Declaring confidence from unit tests alone when the feature has user-facing surfaces. Skipping error-path testing. Not testing the interaction between new and existing code. Never opening the UI. - --- ## Anti-patterns diff --git a/plugins/eng/skills/qa/references/testing-guidance.md b/plugins/eng/skills/qa/references/testing-guidance.md new file mode 100644 index 00000000..fc2fdcb5 --- /dev/null +++ b/plugins/eng/skills/qa/references/testing-guidance.md @@ -0,0 +1,170 @@ +Use when: Generating qa.json scenarios, injecting into QA iteration prompts, calibrating test depth +Priority: P0 +Impact: Without this, scenarios lack structure, testing is shallow, tool capabilities underused + +--- + +# Testing Guidance + +This is the authoritative reference for *how to test*. It covers tool capabilities, scenario categories, execution patterns, bug handling, and depth calibration. + +The orchestrator uses this to generate well-structured qa.json scenarios. The full content is injected into the iteration prompt so each subprocess has the complete testing methodology. + +--- + +## Tool capabilities + +| Capability | Use for | If unavailable | +|---|---|---| +| **Shell / CLI** (always available) | API calls (`curl`), CLI verification, data validation, database state checks, process behavior, file/log inspection | — | +| **Browser automation** | UI testing, form flows, visual verification, full user journey walkthrough, error state rendering, layout audit | Substitute with shell-based API/endpoint testing. Document: "UI not visually verified." | +| **Browser inspection** (network, console, JS execution, page text) | Monitoring network requests during UI flows, catching JS errors/warnings in the console, running programmatic assertions in the page, extracting and verifying rendered text | Substitute with shell-based API verification. Document the gap. 
| +| **macOS desktop automation** | End-to-end OS-level scenarios, multi-app workflows, screenshot-based visual verification | Skip OS-level testing. Document the gap. | + +**Posture: exhaust your tools.** Do not stop at the first level of verification that seems sufficient. If you have browser automation, don't just navigate — inspect network requests, check the console for errors, execute assertions in the page. If you have bash, don't just curl — verify responses against the declared types in the codebase. + +**Cross-skill integration:** When browser automation is available, `Load /browser skill` for structured testing primitives. The browser skill provides helpers for console monitoring, network capture, accessibility audits, video recording, performance metrics, browser state inspection, and network simulation. + +--- + +## Scenario categories + +When deriving test scenarios, apply the **formalization gate** to each candidate: + +> **"Could this be a formal test?"** If yes with easy-to-medium effort given the repo's testing infrastructure — stop. Write that test instead (or flag it). Only proceed with scenarios that genuinely resist automation. + +| Category | What to verify | Example | +|---|---|---| +| **visual** | Layout, spacing, alignment, rendering, responsiveness | "Does the new settings page render correctly at mobile viewport?" | +| **e2e-flow** | Multi-step journeys where the *experience* matters | "Can a user create a project, configure an agent, and run a conversation end-to-end?" | +| **usability** | Does the flow make sense? Labels clear? Error messages helpful? | "When auth fails, does the error message tell the user what to do next?" | +| **integration** | Behavior with real services/data, not mocks | "Does the webhook actually fire when the event triggers?" | +| **error-state** | What the user sees when things go wrong | "What happens when the API returns 500? Does the UI show a useful error or a blank page?" 
| 
+| **edge-case** | Boundary conditions that are impractical to formalize | "What happens with zero items? With 10,000 items? With special characters in the name?" |
+| **failure-mode** | Recovery, degraded behavior, partial failures | "If the database connection drops mid-request, does the system recover gracefully?" |
+| **api-contract** | API responses match declared types/schemas | "Do the real API responses match the types declared in the codebase: correct fields, correct types, no drift?" |
+
+Each scenario must have:
+1. **What you will do** (the action)
+2. **What "pass" looks like** (expected outcome — becomes `successCriteria`)
+3. **Why it's not a formal test** (justification)
+
+---
+
+## Execution patterns
+
+### Testing priority
+
+**Emulate real users first.** Prefer tools that replicate how a user actually interacts with the system. Browser automation over API calls. SDK/client library calls over raw HTTP. Real user journeys over isolated endpoint checks. Fall back to lower-fidelity tools for parts that are not user-facing or when higher-fidelity tools are unavailable.
+
+### General testing approach
+
+1. Start from a clean state (no cached data, fresh session).
+2. Walk the happy path first — end-to-end as the spec describes.
+3. Then break it — try every failure mode you identified.
+4. Then stress it — boundary conditions, unexpected inputs, concurrent access.
+5. Then look at it — visual correctness, usability, "does this feel right?"
+
+### Ad-hoc scripts
+
+Do not wait for formal test infrastructure. If you need to verify something, write a quick script and run it. Put throwaway artifacts in `tmp/` (typically gitignored). Patterns:
+- **Quick verification scripts:** Import a module, call a function, assert the output.
+- **Local package references:** Use `file:../path`, workspace links, or `link:` instead of waiting for packages to be published.
+- **Consumer-perspective scripts:** Import/require the package the way a downstream consumer would.
Verify exports, types, public API surface. +- **REPL exploration:** Use a REPL to interactively probe behavior and test edge cases. +- **Temporary test servers or fixtures:** Spin up a minimal server, seed a test database, create fixture files in `tmp/`. +- **Environment variation:** Test with different environment variables, feature flags, or config values. + +### Browser automation + +- Navigate to the feature. Click through it. Fill forms. Submit them. +- Walk the full user journey end-to-end — don't just verify individual pages. +- Audit visual layout — does it look right? Is anything misaligned, clipped, or missing? +- Test error states — submit invalid data, disconnect, trigger edge cases. +- Test at different viewport sizes if the feature is responsive. +- Test keyboard navigation and focus management. +- Record a GIF of multi-step flows when it helps demonstrate the result. + +### Browser inspection (use alongside browser automation — not instead of) + +- **Console monitoring (non-negotiable — do this on every flow):** Start capture BEFORE navigating (`startConsoleCapture`), then check for errors after each major action (`getConsoleErrors`). A page that looks correct but throws JS errors is not correct. +- **Network request verification:** Start capture BEFORE navigating (`startNetworkCapture` with URL filter like `'/api/'`). After the flow, check for failed requests (`getFailedRequests`). Verify: correct endpoints called, status codes expected, no silent failures. +- **Browser state verification:** After mutations, verify state was persisted correctly. Check `getLocalStorage`, `getSessionStorage`, `getCookies`. +- **In-page assertions:** Execute JavaScript in the page to verify DOM state, computed styles, data attributes, or application state not visible on screen. +- **Rendered text verification:** Extract page text to verify content rendering — especially dynamic content, interpolated values, and conditional text. 
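When browser tooling is thin, the rendered-text check above degrades to shell. A sketch under stated assumptions: a local HTML file stands in for the served page, and the expected strings are hypothetical:

```shell
# Stand-in for a fetched page (a real run would curl the URL instead).
mkdir -p tmp
cat > tmp/page.html <<'EOF'
<html><body><h1>Projects (3)</h1><p>Welcome back, Ada</p></body></html>
EOF

# Strip tags, then assert the dynamic, interpolated content actually rendered.
text=$(sed -e 's/<[^>]*>/ /g' tmp/page.html)
echo "$text" | grep -q 'Projects (3)' && echo "count rendered"
echo "$text" | grep -q 'Welcome back, Ada' && echo "greeting rendered"
```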
+ +### Browser-based quality signals (when /browser primitives are available) + +- **Accessibility audit:** Run `runAccessibilityAudit` on each major page/view. Report WCAG violations by impact level. +- **Performance baseline:** Capture `capturePerformanceMetrics` to check for obvious regressions — TTFB, FCP, LCP, CLS. +- **Video recording:** For complex multi-step flows, record with `createVideoContext`. +- **Responsive verification:** Run `captureResponsiveScreenshots` to sweep standard breakpoints. +- **Degraded conditions:** Test with `simulateSlowNetwork` and `blockResources` to verify graceful degradation. +- **Dialog handling:** Use `handleDialogs` before navigating. Use `dismissOverlays` for cookie banners. +- **Page structure discovery:** Use `getPageStructure` for accessibility tree and selectors. +- **Tracing:** Use `startTracing`/`stopTracing` for a full Playwright trace of failing flows. +- **PDF & download verification:** Use `generatePdf` and `waitForDownload` for file-based features. + +### macOS desktop automation + +- Test OS-level interactions when relevant — file dialogs, clipboard, multi-app workflows. +- Take screenshots for visual verification. + +### Shell / CLI (always available) + +- `curl` API endpoints. Verify status codes, response shapes, error responses. +- **API contract verification:** Read type definitions/schemas in the codebase, then verify real API responses match declared types. +- Test CLI commands with valid and invalid input. +- Verify file outputs, logs, process behavior. +- Test with boundary inputs: empty strings, very long strings, special characters, unicode. +- Test concurrent operations if relevant: can two requests race? + +### Data integrity verification (after any mutation) + +- Before the mutation: record the relevant state. +- Perform the mutation via UI or API. +- After the mutation: verify the state changed correctly — right values written, no unintended side effects, timestamps/audit fields updated. 
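The before/mutate/after checklist above fits in a few lines of shell. A sketch in which a JSON file on disk plays the role of the database row; the field names are hypothetical:

```shell
mkdir -p tmp
echo '{"name":"old","status":"draft"}' > tmp/row.json

before=$(cat tmp/row.json)                              # 1. record state before
echo '{"name":"new","status":"draft"}' > tmp/row.json   # 2. perform the mutation (stand-in)
after=$(cat tmp/row.json)                               # 3. re-read state

[ "$before" != "$after" ] && echo "state changed"
grep -q '"name":"new"' tmp/row.json && echo "right value written"
grep -q '"status":"draft"' tmp/row.json && echo "unrelated field untouched"
```

Against a real system, steps 1 and 3 would be a database query or API `GET`, and step 2 the UI or API action under test.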
+ +### Server-side observability (when available) + +- **Application / server logs:** Check for errors, warnings, or unexpected behavior during test flows. +- **Telemetry / OpenTelemetry:** Inspect traces after flows. Verify spans and attributes. +- **Database state:** Query directly to verify mutations wrote correct values. +- **Background jobs / queues:** Verify async work was enqueued and completed correctly. + +--- + +## Bug handling + +When you find a bug, assess: do you see the root cause, or just the symptom? + +- **Root cause is obvious** (wrong variable, missing class, off-by-one visible in the code) — fix it directly. Write a test if possible, verify, document. +- **Root cause is unclear** (unexpected behavior, cause not visible from the symptom) — load `/debug` for systematic root cause investigation before attempting a fix. QA resumes after the fix is verified. + +--- + +## Calibrating depth to risk + +Not every feature needs deep QA. Match effort to risk: + +| What changed | Testing depth | +|---|---| +| New user-facing feature (UI, API, CLI) | Deep — full journey walkthrough, error states, visual audit, edge cases | +| Business logic, data mutations, auth/permissions | Deep — verify behavior matches spec, test failure modes thoroughly | +| Bug fix | Targeted — verify the fix, test the regression path, check for side effects | +| Glue code, config, pass-through | Light — verify it connects correctly. Don't over-test plumbing. | +| Performance-sensitive paths | Targeted — benchmark the specific path if tools allow | + +**Over-testing looks like:** Manually verifying things already covered by passing unit tests. Clicking through UIs that haven't changed. Testing framework behavior instead of feature behavior. + +**Under-testing looks like:** Declaring confidence from unit tests alone when the feature has user-facing surfaces. Skipping error-path testing. Not testing the interaction between new and existing code. Never opening the UI. 
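The server-log check described under "Server-side observability" above can wrap any test flow. A sketch with a simulated log; in a real run you would tail the live log file (path hypothetical) while the flow executes:

```shell
# Simulated server log produced during a test flow.
mkdir -p tmp
LOG=tmp/server.log
{
  echo "INFO  GET /api/tasks 200"
  echo "INFO  POST /api/tasks 201"
  echo "ERROR unhandled rejection in webhook dispatch"
} > "$LOG"

# After exercising the flow, scan for trouble the UI never surfaced.
grep -E '^(ERROR|WARN)' "$LOG"
# → ERROR unhandled rejection in webhook dispatch
```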
+ +--- + +## Anti-patterns + +- **Treating QA as a checkbox.** "I tested it" means nothing without specifics. Every scenario must have a concrete action and expected outcome. +- **Only testing the happy path.** Real users encounter errors, edge cases, and unexpected states. Test those. +- **Duplicating formal tests.** If the test suite already covers it, don't repeat it manually. Your time is for what the test suite *can't* do. +- **Skipping tools that are available.** If browser automation is available and the feature has a UI — use it. Don't substitute with curl when you can click through the real thing. +- **Silent gaps.** If you can't test something, say so explicitly. An undocumented gap is worse than a documented one. diff --git a/plugins/eng/skills/qa/scripts/qa.sh b/plugins/eng/skills/qa/scripts/qa.sh new file mode 100755 index 00000000..70410fe2 --- /dev/null +++ b/plugins/eng/skills/qa/scripts/qa.sh @@ -0,0 +1,376 @@ +#!/bin/bash +set -e + +# QA Iteration Loop +# Spawns independent Claude Code processes to execute test scenarios from qa.json. +# Each iteration is a fresh process with no memory — state persists via files. 
+# +# This script also assembles the iteration prompt from the template by injecting: +# - {{TESTING_GUIDANCE}} from references/testing-guidance.md +# - {{TEST_CONTEXT}} from qa.json.testContext +# - {{CODEBASE_CONTEXT}} from qa.json.codebaseContext +# - {{SPEC_PATH}} from --spec-path argument +# - {{DIFF}} cleaned git diff (full or stat tree depending on size) + +# --- Ship directory (configurable via CLAUDE_SHIP_DIR env var) --- +SHIP_DIR="${CLAUDE_SHIP_DIR:-tmp/ship}" + +# --- Defaults --- +MAX_ITERATIONS=15 +MAX_TURNS=50 +TEMPLATE_FILE="" +QA_FILE="$SHIP_DIR/qa.json" +PROGRESS_FILE="$SHIP_DIR/qa-progress.txt" +PROMPT_FILE="$SHIP_DIR/qa-prompt.md" +DIFF_FILE="$SHIP_DIR/qa-diff.txt" +SPEC_PATH="" +FORCE=false + +# ~5-10K tokens ≈ 30K characters +DIFF_MAX_CHARS=30000 + +# --- Script paths --- +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +VALIDATE_SCRIPT="$SCRIPT_DIR/validate-qa.ts" +TESTING_GUIDANCE="$SCRIPT_DIR/../references/testing-guidance.md" + +# --- Noise filters (same as ship-stop-hook.sh) --- +FILTER_LOCK='(pnpm-lock\.yaml|package-lock\.json|yarn\.lock|\.lock$)' +FILTER_SHIP='^.. 
tmp/ship/' +FILTER_BUILD='(dist|\.next|build|\.turbo|node_modules)/' + +# --- Colors --- +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' + +# --- Usage --- +usage() { + echo "Usage: qa.sh [OPTIONS]" + echo "" + echo "Options:" + echo " --max-iterations N Max iteration loops (default: 15)" + echo " --max-turns N Max agentic turns per iteration (default: 50)" + echo " --template FILE Prompt template file (auto-detected from skill if omitted)" + echo " --spec FILE QA JSON file path (default: $SHIP_DIR/qa.json)" + echo " --spec-path PATH Path to SPEC.md (for template variant A)" + echo " --force Skip uncommitted changes prompt" + echo " -h, --help Show this help" + exit 0 +} + +# --- Parse arguments --- +while [[ $# -gt 0 ]]; do + case $1 in + --max-iterations) MAX_ITERATIONS="$2"; shift 2 ;; + --max-turns) MAX_TURNS="$2"; shift 2 ;; + --template) TEMPLATE_FILE="$2"; shift 2 ;; + --spec) QA_FILE="$2"; shift 2 ;; + --spec-path) SPEC_PATH="$2"; shift 2 ;; + --force) FORCE=true; shift ;; + -h|--help) usage ;; + *) echo "Unknown option: $1"; usage ;; + esac +done + +# --- Auto-detect template if not provided --- +if [[ -z "$TEMPLATE_FILE" ]]; then + TEMPLATE_FILE="$SCRIPT_DIR/../templates/qa-prompt.template.md" +fi + +echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}" +echo -e "${BLUE} QA Iteration Loop${NC}" +echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}" + +# --- Get current branch --- +CURRENT_BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") + +echo -e "${YELLOW}Branch:${NC} $CURRENT_BRANCH" +echo -e "${YELLOW}Max iterations:${NC} $MAX_ITERATIONS" +echo -e "${YELLOW}Max turns:${NC} $MAX_TURNS" +echo -e "${YELLOW}Template:${NC} $TEMPLATE_FILE" +echo -e "${YELLOW}QA JSON:${NC} $QA_FILE" +echo "" + +# --- Check required files --- +if [[ ! -f "$TEMPLATE_FILE" ]]; then + echo -e "${RED}Error: Template file not found: $TEMPLATE_FILE${NC}" + exit 1 +fi + +if [[ ! 
-f "$QA_FILE" ]]; then + echo -e "${RED}Error: QA JSON file not found: $QA_FILE${NC}" + echo "Run /qa to generate qa.json first." + exit 1 +fi + +if [[ ! -f "$TESTING_GUIDANCE" ]]; then + echo -e "${RED}Error: Testing guidance not found: $TESTING_GUIDANCE${NC}" + exit 1 +fi + +# --- Check for uncommitted changes --- +if ! git diff-index --quiet HEAD -- 2>/dev/null; then + if [[ "$FORCE" == true ]]; then + echo -e "${YELLOW}Warning: Uncommitted changes detected (continuing with --force)${NC}" + else + echo -e "${YELLOW}Warning: You have uncommitted changes.${NC}" + echo "The QA loop may commit bug fixes. Consider committing or stashing first." + echo "Use --force to skip this check." + echo "" + read -p "Continue anyway? (y/N) " -n 1 -r + echo + if [[ ! $REPLY =~ ^[Yy]$ ]]; then + exit 1 + fi + fi +fi + +# --- Initialize progress file --- +if [[ ! -f "$PROGRESS_FILE" ]]; then + echo "# QA Progress Log - $CURRENT_BRANCH" > "$PROGRESS_FILE" + echo "" >> "$PROGRESS_FILE" + echo "Started: $(date)" >> "$PROGRESS_FILE" + echo "" >> "$PROGRESS_FILE" + echo "---" >> "$PROGRESS_FILE" + echo "" >> "$PROGRESS_FILE" +fi + +# --- Compute cleaned diff (full or stat tree based on size) --- +compute_diff() { + local MERGE_BASE + MERGE_BASE=$(git merge-base main HEAD 2>/dev/null || echo "HEAD~10") + + # Full cleaned diff + local FULL_DIFF + FULL_DIFF=$(git diff "$MERGE_BASE"...HEAD \ + -- ':!*.lock' ':!*lock.json' ':!*lock.yaml' \ + ':!tmp/' ':!dist/' ':!build/' ':!.next/' ':!.turbo/' ':!node_modules/' \ + 2>/dev/null || echo "") + + local DIFF_SIZE=${#FULL_DIFF} + + if [[ "$DIFF_SIZE" -le "$DIFF_MAX_CHARS" && "$DIFF_SIZE" -gt 0 ]]; then + # Small enough — use full diff + echo "# Git diff (full — $(echo "$FULL_DIFF" | wc -l | tr -d ' ') lines)" + echo "" + echo "$FULL_DIFF" + elif [[ "$DIFF_SIZE" -gt "$DIFF_MAX_CHARS" ]]; then + # Too large — fall back to stat tree + echo "# Git diff (stat tree — full diff too large at ~$((DIFF_SIZE / 4)) tokens)" + echo "" + echo "Full diff exceeds 
token budget. Showing file-level summary." + echo "Read specific files with the Read tool for details." + echo "" + git diff --stat "$MERGE_BASE"...HEAD \ + -- ':!*.lock' ':!*lock.json' ':!*lock.yaml' \ + ':!tmp/' ':!dist/' ':!build/' ':!.next/' ':!.turbo/' ':!node_modules/' \ + 2>/dev/null || echo "(no stat available)" + else + echo "# Git diff" + echo "" + echo "(no changes detected against main)" + fi +} + +# --- Assemble prompt from template --- +assemble_prompt() { + # Extract dynamic fields from qa.json + local TEST_CONTEXT CODEBASE_CONTEXT + if command -v jq &> /dev/null; then + TEST_CONTEXT=$(jq -r '.testContext // ""' "$QA_FILE") + CODEBASE_CONTEXT=$(jq -r '.codebaseContext // ""' "$QA_FILE") + else + TEST_CONTEXT="" + CODEBASE_CONTEXT="" + fi + + # Compute fresh diff + local DIFF + DIFF=$(compute_diff) + + # Write diff file for reference + echo "$DIFF" > "$DIFF_FILE" + + # Write content to temp files for safe Python substitution + local TEMPLATE_TMP TC_FILE CCC_FILE DIFF_CONTENT_FILE + TEMPLATE_TMP=$(mktemp) + TC_FILE=$(mktemp) + CCC_FILE=$(mktemp) + DIFF_CONTENT_FILE=$(mktemp) + cat "$TEMPLATE_FILE" > "$TEMPLATE_TMP" + echo "$TEST_CONTEXT" > "$TC_FILE" + echo "$CODEBASE_CONTEXT" > "$CCC_FILE" + echo "$DIFF" > "$DIFF_CONTENT_FILE" + + # Use python3 for variant selection + multi-line placeholder substitution + python3 -c " +import sys + +with open('$TEMPLATE_TMP', 'r') as f: + template = f.read() +with open('$TC_FILE', 'r') as f: + tc = f.read().strip() +with open('$CCC_FILE', 'r') as f: + ccc = f.read().strip() +with open('$DIFF_CONTENT_FILE', 'r') as f: + diff = f.read().strip() + +spec_path = '$SPEC_PATH' + +# --- Variant selection --- +# Select Variant A (with SPEC.md) or Variant B (without) based on --spec-path +if spec_path: + marker_start = '## Variant A' + marker_end = '*End of Variant A*' +else: + marker_start = '## Variant B' + marker_end = '*End of Variant B*' + +try: + start_idx = template.index(marker_start) + end_idx = 
template.index(marker_end) + section = template[start_idx:end_idx] + # Extract prompt content: between first --- and last --- in the section + first_hr = section.index('\n---\n') + 5 + content = section[first_hr:].rstrip() + last_hr = content.rfind('\n---') + if last_hr > 0: + content = content[:last_hr].strip() +except (ValueError, IndexError): + # Fallback: use full template if markers not found + content = template + +# --- Placeholder substitution --- +content = content.replace('{{TEST_CONTEXT}}', tc) +content = content.replace('{{CODEBASE_CONTEXT}}', ccc) +content = content.replace('{{TESTING_GUIDANCE}}', open('$TESTING_GUIDANCE').read().strip()) +content = content.replace('{{SPEC_PATH}}', spec_path) +content = content.replace('{{DIFF}}', diff) + +with open('$PROMPT_FILE', 'w') as f: + f.write(content) +" 2>/dev/null + + # Cleanup temp files + rm -f "$TEMPLATE_TMP" "$TC_FILE" "$CCC_FILE" "$DIFF_CONTENT_FILE" +} + +# --- Main iteration loop --- +for ((i=1; i<=MAX_ITERATIONS; i++)); do + echo "" + echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}" + echo -e "${GREEN} QA Iteration $i of $MAX_ITERATIONS${NC}" + echo -e "${BLUE}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}" + echo "" + + echo "## Iteration $i - $(date)" >> "$PROGRESS_FILE" + echo "" >> "$PROGRESS_FILE" + + # Assemble prompt fresh each iteration (diff may change after qa-fix commits) + echo -e "${YELLOW}Assembling prompt...${NC}" + assemble_prompt + + OUTPUT_FILE=$(mktemp) + + # Spawn a fresh Claude Code process for this iteration. + # env -u: unset vars that prevent nested Claude Code sessions. + # -p: non-interactive mode (prompt from file). + # --dangerously-skip-permissions: no TTY for confirmation in -p mode. + # --max-turns: prevent runaway iterations. + # --output-format json: structured output for completion detection. + # < /dev/null: prevent stdin hang in nested subprocess invocations. 
+  env -u CLAUDECODE -u CLAUDE_CODE_ENTRYPOINT claude \
+    -p "$(cat "$PROMPT_FILE")" \
+    --dangerously-skip-permissions \
+    --max-turns "$MAX_TURNS" \
+    --output-format json \
+    < /dev/null 2>&1 | tee "$OUTPUT_FILE" || true
+
+  # --- Post-iteration status ---
+  TOTAL_SCENARIOS=0
+  PASSING_SCENARIOS=0
+  if command -v jq &> /dev/null; then
+    TOTAL_SCENARIOS=$(jq '.scenarios | length' "$QA_FILE" 2>/dev/null || echo "0")
+    PASSING_SCENARIOS=$(jq '[.scenarios[] | select(.passes == true)] | length' "$QA_FILE" 2>/dev/null || echo "0")
+  else
+    # grep -c prints the count even when it exits non-zero (zero matches),
+    # so don't append a second "0" with || echo — default only if empty.
+    TOTAL_SCENARIOS=$(grep -c '"id"' "$QA_FILE" 2>/dev/null || true)
+    TOTAL_SCENARIOS=${TOTAL_SCENARIOS:-0}
+    PASSING_SCENARIOS=$(grep -c '"passes"[[:space:]]*:[[:space:]]*true' "$QA_FILE" 2>/dev/null || true)
+    PASSING_SCENARIOS=${PASSING_SCENARIOS:-0}
+  fi
+  echo ""
+  echo -e "${GREEN}Iteration $i complete. Scenarios: $PASSING_SCENARIOS/$TOTAL_SCENARIOS passing.${NC}"
+
+  # --- Optional: validate qa.json integrity after iteration ---
+  if command -v bun &> /dev/null && [[ -f "$VALIDATE_SCRIPT" ]]; then
+    if ! bun "$VALIDATE_SCRIPT" "$QA_FILE" >/dev/null; then
+      echo -e "${YELLOW}Warning: qa.json validation issue detected after iteration $i.${NC}"
+      echo "qa.json validation warning after iteration $i" >> "$PROGRESS_FILE"
+      echo "" >> "$PROGRESS_FILE"
+    fi
+  fi
+
+  # Check for completion signal.
+  # The iteration agent outputs QA COMPLETE when all scenarios pass.
+  if grep -q "QA COMPLETE" "$OUTPUT_FILE"; then
+    # Verify qa.json actually has all scenarios complete.
+    # Prevents premature exit from false completion signals.
+    INCOMPLETE=0
+    if command -v jq &> /dev/null; then
+      INCOMPLETE=$(jq '[.scenarios[] | select(.passes == false)] | length' "$QA_FILE" 2>/dev/null || echo "0")
+    else
+      # grep -c prints the count even when it exits non-zero (zero matches),
+      # so don't append a second "0" with || echo — default only if empty.
+      INCOMPLETE=$(grep -c '"passes"[[:space:]]*:[[:space:]]*false' "$QA_FILE" 2>/dev/null || true)
+      INCOMPLETE=${INCOMPLETE:-0}
+    fi
+
+    if [[ "$INCOMPLETE" -gt 0 ]]; then
+      echo -e "${YELLOW}Warning: Completion signal detected but $INCOMPLETE scenarios still incomplete.${NC}"
+      echo -e "${YELLOW}Continuing iteration...${NC}"
+      echo "Warning: false completion signal — $INCOMPLETE scenarios still at passes: false" >> "$PROGRESS_FILE"
+      echo "" >> "$PROGRESS_FILE"
+      rm -f "$OUTPUT_FILE"
+      sleep 2
+      continue
+    fi
+
+    rm -f "$OUTPUT_FILE"
+    echo ""
+    echo -e "${GREEN}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
+    echo -e "${GREEN}  All QA scenarios complete.${NC}"
+    echo -e "${GREEN}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
+    echo "" >> "$PROGRESS_FILE"
+    echo "## QA COMPLETED - $(date)" >> "$PROGRESS_FILE"
+
+    # Print summary
+    if command -v jq &> /dev/null; then
+      PASS_COUNT=$(jq '[.scenarios[] | select(.result == "pass")] | length' "$QA_FILE" 2>/dev/null || echo "0")
+      FIXED_COUNT=$(jq '[.scenarios[] | select(.result == "fail-fixed")] | length' "$QA_FILE" 2>/dev/null || echo "0")
+      BLOCKED_COUNT=$(jq '[.scenarios[] | select(.result == "blocked")] | length' "$QA_FILE" 2>/dev/null || echo "0")
+      SKIPPED_COUNT=$(jq '[.scenarios[] | select(.result == "skipped")] | length' "$QA_FILE" 2>/dev/null || echo "0")
+      echo ""
+      echo "Summary:"
+      echo "  Pass:    $PASS_COUNT"
+      echo "  Fixed:   $FIXED_COUNT"
+      echo "  Blocked: $BLOCKED_COUNT"
+      echo "  Skipped: $SKIPPED_COUNT"
+    fi
+
+    exit 0
+  fi
+
+  rm -f "$OUTPUT_FILE"
+
+  echo "" >> "$PROGRESS_FILE"
+  echo "---" >> "$PROGRESS_FILE"
+  echo "" >> "$PROGRESS_FILE"
+
+  sleep 2
+done
+
+echo ""
+echo -e "${YELLOW}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
+echo -e "${YELLOW}  Max iterations reached ($MAX_ITERATIONS).${NC}"
+echo -e
"${YELLOW}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
+echo ""
+echo "Check $QA_FILE for incomplete scenarios and $PROGRESS_FILE for blockers."
+exit 1
diff --git a/plugins/eng/skills/qa/scripts/validate-qa.ts b/plugins/eng/skills/qa/scripts/validate-qa.ts
new file mode 100644
index 00000000..746d5dfa
--- /dev/null
+++ b/plugins/eng/skills/qa/scripts/validate-qa.ts
@@ -0,0 +1,299 @@
+#!/usr/bin/env bun
+
+// qa.json schema validator — zero dependencies.
+// Usage: bun validate-qa.ts <path/to/qa.json>
+// Exit 0: valid | Exit 1: invalid or error
+//
+// Used in two places:
+// 1. Orchestrator — validate qa.json before crafting the prompt
+// 2. qa.sh — validate qa.json integrity after each iteration
+
+import { readFileSync } from 'fs';
+
+// --- Type definitions ---
+
+const VALID_CATEGORIES = [
+  'visual',
+  'e2e-flow',
+  'error-state',
+  'edge-case',
+  'integration',
+  'api-contract',
+  'usability',
+  'failure-mode',
+] as const;
+
+const VALID_TOOLS = ['browser', 'browser-inspection', 'cli', 'macos'] as const;
+
+const VALID_RESULTS = ['pass', 'fail-fixed', 'blocked', 'skipped', ''] as const;
+
+interface TestScenario {
+  id: string;
+  title: string;
+  description: string;
+  category: string;
+  tools: string[];
+  successCriteria: string[];
+  priority: number;
+  passes: boolean;
+  result: string;
+  notes: string;
+}
+
+interface QaJson {
+  project: string;
+  branchName: string;
+  description: string;
+  testContext: string;
+  availableTools: string[];
+  scenarios: TestScenario[];
+}
+
+// --- Validation helpers ---
+
+function requireString(obj: Record<string, unknown>, field: string, label: string): string | null {
+  const val = obj[field];
+  if (typeof val !== 'string') return `${label}: must be a string (got ${typeof val})`;
+  if (val.length === 0) return `${label}: must not be empty`;
+  return null;
+}
+
+function validateScenario(raw: unknown, index: number): string[] {
+  const errors: string[] = [];
+  const prefix = `scenarios[${index}]`;
+
+  if (typeof raw !== 'object' || raw
=== null || Array.isArray(raw)) {
+    return [`${prefix}: must be an object`];
+  }
+
+  const scenario = raw as Record<string, unknown>;
+
+  // id — must match QA-NNN
+  if (typeof scenario.id !== 'string') {
+    errors.push(`${prefix}.id: must be a string (got ${typeof scenario.id})`);
+  } else if (!/^QA-\d{3}$/.test(scenario.id)) {
+    errors.push(`${prefix}.id: must match QA-NNN format (e.g., QA-001), got "${scenario.id}"`);
+  }
+
+  // Required string fields
+  for (const field of ['title', 'description'] as const) {
+    const err = requireString(scenario, field, `${prefix}.${field}`);
+    if (err) errors.push(err);
+  }
+
+  // category — must be a valid enum value
+  if (typeof scenario.category !== 'string') {
+    errors.push(`${prefix}.category: must be a string (got ${typeof scenario.category})`);
+  } else if (!(VALID_CATEGORIES as readonly string[]).includes(scenario.category)) {
+    errors.push(
+      `${prefix}.category: must be one of [${VALID_CATEGORIES.join(', ')}], got "${scenario.category}"`,
+    );
+  }
+
+  // tools — non-empty array of valid tool strings
+  if (!Array.isArray(scenario.tools)) {
+    errors.push(`${prefix}.tools: must be an array`);
+  } else if (scenario.tools.length === 0) {
+    errors.push(`${prefix}.tools: must have at least one tool`);
+  } else {
+    for (let i = 0; i < scenario.tools.length; i++) {
+      const t = scenario.tools[i];
+      if (typeof t !== 'string') {
+        errors.push(`${prefix}.tools[${i}]: must be a string`);
+      } else if (!(VALID_TOOLS as readonly string[]).includes(t)) {
+        errors.push(
+          `${prefix}.tools[${i}]: must be one of [${VALID_TOOLS.join(', ')}], got "${t}"`,
+        );
+      }
+    }
+  }
+
+  // successCriteria — non-empty array of non-empty strings
+  if (!Array.isArray(scenario.successCriteria)) {
+    errors.push(`${prefix}.successCriteria: must be an array`);
+  } else if (scenario.successCriteria.length === 0) {
+    errors.push(`${prefix}.successCriteria: must have at least one criterion`);
+  } else {
+    for (let i = 0; i < scenario.successCriteria.length; i++) {
+      const c =
scenario.successCriteria[i];
+      if (typeof c !== 'string') {
+        errors.push(`${prefix}.successCriteria[${i}]: must be a string`);
+      } else if (c.length === 0) {
+        errors.push(`${prefix}.successCriteria[${i}]: must not be empty`);
+      }
+    }
+  }
+
+  // priority — positive integer
+  if (typeof scenario.priority !== 'number') {
+    errors.push(`${prefix}.priority: must be a number (got ${typeof scenario.priority})`);
+  } else if (!Number.isInteger(scenario.priority) || scenario.priority < 1) {
+    errors.push(`${prefix}.priority: must be a positive integer (got ${scenario.priority})`);
+  }
+
+  // passes — boolean
+  if (typeof scenario.passes !== 'boolean') {
+    errors.push(`${prefix}.passes: must be a boolean (got ${typeof scenario.passes})`);
+  }
+
+  // result — must be a valid enum value
+  if (typeof scenario.result !== 'string') {
+    errors.push(`${prefix}.result: must be a string (got ${typeof scenario.result})`);
+  } else if (!(VALID_RESULTS as readonly string[]).includes(scenario.result)) {
+    errors.push(
+      `${prefix}.result: must be one of [${VALID_RESULTS.map((r) => r || '""').join(', ')}], got "${scenario.result}"`,
+    );
+  }
+
+  // notes — must be a string (can be empty)
+  if (typeof scenario.notes !== 'string') {
+    errors.push(`${prefix}.notes: must be a string (got ${typeof scenario.notes})`);
+  }
+
+  return errors;
+}
+
+function validateSchema(json: unknown): { qa: QaJson | null; errors: string[] } {
+  const errors: string[] = [];
+
+  if (typeof json !== 'object' || json === null || Array.isArray(json)) {
+    return { qa: null, errors: ['Root: must be a JSON object'] };
+  }
+
+  const obj = json as Record<string, unknown>;
+
+  // Required top-level string fields
+  for (const field of ['project', 'branchName', 'description', 'testContext'] as const) {
+    const err = requireString(obj, field, field);
+    if (err) errors.push(err);
+  }
+
+  // availableTools — non-empty array of valid tool strings
+  if (!Array.isArray(obj.availableTools)) {
+    errors.push('availableTools: must be an array');
+ 
} else if (obj.availableTools.length === 0) { + errors.push('availableTools: must have at least one tool'); + } else { + for (let i = 0; i < obj.availableTools.length; i++) { + const t = obj.availableTools[i]; + if (typeof t !== 'string') { + errors.push(`availableTools[${i}]: must be a string`); + } else if (!(VALID_TOOLS as readonly string[]).includes(t)) { + errors.push( + `availableTools[${i}]: must be one of [${VALID_TOOLS.join(', ')}], got "${t}"`, + ); + } + } + } + + // scenarios — non-empty array + if (!Array.isArray(obj.scenarios)) { + errors.push('scenarios: must be an array'); + } else if (obj.scenarios.length === 0) { + errors.push('scenarios: must have at least one test scenario'); + } else { + for (let i = 0; i < obj.scenarios.length; i++) { + errors.push(...validateScenario(obj.scenarios[i], i)); + } + } + + if (errors.length > 0) return { qa: null, errors }; + return { qa: json as unknown as QaJson, errors: [] }; +} + +// --- Semantic checks (beyond structural schema) --- + +function semanticChecks(qa: QaJson): string[] { + const errors: string[] = []; + + const ids = qa.scenarios.map((s) => s.id); + const priorities = qa.scenarios.map((s) => s.priority).sort((a, b) => a - b); + + // Duplicate IDs + const idSet = new Set(ids); + if (idSet.size !== ids.length) { + const dupes = ids.filter((id, i) => ids.indexOf(id) !== i); + errors.push(`Duplicate scenario IDs: ${[...new Set(dupes)].join(', ')}`); + } + + // Duplicate priorities + const prioritySet = new Set(priorities); + if (prioritySet.size !== priorities.length) { + const dupes = priorities.filter((p, i) => priorities.indexOf(p) !== i); + errors.push(`Duplicate priorities: ${[...new Set(dupes)].join(', ')}`); + } + + // Sequential priorities (1, 2, 3, ..., N) — only check if no duplicates + if (prioritySet.size === priorities.length) { + for (let i = 0; i < priorities.length; i++) { + if (priorities[i] !== i + 1) { + errors.push( + `Non-sequential priorities: expected ${i + 1} at position ${i}, 
found ${priorities[i]}. Priorities must be sequential starting from 1.`,
+        );
+        break;
+      }
+    }
+  }
+
+  // Every scenario's tools must be a subset of availableTools
+  for (const scenario of qa.scenarios) {
+    for (const tool of scenario.tools) {
+      if (!qa.availableTools.includes(tool)) {
+        errors.push(
+          `${scenario.id}: requires tool "${tool}" which is not in availableTools [${qa.availableTools.join(', ')}]`,
+        );
+      }
+    }
+  }
+
+  return errors;
+}
+
+// --- CLI ---
+
+const filePath = process.argv[2];
+if (!filePath) {
+  console.error('Usage: validate-qa.ts <path/to/qa.json>');
+  process.exit(1);
+}
+
+try {
+  const content = readFileSync(filePath, 'utf-8');
+
+  let json: unknown;
+  try {
+    json = JSON.parse(content);
+  } catch {
+    console.error(`Error: ${filePath} is not valid JSON`);
+    process.exit(1);
+  }
+
+  const { qa, errors: schemaErrors } = validateSchema(json);
+
+  if (schemaErrors.length > 0) {
+    console.error('Schema validation failed:');
+    for (const err of schemaErrors) {
+      console.error(`  ${err}`);
+    }
+    process.exit(1);
+  }
+
+  const semanticErrors = semanticChecks(qa!);
+  if (semanticErrors.length > 0) {
+    console.error('Semantic validation failed:');
+    for (const err of semanticErrors) {
+      console.error(`  ${err}`);
+    }
+    process.exit(1);
+  }
+
+  // Status summary on success
+  const total = qa!.scenarios.length;
+  const passing = qa!.scenarios.filter((s) => s.passes).length;
+  console.log(`qa.json valid. Scenarios: ${passing}/${total} passing.`);
+  process.exit(0);
+} catch (e: unknown) {
+  const message = e instanceof Error ?
e.message : String(e);
+  console.error(`Error reading ${filePath}: ${message}`);
+  process.exit(1);
+}
diff --git a/plugins/eng/skills/qa/templates/qa-prompt.template.md b/plugins/eng/skills/qa/templates/qa-prompt.template.md
new file mode 100644
index 00000000..44f2cc02
--- /dev/null
+++ b/plugins/eng/skills/qa/templates/qa-prompt.template.md
@@ -0,0 +1,261 @@
+Use when: Phase 3 — `qa.sh` assembles the QA iteration prompt into `tmp/ship/qa-prompt.md`
+Priority: P0
+Impact: Without this template, the iteration prompt must be reconstructed from memory — risk of missing critical sections (completion signal, testing methodology, progress format, tool-skip handling)
+
+---
+
+# QA Prompt Template
+
+This file contains two complete prompt variants for the QA iteration agents. The `qa.sh` script selects the correct variant based on `--spec-path`, fills all `{{PLACEHOLDERS}}`, and saves the result to `tmp/ship/qa-prompt.md`.
+
+**Do NOT include both variants in the saved prompt.** The iteration agent sees a single, unconditional workflow — never both variants, never conditional logic.
+
+## Placeholders
+
+| Placeholder | Source | Used in |
+|---|---|---|
+| `{{SPEC_PATH}}` | Path to SPEC.md — from `--spec-path` argument, injected by qa.sh | Variant A only |
+| `{{CODEBASE_CONTEXT}}` | From qa.json `codebaseContext` — injected by qa.sh | Both |
+| `{{TEST_CONTEXT}}` | From qa.json `testContext` — injected by qa.sh | Both |
+| `{{TESTING_GUIDANCE}}` | Full content of `references/testing-guidance.md` — injected by qa.sh | Both |
+| `{{DIFF}}` | Cleaned git diff (full if small, stat tree if large) — injected by qa.sh each iteration | Both |
+
+---
+
+## Variant A — SPEC.md available
+
+Use when a spec path is known. Copy everything from the horizontal rule below through "End of Variant A", fill all placeholders, and save.
+
+---
+
+You are a QA engineer executing test scenarios from a structured test plan.
Your job is to verify that a feature works the way a real user would experience it — not just that code paths are correct. + +### 1. Read the SPEC + +**FIRST ACTION — do this before anything else.** + +Read the SPEC.md at `{{SPEC_PATH}}`. This is your primary reference for understanding what was built — acceptance criteria, architecture decisions, non-goals, and design constraints. + +### 2. Check status + +Read `tmp/ship/qa.json` for test scenarios and their completion status. Note the `availableTools` field — this tells you what testing capabilities you have. Do not probe for tools yourself; trust what the orchestrator detected. + +Read `tmp/ship/qa-progress.txt` for learnings from previous iterations. + +### 3. Select scenario + +Select the highest-priority incomplete scenario (`passes: false`). + +If the scenario requires tools not listed in `availableTools`: +1. Set `result: "skipped"` and `passes: true` in qa.json +2. Set `notes` explaining what tool was needed and why it couldn't be tested +3. Log the skip to qa-progress.txt +4. Move to the next scenario + +### 4. Ensure system running + +Follow the test context instructions to ensure the system under test is running: + +{{TEST_CONTEXT}} + +If the system cannot be started, set `result: "blocked"` on the current scenario with details in `notes`. + +### 5. Execute test + +Run the scenario using the strongest available tools. Follow `successCriteria` exactly — each criterion is a pass/fail checkpoint. + +**Codebase context:** + +{{CODEBASE_CONTEXT}} + +**Changes under test:** + +{{DIFF}} + +**Testing methodology:** + +{{TESTING_GUIDANCE}} + +### 6. 
Record result
+
+Update `tmp/ship/qa.json` for the completed scenario:
+
+| Outcome | Set |
+|---|---|
+| All successCriteria met | `passes: true`, `result: "pass"`, `notes: ""` |
+| Bug found and fixed | `passes: true`, `result: "fail-fixed"`, `notes`: brief description of the bug and fix |
+| Bug found, cannot fix | `passes: false`, `result: "blocked"`, `notes`: blocker description |
+| Tool not available | `passes: true`, `result: "skipped"`, `notes`: which tool was missing and why |
+
+Update qa.json **before** committing so progress is captured even if context runs out.
+
+### 7. If bug found and fixable
+
+1. Fix the bug
+2. Re-verify the successCriteria pass
+3. Commit with message format: `[qa-fix] description of fix`
+4. Do NOT commit files in `tmp/` — they are gitignored execution state
+
+### 8. Log progress
+
+Append to `tmp/ship/qa-progress.txt` using this format:
+
+```
+## Iteration N - [timestamp]
+
+### Scenario: [scenario-id] - [title]
+
+**Result:** [pass | fail-fixed | blocked | skipped]
+
+**What was tested:**
+- [actions taken]
+
+**Findings:**
+- [what was observed]
+
+**Learnings:**
+- [patterns discovered]
+- [gotchas encountered]
+- [insights for future iterations]
+
+---
+```
+
+### 9. If stuck
+
+If stuck on a scenario:
+1. Set `notes` on that scenario in qa.json with the blocker description
+2. Set `result: "blocked"` (keep `passes: false`)
+3. Log the blocker to qa-progress.txt
+4. Move to the next scenario
+
+### 10. Completion check
+
+When ALL scenarios have `passes: true`, output exactly:
+
+QA COMPLETE
+
+Output this ONLY when the statement is genuinely true. Do not output false promises.
+
+**CRITICAL:** NEVER mention, quote, or reference the completion signal string anywhere in your output except when actually signaling completion. Do not write "when done, I will output QA COMPLETE" or similar in qa-progress.txt or in conversation. Detection is literal string matching — any occurrence terminates the loop.
+ +--- + +*End of Variant A* + +--- + +## Variant B — No SPEC.md + +Use when only qa.json is available. The `description` and `testContext` fields are the sole context. Copy everything from the horizontal rule below through "End of Variant B", fill all placeholders, and save. + +--- + +You are a QA engineer executing test scenarios from a structured test plan. Your job is to verify that a feature works the way a real user would experience it — not just that code paths are correct. + +### 1. Check status + +Read `tmp/ship/qa.json` for test scenarios, their completion status, and the `availableTools` field. The `description` field tells you what's being tested. Do not probe for tools yourself; trust what the orchestrator detected. + +Read `tmp/ship/qa-progress.txt` for learnings from previous iterations. + +### 2. Select scenario + +Select the highest-priority incomplete scenario (`passes: false`). + +If the scenario requires tools not listed in `availableTools`: +1. Set `result: "skipped"` and `passes: true` in qa.json +2. Set `notes` explaining what tool was needed and why it couldn't be tested +3. Log the skip to qa-progress.txt +4. Move to the next scenario + +### 3. Ensure system running + +Follow the test context instructions to ensure the system under test is running: + +{{TEST_CONTEXT}} + +If the system cannot be started, set `result: "blocked"` on the current scenario with details in `notes`. + +### 4. Execute test + +Run the scenario using the strongest available tools. Follow `successCriteria` exactly — each criterion is a pass/fail checkpoint. + +**Codebase context:** + +{{CODEBASE_CONTEXT}} + +**Changes under test:** + +{{DIFF}} + +**Testing methodology:** + +{{TESTING_GUIDANCE}} + +### 5. 
Record result
+
+Update `tmp/ship/qa.json` for the completed scenario:
+
+| Outcome | Set |
+|---|---|
+| All successCriteria met | `passes: true`, `result: "pass"`, `notes: ""` |
+| Bug found and fixed | `passes: true`, `result: "fail-fixed"`, `notes`: brief description of the bug and fix |
+| Bug found, cannot fix | `passes: false`, `result: "blocked"`, `notes`: blocker description |
+| Tool not available | `passes: true`, `result: "skipped"`, `notes`: which tool was missing and why |
+
+Update qa.json **before** committing so progress is captured even if context runs out.
+
+### 6. If bug found and fixable
+
+1. Fix the bug
+2. Re-verify the successCriteria pass
+3. Commit with message format: `[qa-fix] description of fix`
+4. Do NOT commit files in `tmp/` — they are gitignored execution state
+
+### 7. Log progress
+
+Append to `tmp/ship/qa-progress.txt` using this format:
+
+```
+## Iteration N - [timestamp]
+
+### Scenario: [scenario-id] - [title]
+
+**Result:** [pass | fail-fixed | blocked | skipped]
+
+**What was tested:**
+- [actions taken]
+
+**Findings:**
+- [what was observed]
+
+**Learnings:**
+- [patterns discovered]
+- [gotchas encountered]
+- [insights for future iterations]
+
+---
+```
+
+### 8. If stuck
+
+If stuck on a scenario:
+1. Set `notes` on that scenario in qa.json with the blocker description
+2. Set `result: "blocked"` (keep `passes: false`)
+3. Log the blocker to qa-progress.txt
+4. Move to the next scenario
+
+### 9. Completion check
+
+When ALL scenarios have `passes: true`, output exactly:
+
+QA COMPLETE
+
+Output this ONLY when the statement is genuinely true. Do not output false promises.
+
+**CRITICAL:** NEVER mention, quote, or reference the completion signal string anywhere in your output except when actually signaling completion. Do not write "when done, I will output QA COMPLETE" or similar in qa-progress.txt or in conversation. Detection is literal string matching — any occurrence terminates the loop.
+ +--- + +*End of Variant B* diff --git a/plugins/eng/skills/ship/SKILL.md b/plugins/eng/skills/ship/SKILL.md index 48e34bdb..697f5112 100644 --- a/plugins/eng/skills/ship/SKILL.md +++ b/plugins/eng/skills/ship/SKILL.md @@ -58,6 +58,8 @@ All execution state lives in `tmp/ship/` (gitignored). The only committed artifa | `tmp/ship/last-prompt.md` | Last re-injection prompt — the full prompt the stop hook constructed on its most recent re-entry, for debugging | Stop hook | Each re-entry (overwritten) | Debugging only | | `tmp/ship/spec.json` | User stories — acceptance criteria, priority, pass/fail status | Phase 2 (/implement) | Each iteration (sets `passes: true`) | implement.sh, iterations, Ship | | `tmp/ship/progress.txt` | Iteration log — what was done, learnings, blockers | Phase 2 start (implement.sh) | Each iteration (append) | Iterations, Ship | +| `tmp/ship/qa.json` | Test scenarios — success criteria, tools, priority, pass/fail status, results | Phase 3 (/qa) | Each QA iteration (sets `passes`, `result`, `notes`) | qa.sh, QA iterations, Ship | +| `tmp/ship/qa-progress.txt` | QA iteration log — what was tested, findings, learnings, blockers | Phase 3 start (qa.sh) | Each QA iteration (append) | QA iterations, Ship | | SPEC.md *(committed)* | Product + tech spec — requirements, design, decisions, non-goals | Phase 1 (/spec or user) | Phase 1 only | All phases, iterations | ### When to update what @@ -66,9 +68,11 @@ All execution state lives in `tmp/ship/` (gitignored). 
The only committed artifa |---|---|---| | **Phase 1 end** | **Run** `ship-init-state.sh` — creates both `state.json` and `loop.md` (see Phase 1, Step 3) | — | | **Phase 2 start** | — | `/implement` creates `tmp/ship/spec.json`, `tmp/ship/implement-prompt.md`, `tmp/ship/progress.txt` | +| **Phase 3 start** | — | `/qa` creates `tmp/ship/qa.json`, `tmp/ship/qa-prompt.md`, `tmp/ship/qa-progress.txt` | | **Any phase → next** | Set `currentPhase` to next phase, append completed phase to `completedPhases`, refresh `lastUpdated` | — | | **User amendment** (any phase) | Append to `amendments[]`: `{"description": "...", "status": "pending"}` | — | | **Iteration completes a story** | — | `tmp/ship/spec.json`: set story `passes: true`. `tmp/ship/progress.txt`: append iteration log. | +| **QA iteration completes a scenario** | — | `tmp/ship/qa.json`: set scenario `passes`, `result`, `notes`. `tmp/ship/qa-progress.txt`: append iteration log. | | **PR created** (after Phase 2) | Set `prNumber` | Draft PR created on GitHub | | **Phase 6 → completed** | Set `currentPhase: "completed"` | Stop hook deletes `loop.md` | | **Stop hook re-entry** | — | `loop.md`: iteration incremented. Prompt re-injected from `state.json` + SKILL.md. | @@ -289,14 +293,17 @@ After Phase 2 completes and before entering Phase 3. Do not update `currentPhase ### Phase 3: Testing -Load `/qa` skill with the SPEC.md path (or PR number if no spec). `/qa` handles the full manual QA lifecycle: tool detection, test plan derivation, execution with available tools (browser, macOS, bash), result recording, and gap documentation. +Load `/qa` skill with the SPEC.md path (or PR number if no spec). `/qa` runs a subprocess iteration loop — it generates `qa.json` (structured test plan with scenarios and success criteria), crafts an iteration prompt, and executes `qa.sh` which spawns fresh Claude Code subprocesses per scenario. 
-If scope calibration indicated a lightweight scope (bug fix / config change), pass that context so `/qa` calibrates depth accordingly. +Pass the scope calibration context so `/qa` calibrates scenario depth accordingly (fewer scenarios for bug fixes, full coverage for features). + +`/qa` handles the full lifecycle: tool detection (once, stored in qa.json), scenario derivation, subprocess execution, result recording, bug fixing, and gap documentation. Results are tracked in `tmp/ship/qa.json` and `tmp/ship/qa-progress.txt`. **Phase 3 exit gate — verify before proceeding to Phase 4:** -- [ ] `/qa` complete: has run, fixed what it could, and documented results. Remaining gaps and unresolvable issues are documented — they do not block Phase 4. -- [ ] If `/qa` made any code changes: re-run quality gates (test suite, typecheck, lint) and verify green. `/qa` fixes bugs it finds — you own verification that those fixes don't break anything else. +- [ ] `/qa` complete: qa.json shows all scenarios with `passes: true`. Check results: how many passed, how many were fixed, how many blocked/skipped. +- [ ] If `/qa` made any code changes (bug fixes via `[qa-fix]` commits): re-run quality gates (test suite, typecheck, lint) and verify green. `/qa` fixes bugs it finds — you own verification that those fixes don't break anything else. +- [ ] Remaining gaps and unresolvable issues are documented in qa.json `notes` fields — they do not block Phase 4. 
- [ ] You can explain the implementation to another engineer: what was tested, what edge cases exist, how they are handled --- @@ -429,7 +436,7 @@ These govern your behavior throughout: | Path | Use when | Impact if skipped | |---|---|---| | `/implement` skill | Converting spec, crafting prompt, and executing the iteration loop (Phase 2) | Missing spec.json, no implementation prompt, no automated execution | -| `/qa` skill | Manual QA verification with available tools (Phase 3) | User-facing bugs missed, visual issues, broken UX flows, undocumented gaps | +| `/qa` skill | QA subprocess loop — generates qa.json, crafts iteration prompt, runs qa.sh (Phase 3). Results in `tmp/ship/qa.json` and `tmp/ship/qa-progress.txt` | User-facing bugs missed, visual issues, broken UX flows, undocumented gaps | | `/pr` skill | Writing the full PR body (after Phase 3) and updating it after subsequent phases | Inconsistent PR body, missing sections, stale description | | `/docs` skill | Writing or updating documentation — product + internal surface areas (Phase 4) | Docs not written, wrong format, missed documentation surfaces, mismatched with project conventions | | `/review` skill | Running the push → review → fix → CI/CD loop (Phase 5) | Missed feedback, unresolved threads, mechanical response to reviews, CI/CD failures not investigated |
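The qa.json contract that `validate-qa.ts` enforces is easiest to see with a concrete instance. Below is a minimal sketch of an object that would pass both the structural schema and the semantic checks, plus a standalone restatement of the sequential-priority rule. All field values are hypothetical, not taken from a real run.

```typescript
// Hypothetical qa.json instance: every field value below is illustrative.
const qa = {
  project: "example-app",
  branchName: "feat/qa-loop",
  description: "Adds the QA subprocess iteration loop",
  testContext: "Start the dev server, then test against http://localhost:3000",
  availableTools: ["browser", "cli"],
  scenarios: [
    {
      id: "QA-001", title: "Happy path", description: "Core flow works end to end",
      category: "e2e-flow", tools: ["browser"], successCriteria: ["Page renders without errors"],
      priority: 1, passes: false, result: "", notes: "",
    },
    {
      id: "QA-002", title: "Bad input rejected", description: "CLI rejects malformed input",
      category: "error-state", tools: ["cli"], successCriteria: ["Exit code is non-zero"],
      priority: 2, passes: false, result: "", notes: "",
    },
  ],
};

// Semantic rule from the validator: priorities must be exactly 1..N,
// with no gaps and no duplicates.
function prioritiesSequential(priorities: number[]): boolean {
  const sorted = [...priorities].sort((a, b) => a - b);
  return sorted.every((p, i) => p === i + 1);
}

console.log(prioritiesSequential(qa.scenarios.map((s) => s.priority))); // true
console.log(prioritiesSequential([1, 3]));    // false (gap: no priority 2)
console.log(prioritiesSequential([1, 1, 2])); // false (duplicate priority 1)
```

Note that the validator also cross-checks each scenario's `tools` against `availableTools`, so a scenario listing `"macos"` in this sketch would fail validation even though `"macos"` is a valid tool name.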