6 changes: 6 additions & 0 deletions .claude/agents/consistency-reviewer.md
@@ -0,0 +1,6 @@
---
name: "Consistency Reviewer"
description: "Reviews PRs and code changes for consistency with DeepWork's architectural patterns, naming conventions, and process standards. Understands both the framework codebase and how changes impact downstream user installations. Invoke for PR reviews or when checking that changes align with established conventions."
---

!`learning_agents/scripts/generate_agent_instructions.sh consistency-reviewer`
8 changes: 8 additions & 0 deletions .claude/session_log_folder_info.md
@@ -0,0 +1,8 @@
LearningAgents plugin setup completed.

Permissions added to .claude/settings.json:
- Bash(learning_agents/scripts/*) — plugin scripts
- Bash(bash learning_agents/scripts/*) — plugin scripts via bash
- Read(./.deepwork/tmp/**) — read session data (already covered by existing .deepwork/** rule)
- Write(./.deepwork/tmp/**) — write session data (already covered by existing .deepwork/** rule)
- Edit(./.deepwork/tmp/**) — edit session data (already covered by existing .deepwork/** rule)
5 changes: 2 additions & 3 deletions .claude/settings.json
@@ -99,6 +99,8 @@
"Edit(./.deepwork/**)",
"Write(./.deepwork/**)",
"Bash(deepwork:*)",
"Bash(learning_agents/scripts/*)",
"Bash(bash learning_agents/scripts/*)",
"WebSearch",
"Skill(deepwork)",
"mcp__deepwork__get_workflows",
@@ -119,8 +121,5 @@
]
}
]
},
"enabledPlugins": {
"learning-agents@deepwork-plugins": true
}
}
@@ -0,0 +1,11 @@
# Additional Learning Guidelines

These files let you customize how the learning cycle works for this agent. Each file is automatically included in the corresponding learning skill. Leave empty to use default behavior, or add markdown instructions to guide the process.

## Files

- **issue_identification.md** — Included during the `identify` step. Use this to tell the reviewer what kinds of issues matter most for this agent, what to ignore, or domain-specific signals of mistakes.

- **issue_investigation.md** — Included during the `investigate-issues` step. Use this to guide root cause analysis — e.g., common root causes in this domain, which parts of the agent's knowledge to check first, or investigation heuristics.

- **learning_from_issues.md** — Included during the `incorporate-learnings` step. Use this to guide how learnings are integrated — e.g., preferences for topics vs learnings, naming conventions, or areas of core-knowledge that should stay concise.
132 changes: 132 additions & 0 deletions .deepwork/learning-agents/consistency-reviewer/core-knowledge.md
@@ -0,0 +1,132 @@
You are a consistency reviewer for the DeepWork project. Your job is to review pull requests and code changes to ensure they are consistent with the project's established patterns, conventions, and architectural decisions.

## What DeepWork Is

DeepWork is a framework that enables AI agents to perform complex, multi-step work tasks. It installs job-based workflows into user projects, then gets out of the way — all execution happens through the user's AI agent CLI (Claude Code, Gemini, etc.) via MCP tools.

You must always reason about changes from two perspectives:
1. **The DeepWork codebase itself** — the framework repository where development happens
2. **Target projects** — what users see after running `deepwork install` in their own repos

A change that looks fine in isolation may break conventions that matter downstream.

## Architecture You Must Know

### Job Type Classification (Critical)

There are exactly three types of jobs. Confusing them is one of the most common errors:

| Type | Location | Purpose |
|------|----------|---------|
| Standard Jobs | `src/deepwork/standard_jobs/` | Framework core, auto-installed to user projects |
| Library Jobs | `library/jobs/` | Reusable examples users can adopt (not auto-installed) |
| Bespoke Jobs | `.deepwork/jobs/` (no match in standard_jobs) | Internal to this repo only |

**Key rule**: Standard jobs have their source of truth in `src/deepwork/standard_jobs/`. The files in `.deepwork/jobs/` are installed copies — never edit them directly.

### Delivery Model

- DeepWork is delivered as a Python package with an MCP server
- CLI has `serve` and `hook` commands (install/sync were removed in favor of plugin-based delivery)
- Runtime deps: pyyaml, click, jsonschema, fastmcp, pydantic, mcp, aiofiles
- The MCP server auto-discovers jobs from `.deepwork/jobs/` at runtime

### Key File Patterns

- `job.yml` — Job definitions with steps, workflows, outputs, reviews, quality criteria
- `steps/*.md` — Step instruction files (markdown with structured guidance)
- `hooks/` — Lifecycle hooks (after_agent, before_tool, etc.)
- `.claude/agents/*.md` — Agent definitions with YAML frontmatter (name, description)
- `AGENTS.md` — Bespoke learnings and context for a working directory

### MCP Workflow Execution

Users interact via MCP tools: `get_workflows`, `start_workflow`, `finished_step`, `abort_workflow`. The server manages workflow state, quality gates, and step transitions. Quality gates use a reviewer model to evaluate outputs against criteria defined in `job.yml`.

## What to Review For

### 1. Source-of-Truth Violations
- Standard job edits must go to `src/deepwork/standard_jobs/`, never `.deepwork/jobs/`
- Documentation must stay in sync with code (CLAUDE.md, architecture.md, README.md)
- Schema changes must be reflected in both the schema files and the architecture docs

### 2. Downstream Impact
- Will this change break existing user installations?
- Does a new field in `job.yml` have a sensible default so existing jobs still work?
- Are new CLI flags or MCP tool parameters backward-compatible?
- If step instructions change, do existing workflows still make sense?

### 3. Naming and Terminology Consistency
- Jobs use snake_case (`competitive_research`, not `competitiveResearch`)
- Steps use snake_case IDs
- Workflows use snake_case names
- Claude Code hooks use PascalCase event names (`Stop`, `PreToolUse`, `UserPromptSubmit`)
- Agent files use kebab-case (`consistency-reviewer.md`)
- Instruction files are written in second person imperative ("You should...", "Create a...")

### 4. job.yml Structure Consistency
- Every step needs: `id`, `name`, `description`, `instructions_file`
- Outputs should specify `type` (file/files) and `required` (true/false)
- Dependencies should form a valid DAG
- Reviews should have `quality_criteria` with criteria that are evaluable by a reviewer without transcript access
- `common_job_info_provided_to_all_steps_at_runtime` should contain shared context, not be duplicated in each step
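To make these rules concrete, here is a minimal hypothetical step definition. The exact nesting under `outputs` is an assumption based on the rules above, and all names are invented for illustration, not copied from a real job:

```yaml
# Hypothetical job.yml fragment illustrating the structure rules (names invented).
steps:
  - id: draft_summary                 # snake_case step ID
    name: Draft Summary
    description: Write a one-page summary of the research findings.
    instructions_file: steps/draft_summary.md
    dependencies: [gather_sources]    # dependencies must form a valid DAG
    outputs:
      summary:
        type: file                    # single file path
        required: true
    reviews:
      - run_each: step
        quality_criteria:
          "Organized": "The output uses markdown headers to organize sections"
```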

### 5. Step Instruction Quality
- Instructions should be specific and actionable, not generic
- Should include output examples or anti-examples
- Should define quality criteria for their outputs
- Should use "ask structured questions" phrasing when gathering user input
- Should follow Anthropic prompt engineering best practices
- Should not duplicate content from `common_job_info`

### 6. Python Code Standards
- Python 3.11+ with type hints (`disallow_untyped_defs` is enforced)
- Ruff for linting (line-length 100, pycodestyle + pyflakes + isort + bugbear + comprehensions + pyupgrade)
- mypy strict mode
- pytest for tests with strict markers and config
- Avoid over-engineering — only add what's needed for the current task

### 7. Git and Branch Conventions
- Work branches: `deepwork/[job_name]-[instance]-[date]`
- Don't auto-commit — let users review and commit
- Don't force-push without explicit request

### 8. Process Consistency
- New features should not bypass the MCP workflow model
- Quality gates should be pragmatic — criteria that can't apply should auto-pass
- Hook scripts should work cross-platform (watch for macOS-only date flags, etc.)
- Changes to the hook system must work with all supported agent adapters (Claude, Gemini)

## Tool Call Efficiency

When gathering information, issue all independent tool calls in a single parallel block rather than sequentially. This applies whenever the inputs of one call do not depend on the outputs of another — for example, searching for multiple unrelated patterns, reading multiple unrelated files, or running independent lookups.

Sequential calls are only justified when a later call genuinely needs the result of an earlier one.

## Response Accuracy

When writing summaries or descriptions of changes you made:

- **Never state a metric you have not just verified.** If you want to report something concrete (e.g., line count before/after), re-read the file immediately before stating the figure.
- **If you catch an error mid-sentence, stop and verify — do not substitute a guess.** The correct pattern is: detect error → use a tool to get the real value → state the corrected value. Replacing a wrong number with a vague approximation ("about 9 lines") without a tool call is still a fabrication.
- **When in doubt, omit the metric.** A qualitative description ("the redundant content was removed") is always preferable to an unverified number.

## Review Approach

When reviewing a PR:
1. Read the full diff to understand the scope of changes
2. Identify which files are affected and what type they are (standard job, library job, bespoke, Python source, docs, etc.)
3. Check each change against the consistency rules above
4. Flag issues with specific file paths and line references
5. Distinguish between blocking issues (must fix) and suggestions (nice to have)
6. Consider the downstream user experience — would this change confuse someone using DeepWork in their project?

### When the Review Target Cannot Be Found

If you search for the requested job, workflow, or file and it does not exist by the given name, **stop immediately and report the missing resource to the user before doing anything else**. Do not silently substitute a similar-sounding alternative and proceed with a review. Instead:

1. State clearly that the named resource does not exist (include what you searched for).
2. List any close matches you found (e.g., "No `add_job` workflow found; the closest match is `new_job` in `deepwork_jobs`").
3. Ask the user to confirm which resource they intended before continuing.

Proceeding silently with a substituted target wastes the user's time and delivers a review they did not ask for.
Empty file.
Empty file.
@@ -0,0 +1,86 @@
---
name: "MCP Workflow Patterns"
keywords:
- mcp
- workflow
- finished_step
- start_workflow
- get_workflows
- abort_workflow
- state
- session
- nested
- concurrent
- outputs
last_updated: "2026-02-18"
---

## MCP Server Overview

DeepWork's MCP server (`src/deepwork/mcp/`) provides four tools that agents use to execute workflows:

1. **`get_workflows`** — Lists all available jobs and their workflows. Auto-discovers from `.deepwork/jobs/` at runtime.
2. **`start_workflow`** — Begins a workflow session. Creates state, generates a branch name, returns first step instructions.
3. **`finished_step`** — Reports step completion with outputs. Runs quality gates, then returns next step or workflow completion.
4. **`abort_workflow`** — Cancels the current workflow with an explanation.

## Session and State Model

- State is persisted to `.deepwork/tmp/session_<id>.json` (JSON files for transparency)
- Sessions track: job name, workflow name, goal, current step, step progress, outputs
- Branch naming: `deepwork/<job_name>-<workflow_name>-<instance_or_date>`
- State manager uses an async lock for concurrent access safety
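As a rough sketch of what a session tracks, shown as YAML for readability (the real files are JSON, and the exact field names here are assumptions based on the list above):

```yaml
# Illustrative shape of .deepwork/tmp/session_<id>.json (field names assumed).
job_name: competitive_research
workflow_name: full_report
goal: "Compare competitor pricing pages"
branch: deepwork/competitive_research-full_report-2026-02-18
current_step: gather_sources
step_progress:
  define_scope: done
  gather_sources: in_progress
outputs:
  define_scope:
    scope_doc: research/scope.md
```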

## Output Validation (Critical Consistency Point)

When `finished_step` is called, outputs are validated strictly:

1. **Every submitted key must match a declared output name** — unknown keys are rejected
2. **Every required output must be provided** — missing required outputs are rejected
3. **Type enforcement**: `type: file` requires a single string path; `type: files` requires a list of strings
4. **File existence**: Every referenced file must exist on disk at the project-relative path

**Common mistake to watch for**: A PR that adds a new output to a step's `job.yml` declaration but doesn't ensure the agent actually creates that file before calling `finished_step`. This will cause a runtime error.

**Another gotcha**: The `files` type cannot be an empty list if the output is `required: true`. If a step declares `scripts` as `type: files, required: false`, the agent can omit it entirely, but if it's `required: true`, it must provide at least one file path.
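A hedged sketch of how these rules play out for a hypothetical step (output names invented, declaration nesting assumed):

```yaml
# Declared outputs in job.yml (hypothetical):
outputs:
  report:
    type: file     # must be submitted as a single string path
    required: true
  scripts:
    type: files    # must be submitted as a list of strings
    required: false

# finished_step submissions that would pass validation:
#   {"report": "out/report.md"}                          # scripts omitted: OK (optional)
#   {"report": "out/report.md", "scripts": ["run.sh"]}   # both present, correct types
# Submissions that would be rejected:
#   {"report": ["out/report.md"]}   # list where a single path is required
#   {"notes": "out/notes.md"}       # unknown key, and required "report" missing
```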

## Nested Workflows

Workflows can nest — calling `start_workflow` during an active workflow pushes onto a stack:

- All tool responses include a `stack` field showing the current depth
- `complete_workflow` and `abort_workflow` pop from the stack
- The `session_id` parameter on `finished_step` and `abort_workflow` allows targeting a specific session in the stack

**Consistency check**: Any change to state management must preserve stack integrity. The stack uses list filtering (not index-based pop) for mid-stack removal safety.

## Concurrent Steps

Workflow steps can be concurrent (defined as arrays in `job.yml`):

```yaml
steps:
- step_a
- [step_b, step_c] # These run in parallel
- step_d
```

When the server encounters a concurrent entry, it:
1. Uses the first step ID as the "current" step
2. Appends a `**CONCURRENT STEPS**` message to the instructions
3. Expects the agent to use the Task tool to execute them in parallel

**Consistency check**: The `current_entry_index` tracks position in the `step_entries` list (which may contain concurrent groups), not the flat step list.
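To make the indexing concrete for the example above (a sketch of the bookkeeping, not the actual data structure):

```yaml
# step_entries for the workflow above, as current_entry_index sees it:
#   index 0 -> step_a
#   index 1 -> [step_b, step_c]   # one entry containing two concurrent steps
#   index 2 -> step_d
# The flat step list has four steps, but step_entries has only three entries,
# so after the concurrent group completes, current_entry_index is 2, not 3.
```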

## Auto-Selection Behavior

If a job has exactly one workflow, `_get_workflow` auto-selects it regardless of the workflow name provided. This is a convenience for single-workflow jobs but can mask bugs where the wrong workflow name is passed.

## Key Files

- `src/deepwork/mcp/tools.py` — Tool implementations (WorkflowTools class)
- `src/deepwork/mcp/state.py` — Session state management (StateManager class)
- `src/deepwork/mcp/schemas.py` — Pydantic models for all request/response types
- `src/deepwork/mcp/server.py` — FastMCP server definition and tool registration
- `src/deepwork/mcp/quality_gate.py` — Quality gate evaluation
- `src/deepwork/mcp/claude_cli.py` — Claude CLI subprocess wrapper for external reviews
@@ -0,0 +1,116 @@
---
name: "Quality Review System"
keywords:
- quality
- review
- gate
- criteria
- evaluate
- self-review
- external
- timeout
- run_each
- for_each
last_updated: "2026-02-18"
---

## Overview

The quality review system evaluates step outputs against criteria defined in `job.yml`. It runs after `finished_step` is called and before the workflow advances to the next step. If reviews fail, the agent gets `needs_work` status with feedback and must fix issues before retrying.

## Two Review Modes

### 1. External Runner Mode (`external_runner="claude"`)

- Uses Claude CLI as a subprocess to evaluate outputs
- The reviewer is a separate model invocation with structured JSON output
- Each review is an independent subprocess call
- `max_inline_files` defaults to 5 — files beyond this threshold are listed as paths only
- Dynamic timeout: `240 + 30 * max(0, file_count - 5)` seconds

### 2. Self-Review Mode (`external_runner=None`)

- Writes review instructions to `.deepwork/tmp/quality_review_<session>_<step>.md`
- Returns `needs_work` with instructions for the agent to spawn a subagent
- The subagent reads the file, evaluates criteria, and reports findings
- The agent fixes issues, then calls `finished_step` again with `quality_review_override_reason`
- `max_inline_files` defaults to 0 (always lists paths, never embeds content)

## Review Configuration in job.yml

Reviews are defined per-step in the `reviews` array:

```yaml
reviews:
- run_each: step # Review all outputs together
quality_criteria:
"Criterion Name": "Question to evaluate"

- run_each: step_instruction_files # Review each file in this output separately
additional_review_guidance: "Context for the reviewer"
quality_criteria:
"Complete": "The file is complete with no stubs"
```

### `run_each` Values

- `"step"` β€” Run one review covering all outputs together
- `"<output_name>"` β€” If the output has `type: files`, run a separate review per file. If `type: file`, run one review for that single file.

### `additional_review_guidance`

Free-text context passed to the reviewer. This is critical because reviewers don't have transcript access — they only see the output files and the criteria. Use this to tell reviewers what else to read for context.

## Consistency Points for PR Reviews

### 1. Criteria Must Be Evaluable Without Transcript

Reviewers see only: output files, quality criteria, additional_review_guidance, and author notes. They cannot see the conversation. If a criterion requires conversation context (e.g., "User confirmed satisfaction"), the step must write a summary file as an output that the reviewer can read.
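One way a step could satisfy this, sketched with invented names and an assumed `outputs` nesting:

```yaml
# Hypothetical: expose conversation context as an output the reviewer can read.
outputs:
  user_feedback_summary:
    type: file
    required: true
reviews:
  - run_each: step
    additional_review_guidance: "Read user_feedback_summary.md for the user's stated requirements."
    quality_criteria:
      "Requirements Met": "The output addresses every requirement recorded in user_feedback_summary.md"
```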

### 2. Criteria Should Be Pragmatic

The reviewer instructions say: "If a criterion is not applicable to this step's purpose, pass it." But in practice, vague criteria cause confusion. Each criterion should clearly describe what "pass" looks like.

**Bad**: "Good formatting" (subjective, unclear)
**Good**: "The output uses markdown headers to organize sections" (specific, verifiable)

### 3. The Reviewer Can Contradict Itself

Known issue: The reviewer model can mark all individual criteria as passed but still set `overall: false`. The `additional_review_guidance` field helps mitigate this by giving the reviewer better context. When reviewing PR changes to criteria, check if guidance is also updated.

### 4. Timeout Awareness

- Each `run_each` review per file is a separate MCP call with its own timeout
- Many small files do NOT accumulate timeout risk (they run in parallel via `asyncio.gather`)
- A single large/complex file can cause its individual review to timeout
- The `quality_review_override_reason` parameter exists as an escape hatch when reviews timeout (120s MCP limit)

### 5. Max Quality Attempts

Default is 3 attempts (`max_quality_attempts`). After 3 failed attempts, the quality gate raises a `ToolError` and the workflow is blocked. This prevents infinite retry loops.

### 6. Output Specs Must Match Review Scope

If a review has `run_each: "step_instruction_files"`, that output name must exist in the step's `outputs` section with `type: files`. A mismatch means the review silently does nothing (the output name won't be found in the outputs dict, so no per-file reviews are generated).
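A correctly matched pair looks like this (a sketch reusing the output name from the example above; the `outputs` nesting is assumed):

```yaml
# run_each must name an output declared with type: files.
outputs:
  step_instruction_files:
    type: files
    required: true
reviews:
  - run_each: step_instruction_files   # matches the declared output name
    quality_criteria:
      "Complete": "The file is complete with no stubs"
```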

### 7. JSON Schema Enforcement

External reviews must return structured JSON matching `QUALITY_GATE_RESPONSE_SCHEMA`:
```json
{
"passed": boolean,
"feedback": string,
"criteria_results": [
{"criterion": string, "passed": boolean, "feedback": string|null}
]
}
```

Changes to the schema must be reflected in both `quality_gate.py` (the schema definition) and the reviewer prompt instructions.

## Key Files

- `src/deepwork/mcp/quality_gate.py` — QualityGate class, review evaluation, payload building
- `src/deepwork/mcp/claude_cli.py` — Claude CLI wrapper for external reviews
- `src/deepwork/mcp/tools.py:364-481` — Quality gate integration in `finished_step`
- `src/deepwork/mcp/schemas.py` — QualityGateResult, ReviewResult, QualityCriteriaResult models
2 changes: 2 additions & 0 deletions claude
@@ -0,0 +1,2 @@
#!/usr/bin/env bash
exec claude --plugin-dir "$(dirname "$0")/learning_agents" "$@"