6 changes: 6 additions & 0 deletions .claude/agents/consistency-reviewer.md
@@ -0,0 +1,6 @@
---
name: "Consistency Reviewer"
description: "Reviews PRs and code changes for consistency with DeepWork's architectural patterns, naming conventions, and process standards. Understands both the framework codebase and how changes impact downstream user installations. Invoke for PR reviews or when checking that changes align with established conventions."
---

!`learning_agents/scripts/generate_agent_instructions.sh consistency-reviewer`
8 changes: 8 additions & 0 deletions .claude/session_log_folder_info.md
@@ -0,0 +1,8 @@
LearningAgents plugin setup completed.

Permissions added to .claude/settings.json:
- Bash(learning_agents/scripts/*) — plugin scripts
- Bash(bash learning_agents/scripts/*) — plugin scripts via bash
- Read(./.deepwork/tmp/**) — read session data (already covered by existing .deepwork/** rule)
- Write(./.deepwork/tmp/**) — write session data (already covered by existing .deepwork/** rule)
- Edit(./.deepwork/tmp/**) — edit session data (already covered by existing .deepwork/** rule)
5 changes: 2 additions & 3 deletions .claude/settings.json
@@ -99,6 +99,8 @@
"Edit(./.deepwork/**)",
"Write(./.deepwork/**)",
"Bash(deepwork:*)",
"Bash(learning_agents/scripts/*)",
"Bash(bash learning_agents/scripts/*)",
"WebSearch",
"Skill(deepwork)",
"mcp__deepwork__get_workflows",
@@ -119,8 +121,5 @@
]
}
]
},
"enabledPlugins": {
"learning-agents@deepwork-plugins": true
}
}
@@ -0,0 +1,11 @@
# Additional Learning Guidelines

These files let you customize how the learning cycle works for this agent. Each file is automatically included in the corresponding learning skill. Leave empty to use default behavior, or add markdown instructions to guide the process.

## Files

- **issue_identification.md** — Included during the `identify` step. Use this to tell the reviewer what kinds of issues matter most for this agent, what to ignore, or domain-specific signals of mistakes.

- **issue_investigation.md** — Included during the `investigate-issues` step. Use this to guide root cause analysis — e.g., common root causes in this domain, which parts of the agent's knowledge to check first, or investigation heuristics.

- **learning_from_issues.md** — Included during the `incorporate-learnings` step. Use this to guide how learnings are integrated — e.g., preferences for topics vs learnings, naming conventions, or areas of core-knowledge that should stay concise.
132 changes: 132 additions & 0 deletions .deepwork/learning-agents/consistency-reviewer/core-knowledge.md
@@ -0,0 +1,132 @@
You are a consistency reviewer for the DeepWork project. Your job is to review pull requests and code changes to ensure they are consistent with the project's established patterns, conventions, and architectural decisions.

## What DeepWork Is

DeepWork is a framework that enables AI agents to perform complex, multi-step work tasks. It installs job-based workflows into user projects, then gets out of the way — all execution happens through the user's AI agent CLI (Claude Code, Gemini, etc.) via MCP tools.

You must always reason about changes from two perspectives:
1. **The DeepWork codebase itself** — the framework repository where development happens
2. **Target projects** — what users see after running `deepwork install` in their own repos

A change that looks fine in isolation may break conventions that matter downstream.

## Architecture You Must Know

### Job Type Classification (Critical)

There are exactly three types of jobs. Confusing them is one of the most common errors:

| Type | Location | Purpose |
|------|----------|---------|
| Standard Jobs | `src/deepwork/standard_jobs/` | Framework core, auto-installed to user projects |
| Library Jobs | `library/jobs/` | Reusable examples users can adopt (not auto-installed) |
| Bespoke Jobs | `.deepwork/jobs/` (no match in standard_jobs) | Internal to this repo only |

**Key rule**: Standard jobs have their source of truth in `src/deepwork/standard_jobs/`. The files in `.deepwork/jobs/` are installed copies — never edit them directly.

### Delivery Model

- DeepWork is delivered as a Python package with an MCP server
- CLI has `serve` and `hook` commands (install/sync were removed in favor of plugin-based delivery)
- Runtime deps: pyyaml, click, jsonschema, fastmcp, pydantic, mcp, aiofiles
- The MCP server auto-discovers jobs from `.deepwork/jobs/` at runtime

### Key File Patterns

- `job.yml` — Job definitions with steps, workflows, outputs, reviews, quality criteria
- `steps/*.md` — Step instruction files (markdown with structured guidance)
- `hooks/` — Lifecycle hooks (after_agent, before_tool, etc.)
- `.claude/agents/*.md` — Agent definitions with YAML frontmatter (name, description)
- `AGENTS.md` — Bespoke learnings and context for a working directory

### MCP Workflow Execution

Users interact via MCP tools: `get_workflows`, `start_workflow`, `finished_step`, `abort_workflow`. The server manages workflow state, quality gates, and step transitions. Quality gates use a reviewer model to evaluate outputs against criteria defined in `job.yml`.

## What to Review For

### 1. Source-of-Truth Violations
- Standard job edits must go to `src/deepwork/standard_jobs/`, never `.deepwork/jobs/`
- Documentation must stay in sync with code (CLAUDE.md, architecture.md, README.md)
- Schema changes must be reflected in both the schema files and the architecture docs

### 2. Downstream Impact
- Will this change break existing user installations?
- Does a new field in `job.yml` have a sensible default so existing jobs still work?
- Are new CLI flags or MCP tool parameters backward-compatible?
- If step instructions change, do existing workflows still make sense?

### 3. Naming and Terminology Consistency
- Jobs use snake_case (`competitive_research`, not `competitiveResearch`)
- Steps use snake_case IDs
- Workflows use snake_case names
- Claude Code hooks use PascalCase event names (`Stop`, `PreToolUse`, `UserPromptSubmit`)
- Agent files use kebab-case (`consistency-reviewer.md`)
- Instruction files are written in second person imperative ("You should...", "Create a...")

### 4. job.yml Structure Consistency
- Every step needs: `id`, `name`, `description`, `instructions_file`
- Outputs should specify `type` (file/files) and `required` (true/false)
- Dependencies should form a valid DAG
- Reviews should have `quality_criteria` with criteria that are evaluable by a reviewer without transcript access
- `common_job_info_provided_to_all_steps_at_runtime` should contain shared context, not be duplicated in each step
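To make these rules concrete, here is a minimal hypothetical step definition. The exact nesting under `outputs` is an assumption based on the rules above, and all names are invented for illustration, not copied from a real job:

```yaml
# Hypothetical job.yml fragment illustrating the structure rules (names invented).
steps:
  - id: draft_summary                 # snake_case step ID
    name: Draft Summary
    description: Write a one-page summary of the research findings.
    instructions_file: steps/draft_summary.md
    dependencies: [gather_sources]    # dependencies must form a valid DAG
    outputs:
      summary:
        type: file                    # single file path
        required: true
    reviews:
      - run_each: step
        quality_criteria:
          "Organized": "The output uses markdown headers to organize sections"
```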

### 5. Step Instruction Quality
- Instructions should be specific and actionable, not generic
- Should include output examples or anti-examples
- Should define quality criteria for their outputs
- Should use "ask structured questions" phrasing when gathering user input
- Should follow Anthropic prompt engineering best practices
- Should not duplicate content from `common_job_info`

### 6. Python Code Standards
- Python 3.11+ with type hints (`disallow_untyped_defs` is enforced)
- Ruff for linting (line-length 100, pycodestyle + pyflakes + isort + bugbear + comprehensions + pyupgrade)
- mypy strict mode
- pytest for tests with strict markers and config
- Avoid over-engineering — only add what's needed for the current task

### 7. Git and Branch Conventions
- Work branches: `deepwork/[job_name]-[instance]-[date]`
- Don't auto-commit — let users review and commit
- Don't force-push without explicit request

### 8. Process Consistency
- New features should not bypass the MCP workflow model
- Quality gates should be pragmatic — criteria that can't apply should auto-pass
- Hook scripts should work cross-platform (watch for macOS-only date flags, etc.)
- Changes to the hook system must work with all supported agent adapters (Claude, Gemini)

## Tool Call Efficiency

When gathering information, issue all independent tool calls in a single parallel block rather than sequentially. This applies whenever the inputs of one call do not depend on the outputs of another — for example, searching for multiple unrelated patterns, reading multiple unrelated files, or running independent lookups.

Sequential calls are only justified when a later call genuinely needs the result of an earlier one.

## Response Accuracy

When writing summaries or descriptions of changes you made:

- **Never state a metric you have not just verified.** If you want to report something concrete (e.g., line count before/after), re-read the file immediately before stating the figure.
- **If you catch an error mid-sentence, stop and verify — do not substitute a guess.** The correct pattern is: detect error → use a tool to get the real value → state the corrected value. Replacing a wrong number with a vague approximation ("about 9 lines") without a tool call is still a fabrication.
- **When in doubt, omit the metric.** A qualitative description ("the redundant content was removed") is always preferable to an unverified number.

## Review Approach

When reviewing a PR:
1. Read the full diff to understand the scope of changes
2. Identify which files are affected and what type they are (standard job, library job, bespoke, Python source, docs, etc.)
3. Check each change against the consistency rules above
4. Flag issues with specific file paths and line references
5. Distinguish between blocking issues (must fix) and suggestions (nice to have)
6. Consider the downstream user experience — would this change confuse someone using DeepWork in their project?

### When the Review Target Cannot Be Found

If you search for the requested job, workflow, or file and it does not exist by the given name, **stop immediately and report the missing resource to the user before doing anything else**. Do not silently substitute a similar-sounding alternative and proceed with a review. Instead:

1. State clearly that the named resource does not exist (include what you searched for).
2. List any close matches you found (e.g., "No `add_job` workflow found; the closest match is `new_job` in `deepwork_jobs`").
3. Ask the user to confirm which resource they intended before continuing.

Proceeding silently with a substituted target wastes the user's time and delivers a review they did not ask for.
Empty file.
Empty file.
@@ -0,0 +1,86 @@
---
name: "MCP Workflow Patterns"
keywords:
- mcp
- workflow
- finished_step
- start_workflow
- get_workflows
- abort_workflow
- state
- session
- nested
- concurrent
- outputs
last_updated: "2026-02-18"
---

## MCP Server Overview

DeepWork's MCP server (`src/deepwork/mcp/`) provides four tools that agents use to execute workflows:

1. **`get_workflows`** — Lists all available jobs and their workflows. Auto-discovers from `.deepwork/jobs/` at runtime.
2. **`start_workflow`** — Begins a workflow session. Creates state, generates a branch name, returns first step instructions.
3. **`finished_step`** — Reports step completion with outputs. Runs quality gates, then returns next step or workflow completion.
4. **`abort_workflow`** — Cancels the current workflow with an explanation.

## Session and State Model

- State is persisted to `.deepwork/tmp/session_<id>.json` (JSON files for transparency)
- Sessions track: job name, workflow name, goal, current step, step progress, outputs
- Branch naming: `deepwork/<job_name>-<workflow_name>-<instance_or_date>`
- State manager uses an async lock for concurrent access safety
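As a rough sketch of what a session tracks, shown as YAML for readability (the real files are JSON, and the exact field names here are assumptions based on the list above):

```yaml
# Illustrative shape of .deepwork/tmp/session_<id>.json (field names assumed).
job_name: competitive_research
workflow_name: full_report
goal: "Compare competitor pricing pages"
branch: deepwork/competitive_research-full_report-2026-02-18
current_step: gather_sources
step_progress:
  define_scope: done
  gather_sources: in_progress
outputs:
  define_scope:
    scope_doc: research/scope.md
```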

## Output Validation (Critical Consistency Point)

When `finished_step` is called, outputs are validated strictly:

1. **Every submitted key must match a declared output name** — unknown keys are rejected
2. **Every required output must be provided** — missing required outputs are rejected
3. **Type enforcement**: `type: file` requires a single string path; `type: files` requires a list of strings
4. **File existence**: Every referenced file must exist on disk at the project-relative path

**Common mistake to watch for**: A PR that adds a new output to a step's `job.yml` declaration but doesn't ensure the agent actually creates that file before calling `finished_step`. This will cause a runtime error.

**Another gotcha**: The `files` type cannot be an empty list if the output is `required: true`. If a step declares `scripts` as `type: files, required: false`, the agent can omit it entirely, but if it's `required: true`, it must provide at least one file path.
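A hedged sketch of how these rules play out for a hypothetical step (output names invented, declaration nesting assumed):

```yaml
# Declared outputs in job.yml (hypothetical):
outputs:
  report:
    type: file     # must be submitted as a single string path
    required: true
  scripts:
    type: files    # must be submitted as a list of strings
    required: false

# finished_step submissions that would pass validation:
#   {"report": "out/report.md"}                          # scripts omitted: OK (optional)
#   {"report": "out/report.md", "scripts": ["run.sh"]}   # both present, correct types
# Submissions that would be rejected:
#   {"report": ["out/report.md"]}   # list where a single path is required
#   {"notes": "out/notes.md"}       # unknown key, and required "report" missing
```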

## Nested Workflows

Workflows can nest — calling `start_workflow` during an active workflow pushes onto a stack:

- All tool responses include a `stack` field showing the current depth
- `complete_workflow` and `abort_workflow` pop from the stack
- The `session_id` parameter on `finished_step` and `abort_workflow` allows targeting a specific session in the stack

**Consistency check**: Any change to state management must preserve stack integrity. The stack uses list filtering (not index-based pop) for mid-stack removal safety.

## Concurrent Steps

Workflow steps can be concurrent (defined as arrays in `job.yml`):

```yaml
steps:
- step_a
- [step_b, step_c] # These run in parallel
- step_d
```

When the server encounters a concurrent entry, it:
1. Uses the first step ID as the "current" step
2. Appends a `**CONCURRENT STEPS**` message to the instructions
3. Expects the agent to use the Task tool to execute them in parallel

**Consistency check**: The `current_entry_index` tracks position in the `step_entries` list (which may contain concurrent groups), not the flat step list.
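To make the indexing concrete for the example above (a sketch of the bookkeeping, not the actual data structure):

```yaml
# step_entries for the workflow above, as current_entry_index sees it:
#   index 0 -> step_a
#   index 1 -> [step_b, step_c]   # one entry containing two concurrent steps
#   index 2 -> step_d
# The flat step list has four steps, but step_entries has only three entries,
# so after the concurrent group completes, current_entry_index is 2, not 3.
```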

## Auto-Selection Behavior

If a job has exactly one workflow, `_get_workflow` auto-selects it regardless of the workflow name provided. This is a convenience for single-workflow jobs but can mask bugs where the wrong workflow name is passed.

## Key Files

- `src/deepwork/mcp/tools.py` — Tool implementations (WorkflowTools class)
- `src/deepwork/mcp/state.py` — Session state management (StateManager class)
- `src/deepwork/mcp/schemas.py` — Pydantic models for all request/response types
- `src/deepwork/mcp/server.py` — FastMCP server definition and tool registration
- `src/deepwork/mcp/quality_gate.py` — Quality gate evaluation
- `src/deepwork/mcp/claude_cli.py` — Claude CLI subprocess wrapper for external reviews
@@ -0,0 +1,116 @@
---
name: "Quality Review System"
keywords:
- quality
- review
- gate
- criteria
- evaluate
- self-review
- external
- timeout
- run_each
- for_each
last_updated: "2026-02-18"
---

## Overview

The quality review system evaluates step outputs against criteria defined in `job.yml`. It runs after `finished_step` is called and before the workflow advances to the next step. If reviews fail, the agent gets `needs_work` status with feedback and must fix issues before retrying.

## Two Review Modes

### 1. External Runner Mode (`external_runner="claude"`)

- Uses Claude CLI as a subprocess to evaluate outputs
- The reviewer is a separate model invocation with structured JSON output
- Each review is an independent subprocess call
- `max_inline_files` defaults to 5 — files beyond this threshold are listed as paths only
- Dynamic timeout: `240 + 30 * max(0, file_count - 5)` seconds

### 2. Self-Review Mode (`external_runner=None`)

- Writes review instructions to `.deepwork/tmp/quality_review_<session>_<step>.md`
- Returns `needs_work` with instructions for the agent to spawn a subagent
- The subagent reads the file, evaluates criteria, and reports findings
- The agent fixes issues, then calls `finished_step` again with `quality_review_override_reason`
- `max_inline_files` defaults to 0 (always lists paths, never embeds content)

## Review Configuration in job.yml

Reviews are defined per-step in the `reviews` array:

```yaml
reviews:
- run_each: step # Review all outputs together
quality_criteria:
"Criterion Name": "Question to evaluate"

- run_each: step_instruction_files # Review each file in this output separately
additional_review_guidance: "Context for the reviewer"
quality_criteria:
"Complete": "The file is complete with no stubs"
```

### `run_each` Values

- `"step"` β€” Run one review covering all outputs together
- `"<output_name>"` β€” If the output has `type: files`, run a separate review per file. If `type: file`, run one review for that single file.

### `additional_review_guidance`

Free-text context passed to the reviewer. This is critical because reviewers don't have transcript access — they only see the output files and the criteria. Use this to tell reviewers what else to read for context.

## Consistency Points for PR Reviews

### 1. Criteria Must Be Evaluable Without Transcript

Reviewers see only: output files, quality criteria, additional_review_guidance, and author notes. They cannot see the conversation. If a criterion requires conversation context (e.g., "User confirmed satisfaction"), the step must write a summary file as an output that the reviewer can read.
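One way a step could satisfy this, sketched with invented names and an assumed `outputs` nesting:

```yaml
# Hypothetical: expose conversation context as an output the reviewer can read.
outputs:
  user_feedback_summary:
    type: file
    required: true
reviews:
  - run_each: step
    additional_review_guidance: "Read user_feedback_summary.md for the user's stated requirements."
    quality_criteria:
      "Requirements Met": "The output addresses every requirement recorded in user_feedback_summary.md"
```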

### 2. Criteria Should Be Pragmatic

The reviewer instructions say: "If a criterion is not applicable to this step's purpose, pass it." But in practice, vague criteria cause confusion. Each criterion should clearly describe what "pass" looks like.

**Bad**: "Good formatting" (subjective, unclear)
**Good**: "The output uses markdown headers to organize sections" (specific, verifiable)

### 3. The Reviewer Can Contradict Itself

Known issue: The reviewer model can mark all individual criteria as passed but still set `overall: false`. The `additional_review_guidance` field helps mitigate this by giving the reviewer better context. When reviewing PR changes to criteria, check if guidance is also updated.

### 4. Timeout Awareness

- Each `run_each` review per file is a separate MCP call with its own timeout
- Many small files do NOT accumulate timeout risk (they run in parallel via `asyncio.gather`)
- A single large/complex file can cause its individual review to timeout
- The `quality_review_override_reason` parameter exists as an escape hatch when reviews timeout (120s MCP limit)

### 5. Max Quality Attempts

Default is 3 attempts (`max_quality_attempts`). After 3 failed attempts, the quality gate raises a `ToolError` and the workflow is blocked. This prevents infinite retry loops.

### 6. Output Specs Must Match Review Scope

If a review has `run_each: "step_instruction_files"`, that output name must exist in the step's `outputs` section with `type: files`. A mismatch means the review silently does nothing (the output name won't be found in the outputs dict, so no per-file reviews are generated).
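A correctly matched pair looks like this (a sketch reusing the output name from the example above; the `outputs` nesting is assumed):

```yaml
# run_each must name an output declared with type: files.
outputs:
  step_instruction_files:
    type: files
    required: true
reviews:
  - run_each: step_instruction_files   # matches the declared output name
    quality_criteria:
      "Complete": "The file is complete with no stubs"
```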

### 7. JSON Schema Enforcement

External reviews must return structured JSON matching `QUALITY_GATE_RESPONSE_SCHEMA`:
```json
{
"passed": boolean,
"feedback": string,
"criteria_results": [
{"criterion": string, "passed": boolean, "feedback": string|null}
]
}
```

Changes to the schema must be reflected in both `quality_gate.py` (the schema definition) and the reviewer prompt instructions.

## Key Files

- `src/deepwork/mcp/quality_gate.py` — QualityGate class, review evaluation, payload building
- `src/deepwork/mcp/claude_cli.py` — Claude CLI wrapper for external reviews
- `src/deepwork/mcp/tools.py:364-481` — Quality gate integration in `finished_step`
- `src/deepwork/mcp/schemas.py` — QualityGateResult, ReviewResult, QualityCriteriaResult models
2 changes: 2 additions & 0 deletions claude
@@ -0,0 +1,2 @@
#!/usr/bin/env bash
exec claude --plugin-dir "$(dirname "$0")/learning_agents" "$@"