3 changes: 2 additions & 1 deletion .claude/settings.json
Original file line number Diff line number Diff line change
@@ -120,7 +120,8 @@
"Skill(manual_tests.run_fire_tests)",
"Skill(deepwork_rules)",
"Skill(deepwork_rules.define)",
"Bash(deepwork rules clear_queue)"
"Bash(deepwork rules clear_queue)",
"Bash(rm -rf .deepwork/tmp/rules/queue/*.json)"
]
},
"hooks": {
145 changes: 96 additions & 49 deletions .claude/skills/manual_tests.run_fire_tests/SKILL.md

Large diffs are not rendered by default.

104 changes: 71 additions & 33 deletions .claude/skills/manual_tests.run_not_fire_tests/SKILL.md

Large diffs are not rendered by default.

22 changes: 19 additions & 3 deletions .claude/skills/manual_tests/SKILL.md
@@ -15,11 +15,27 @@ This job tests that rules fire when they should AND do not fire when they shouldn't
Each test is run in a SUB-AGENT (not the main agent) because:
1. Sub-agents run in isolated contexts where file changes can be detected
2. The Stop hook automatically evaluates rules when each sub-agent completes
3. The main agent can observe whether hooks fired without triggering them manually
3. Sub-agents report results via MAGIC STRINGS that the main agent checks

MAGIC STRING DETECTION: Sub-agents output:
- "TASK_START: <task name>" - ALWAYS at the start of their response
- "HOOK_FIRED: <rule name>" - If a DeepWork hook blocks them
Detection logic:
- TASK_START present + no HOOK_FIRED = hook did NOT fire
- HOOK_FIRED present = hook fired
- Neither present = timeout (hook blocking infinitely)

TIMEOUT PREVENTION: All sub-agent Task calls use max_turns: 5 to prevent
infinite hangs. If a sub-agent hits the limit (e.g., stuck in infinite block),
treat as timeout - PASSED for "should fire" tests, FAILED for "should NOT fire".

TOKEN OVERHEAD: Each sub-agent uses ~16k input tokens (system prompt + tool
definitions). This is unavoidable baseline overhead for agents with Edit access.
Sub-agent prompts include efficiency instructions to minimize additional usage.

CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the file
edits itself - it spawns sub-agents to make edits, then observes whether the hooks
fired automatically when those sub-agents returned.
edits itself - it spawns sub-agents to make edits, then checks the returned magic
strings to determine whether hooks fired.

Steps:
1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents
38 changes: 30 additions & 8 deletions .deepwork/jobs/manual_tests/job.yml
@@ -1,5 +1,5 @@
name: manual_tests
version: "1.2.1"
version: "1.3.1"
summary: "Runs all manual hook/rule tests using sub-agents. Use when validating that DeepWork rules fire correctly."
description: |
A workflow for running manual tests that validate DeepWork rules/hooks fire correctly.
@@ -8,11 +8,27 @@ description: |
Each test is run in a SUB-AGENT (not the main agent) because:
1. Sub-agents run in isolated contexts where file changes can be detected
2. The Stop hook automatically evaluates rules when each sub-agent completes
3. The main agent can observe whether hooks fired without triggering them manually
3. Sub-agents report results via MAGIC STRINGS that the main agent checks

MAGIC STRING DETECTION: Sub-agents output:
- "TASK_START: <task name>" - ALWAYS at the start of their response
- "HOOK_FIRED: <rule name>" - If a DeepWork hook blocks them
Detection logic:
- TASK_START present + no HOOK_FIRED = hook did NOT fire
- HOOK_FIRED present = hook fired
- Neither present = timeout (hook blocking infinitely)

TIMEOUT PREVENTION: All sub-agent Task calls use max_turns: 5 to prevent
infinite hangs. If a sub-agent hits the limit (e.g., stuck in infinite block),
treat as timeout - PASSED for "should fire" tests, FAILED for "should NOT fire".

TOKEN OVERHEAD: Each sub-agent uses ~16k input tokens (system prompt + tool
definitions). This is unavoidable baseline overhead for agents with Edit access.
Sub-agent prompts include efficiency instructions to minimize additional usage.

CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the file
edits itself - it spawns sub-agents to make edits, then observes whether the hooks
fired automatically when those sub-agents returned.
edits itself - it spawns sub-agents to make edits, then checks the returned magic
strings to determine whether hooks fired.

Steps:
1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents
@@ -28,6 +44,10 @@ description: |
- Created mode (new files only)

changelog:
- version: "1.3.1"
changes: "Added TOKEN OVERHEAD note explaining ~16k baseline cost; added 'Keep your response brief' efficiency instruction to sub-agent prompts"
- version: "1.3.0"
changes: "Major overhaul: Added TASK_START/HOOK_FIRED magic string detection; fixed all file names in prompts; added max_turns: 5 timeout; use deepwork rules clear_queue CLI"
- version: "1.2.1"
changes: "Fixed incomplete revert - now uses git reset HEAD to unstage files (rules_check stages with git add -A)"
- version: "1.2.0"
@@ -49,9 +69,10 @@ steps:
quality_criteria:
- "**Sub-Agents Used**: Did the main agent spawn sub-agents (using the Task tool) to make the file edits? The main agent must NOT edit the test files directly."
- "**Parallel Execution**: Were multiple sub-agents launched in parallel (in a single message with multiple Task tool calls)?"
- "**Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command."
- "**Task Parameters**: Did each Task call include `model: \"haiku\"` and `max_turns: 5`?"
- "**Magic String Detection**: Did the main agent check each sub-agent's response for `TASK_START:` (present) and absence of `HOOK_FIRED:`? The agent must NOT manually run rules_check."
- "**Early Termination**: If 2 tests failed, did testing halt immediately with results reported?"
- "**Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json`?"
- "**Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue`?"

- id: run_fire_tests
name: "Run Should-Fire Tests"
@@ -67,7 +88,8 @@ steps:
quality_criteria:
- "**Sub-Agents Used**: Did the main agent spawn a sub-agent (using the Task tool) for EACH test? The main agent must NOT edit the test files directly."
- "**Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination?"
- "**Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command."
- "**Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` run after each test to revert files and prevent cross-contamination?"
- "**Task Parameters**: Did each Task call include `model: \"haiku\"` and `max_turns: 5`?"
- "**Magic String Detection**: Did the main agent check each sub-agent's response for `HOOK_FIRED:` (present) or timeout (neither TASK_START nor HOOK_FIRED)? The agent must NOT manually run rules_check."
- "**Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` run after each test to revert files and prevent cross-contamination?"
- "**Early Termination**: If 2 tests failed, did testing halt immediately with results reported?"
- "**Results Recorded**: Did the main agent track pass/fail status for each test case?"
90 changes: 59 additions & 31 deletions .deepwork/jobs/manual_tests/steps/run_fire_tests.md
@@ -9,9 +9,9 @@ Run all "should fire" tests in **serial** sub-agents to verify that rules fire correctly
**You MUST spawn sub-agents to make all file edits. DO NOT edit the test files yourself.**

Why sub-agents are required:
1. Sub-agents run in isolated contexts where file changes are detected
1. Sub-agents run in isolated contexts where file changes can be detected
2. When a sub-agent completes, the Stop hook **automatically** evaluates rules
3. You (the main agent) observe whether hooks fired - you do NOT manually trigger them
3. You (the main agent) check the sub-agent's returned text for **magic strings** to determine if a hook fired
4. If you edit files directly, the hooks won't fire because you're not a completing sub-agent

**NEVER manually run `echo '{}' | python -m deepwork.hooks.rules_check`** - this defeats the purpose of the test. Hooks must fire AUTOMATICALLY when sub-agents return.
@@ -28,24 +28,47 @@ Why serial execution is required:

## Task

Run all 8 "should fire" tests in **serial** sub-agents, reverting between each, and verify that blocking hooks fire automatically.
Run all 8 "should fire" tests in **serial** sub-agents, reverting between each, and verify that blocking hooks fire automatically by checking for magic strings.

### Process

**CRITICAL: Task Tool Parameters**

Each Task tool call MUST include:
- `model: "haiku"` - Use the fast model to minimize cost and latency
- `max_turns: 5` - Prevent sub-agents from hanging indefinitely

This limits each sub-agent to ~5 API round-trips. If a sub-agent hits the limit (e.g., stuck in infinite block without providing a promise), this confirms the hook IS firing and blocking them - treat it as test PASSED.

**CRITICAL: Magic String Instructions for Sub-Agents**

Every sub-agent prompt MUST include this instruction:
> "IMPORTANT: Start your response with exactly `TASK_START: <brief task description>`. Keep your response brief - just make the edit and confirm. If a DeepWork hook fires and blocks you with a rules message, also include `HOOK_FIRED: <rule name>` in your response."

**How detection works:**
- Sub-agent ALWAYS outputs `TASK_START:` at the beginning of their response
- If a hook fires and blocks them, they get another turn and can output `HOOK_FIRED:`
- Main agent checks:
- `HOOK_FIRED:` present → hook fired (test PASSED)
- `TASK_START:` present + no `HOOK_FIRED:` → hook did NOT fire (test FAILED)
- Neither `TASK_START:` nor `HOOK_FIRED:` → timeout (test PASSED - confirms hook is blocking infinitely)
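
The detection rules above can be sketched as a small classifier. This is an illustrative sketch only: the function name is hypothetical, and the main agent applies this logic when reading sub-agent responses rather than running code from the DeepWork codebase.

```python
def classify_fire_test(response: str) -> str:
    """Classify a "should fire" test from a sub-agent's response text.

    HOOK_FIRED present     -> hook fired                -> PASSED
    TASK_START only        -> hook did not fire         -> FAILED
    neither marker present -> timeout / infinite block  -> PASSED
    """
    if "HOOK_FIRED:" in response:
        return "PASSED"  # hook fired and the sub-agent reported it
    if "TASK_START:" in response:
        return "FAILED"  # sub-agent finished without being blocked
    # Neither marker: the sub-agent hit max_turns, which indicates the
    # hook is blocking infinitely - for a "should fire" test that is a pass.
    return "PASSED"
```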

For EACH test below, follow this cycle:

1. **Launch a sub-agent** using the Task tool (use a fast model like haiku)
2. **Wait for the sub-agent to complete**
3. **Observe whether the hook fired automatically** - you should see a blocking prompt or command output
4. **If no visible blocking occurred, check the queue**:
1. **Launch a sub-agent** using the Task tool (set `model: "haiku"` and `max_turns: 5`)
2. **Wait for the sub-agent to complete (or hit max_turns limit)**
3. **Check the sub-agent's response for magic strings**:
- `HOOK_FIRED:` present = Hook fired successfully (test PASSED)
- `TASK_START:` present + no `HOOK_FIRED:` = Hook did NOT fire (test FAILED)
- Neither = Timeout/infinite block (test PASSED - confirms hook is blocking)
4. **If inconclusive, check the queue as a fallback**:
```bash
ls -la .deepwork/tmp/rules/queue/
cat .deepwork/tmp/rules/queue/*.json 2>/dev/null
```
- If queue entries exist with status "queued", the hook DID fire but blocking wasn't visible
- If queue entries exist with status "queued", the hook DID fire
- If queue is empty, the hook did NOT fire at all
- Record the queue status along with the result
5. **Record the result** - pass if hook fired (visible block OR queue entry), fail if neither
5. **Record the result** - pass if hook fired (magic string OR queue entry OR timeout), fail if `TASK_START` present without `HOOK_FIRED`
6. **Revert changes and clear queue** (MANDATORY after each test):
```bash
git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml
```
@@ -67,43 +90,43 @@ For EACH test below, follow this cycle:
### Test Cases (run serially)

**Test 1: Trigger/Safety**
- Sub-agent prompt: "Edit ONLY `manual_tests/test_trigger_safety_mode/feature.py` to add a comment. Do NOT edit the `_doc.md` file."
- Expected: Hook fires with prompt about updating documentation
- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Trigger/Safety test`. Edit ONLY `manual_tests/test_trigger_safety_mode/test_trigger_safety_mode.py` to add a comment. Do NOT edit the `_doc.md` file. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: <rule name>` in your response."
- Expected: Hook fires with prompt about updating documentation → sub-agent returns `HOOK_FIRED:`

**Test 2: Set Mode**
- Sub-agent prompt: "Edit ONLY `manual_tests/test_set_mode/module_source.py` to add a comment. Do NOT edit the `_test.py` file."
- Expected: Hook fires with prompt about updating tests
- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Set Mode test`. Edit ONLY `manual_tests/test_set_mode/test_set_mode_source.py` to add a comment. Do NOT edit the `_test.py` file. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: <rule name>` in your response."
- Expected: Hook fires with prompt about updating tests → sub-agent returns `HOOK_FIRED:`

**Test 3: Pair Mode**
- Sub-agent prompt: "Edit ONLY `manual_tests/test_pair_mode/handler_trigger.py` to add a comment. Do NOT edit the `_expected.md` file."
- Expected: Hook fires with prompt about updating expected output
- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Pair Mode test`. Edit ONLY `manual_tests/test_pair_mode/test_pair_mode_trigger.py` to add a comment. Do NOT edit the `_expected.md` file. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: <rule name>` in your response."
- Expected: Hook fires with prompt about updating expected output → sub-agent returns `HOOK_FIRED:`

**Test 4: Command Action**
- Sub-agent prompt: "Edit `manual_tests/test_command_action/input.txt` to add some text."
- Expected: Command runs automatically, appending to the log file (this rule always runs, no safety condition)
- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Command Action test`. Edit `manual_tests/test_command_action/test_command_action.txt` to add some text. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: <rule name>` in your response."
- Expected: Command runs automatically, appending to the log file. NOTE: Command actions don't block, so sub-agent returns only `TASK_START:` - verify by checking the log file was appended to.
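
Because Test 4 is the one case where `TASK_START:` alone is expected, verification shifts to the log file. A minimal sketch of that check follows; the log path is an assumption for illustration, so substitute whatever file the command action actually appends to.

```python
from pathlib import Path

# Hypothetical log path: use the file the command-action rule appends to.
LOG = Path("manual_tests/test_command_action/log.txt")

def line_count(path: Path) -> int:
    # A missing file counts as zero lines so the check works on a clean tree.
    return len(path.read_text().splitlines()) if path.exists() else 0

before = line_count(LOG)
# ... spawn the Test 4 sub-agent here ...
after = line_count(LOG)
print("command action ran (PASSED)" if after > before else "log unchanged (FAILED)")
```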

**Test 5: Multi Safety**
- Sub-agent prompt: "Edit ONLY `manual_tests/test_multi_safety/core.py` to add a comment. Do NOT edit any of the safety files (`_safety_a.md`, `_safety_b.md`, or `_safety_c.md`)."
- Expected: Hook fires with prompt about updating safety documentation
- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Multi Safety test`. Edit ONLY `manual_tests/test_multi_safety/test_multi_safety.py` to add a comment. Do NOT edit any of the safety files (`_changelog.md` or `_version.txt`). If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: <rule name>` in your response."
- Expected: Hook fires with prompt about updating safety documentation → sub-agent returns `HOOK_FIRED:`

**Test 6: Infinite Block Prompt**
- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Do NOT include any promise tags."
- Expected: Hook fires and BLOCKS with infinite prompt - sub-agent cannot complete until promise is provided
- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Prompt test`. Edit `manual_tests/test_infinite_block_prompt/test_infinite_block_prompt.py` to add a comment. Do NOT include any promise tags. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: <rule name>` in your response."
- Expected: Hook fires and BLOCKS with infinite prompt → sub-agent returns `HOOK_FIRED:` or hits timeout

**Test 7: Infinite Block Command**
- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Do NOT include any promise tags."
- Expected: Hook fires and command fails - sub-agent cannot complete until promise is provided
- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Command test`. Edit `manual_tests/test_infinite_block_command/test_infinite_block_command.py` to add a comment. Do NOT include any promise tags. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: <rule name>` in your response."
- Expected: Hook fires and command fails → sub-agent returns `HOOK_FIRED:` or hits timeout

**Test 8: Created Mode**
- Sub-agent prompt: "Create a NEW file `manual_tests/test_created_mode/new_config.yml` with some YAML content. This must be a NEW file, not a modification."
- Expected: Hook fires with prompt about new configuration files
- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Created Mode test`. Create a NEW file `manual_tests/test_created_mode/new_config.yml` with some YAML content. This must be a NEW file, not a modification. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: <rule name>` in your response."
- Expected: Hook fires with prompt about new configuration files → sub-agent returns `HOOK_FIRED:`

### Results Tracking

Record the result after each test:

| Test Case | Should Fire | Visible Block? | Queue Entry? | Result |
|-----------|-------------|:--------------:|:------------:|:------:|
| Test Case | Should Fire | Magic String | Queue Entry? | Result |
|-----------|-------------|:------------:|:------------:|:------:|
| Trigger/Safety | Edit .py only | | | |
| Set Mode | Edit _source.py only | | | |
| Pair Mode | Edit _trigger.py only | | | |
@@ -113,7 +136,12 @@ Record the result after each test:
| Infinite Block Command | Edit .py (no promise) | | | |
| Created Mode | Create NEW .yml | | | |

**Queue Entry Status Guide:**
**Magic String Guide:**
- `HOOK_FIRED:` in response → Hook fired successfully (test PASSED)
- `TASK_START:` present + no `HOOK_FIRED:` → Hook did NOT fire (test FAILED, except for Command Action)
- Neither present (timeout) → Hook is blocking infinitely (test PASSED - confirms hook fired)

**Queue Entry Status Guide (fallback):**
- If queue has entry with status "queued" → Hook fired, rule was shown to agent
- If queue has entry with status "passed" → Hook fired, rule was satisfied
- If queue is empty → Hook did NOT fire
@@ -123,8 +151,8 @@ Record the result after each test:
- **Sub-agents spawned**: Tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly
- **Serial execution**: Sub-agents were launched ONE AT A TIME, not in parallel
- **Git reverted and queue cleared between tests**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` was run after each test
- **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check - hooks fired AUTOMATICALLY
- **Blocking behavior verified**: For each test run, the appropriate blocking hook fired automatically when the sub-agent returned
- **Magic string detection**: The main agent checked each sub-agent's response for `TASK_START:` and `HOOK_FIRED:` - did NOT manually run rules_check
- **Hooks fired correctly**: For each test, sub-agent returned `HOOK_FIRED:` or timed out (indicating the rule was triggered)
- **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported
- **Results recorded**: Pass/fail status was recorded for each test run
- When all criteria are met, include `<promise>✓ Quality Criteria Met</promise>` in your response