diff --git a/.claude/settings.json b/.claude/settings.json
index 33ef2b48..c6ea70cb 100644
--- a/.claude/settings.json
+++ b/.claude/settings.json
@@ -120,7 +120,8 @@
  "Skill(manual_tests.run_fire_tests)",
  "Skill(deepwork_rules)",
  "Skill(deepwork_rules.define)",
- "Bash(deepwork rules clear_queue)"
+ "Bash(deepwork rules clear_queue)",
+ "Bash(rm -rf .deepwork/tmp/rules/queue/*.json)"
  ]
  },
  "hooks": {
diff --git a/.claude/skills/manual_tests.run_fire_tests/SKILL.md b/.claude/skills/manual_tests.run_fire_tests/SKILL.md
index 86edc039..211889d8 100644
--- a/.claude/skills/manual_tests.run_fire_tests/SKILL.md
+++ b/.claude/skills/manual_tests.run_fire_tests/SKILL.md
@@ -13,10 +13,11 @@ hooks:
  1. **Sub-Agents Used**: Did the main agent spawn a sub-agent (using the Task tool) for EACH test? The main agent must NOT edit the test files directly.
  2. **Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination?
- 3. **Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command.
- 4. **Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` run after each test to revert files and prevent cross-contamination?
- 5. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
- 6. **Results Recorded**: Did the main agent track pass/fail status for each test case?
+ 3. **Task Parameters**: Did each Task call include `model: "haiku"` and `max_turns: 5`?
+ 4. **Magic String Detection**: Did the main agent check each sub-agent's response for `HOOK_FIRED:` (present) or timeout (neither TASK_START nor HOOK_FIRED)? The agent must NOT manually run rules_check.
+ 5. **Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` run after each test to revert files and prevent cross-contamination?
+ 6. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
+ 7. **Results Recorded**: Did the main agent track pass/fail status for each test case?
 ## Instructions
@@ -38,10 +39,11 @@ hooks:
  1. **Sub-Agents Used**: Did the main agent spawn a sub-agent (using the Task tool) for EACH test? The main agent must NOT edit the test files directly.
  2. **Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination?
- 3. **Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command.
- 4. **Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` run after each test to revert files and prevent cross-contamination?
- 5. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
- 6. **Results Recorded**: Did the main agent track pass/fail status for each test case?
+ 3. **Task Parameters**: Did each Task call include `model: "haiku"` and `max_turns: 5`?
+ 4. **Magic String Detection**: Did the main agent check each sub-agent's response for `HOOK_FIRED:` (present) or timeout (neither TASK_START nor HOOK_FIRED)? The agent must NOT manually run rules_check.
+ 5. **Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` run after each test to revert files and prevent cross-contamination?
+ 6. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
+ 7. **Results Recorded**: Did the main agent track pass/fail status for each test case?
 ## Instructions
@@ -81,9 +83,9 @@ Run all "should fire" tests in **serial** sub-agents to verify that rules fire c
 **You MUST spawn sub-agents to make all file edits. DO NOT edit the test files yourself.**
 Why sub-agents are required:
-1. Sub-agents run in isolated contexts where file changes are detected
+1. Sub-agents run in isolated contexts where file changes can be detected
 2. When a sub-agent completes, the Stop hook **automatically** evaluates rules
-3. You (the main agent) observe whether hooks fired - you do NOT manually trigger them
+3. You (the main agent) check the sub-agent's returned text for **magic strings** to determine if a hook fired
 4. If you edit files directly, the hooks won't fire because you're not a completing sub-agent
 **NEVER manually run `echo '{}' | python -m deepwork.hooks.rules_check`** - this defeats the purpose of the test. Hooks must fire AUTOMATICALLY when sub-agents return.
@@ -100,34 +102,57 @@ Why serial execution is required:
 ## Task
-Run all 8 "should fire" tests in **serial** sub-agents, reverting between each, and verify that blocking hooks fire automatically.
+Run all 8 "should fire" tests in **serial** sub-agents, reverting between each, and verify that blocking hooks fire automatically by checking for magic strings.
 ### Process
+**CRITICAL: Task Tool Parameters**
+
+Each Task tool call MUST include:
+- `model: "haiku"` - Use the fast model to minimize cost and latency
+- `max_turns: 5` - Prevent sub-agents from hanging indefinitely
+
+This limits each sub-agent to ~5 API round-trips. If a sub-agent hits the limit (e.g., stuck in infinite block without providing a promise), this confirms the hook IS firing and blocking them - treat it as test PASSED.
+
+**CRITICAL: Magic String Instructions for Sub-Agents**
+
+Every sub-agent prompt MUST include this instruction:
+> "IMPORTANT: Start your response with exactly `TASK_START: `. Keep your response brief - just make the edit and confirm. If a DeepWork hook fires and blocks you with a rules message, also include `HOOK_FIRED: ` in your response."
+
+**How detection works:**
+- Sub-agent ALWAYS outputs `TASK_START:` at the beginning of their response
+- If a hook fires and blocks them, they get another turn and can output `HOOK_FIRED:`
+- Main agent checks:
+ - `HOOK_FIRED:` present → hook fired (test PASSED)
+ - `TASK_START:` present + no `HOOK_FIRED:` → hook did NOT fire (test FAILED)
+ - Neither `TASK_START:` nor `HOOK_FIRED:` → timeout (test PASSED - confirms hook is blocking infinitely)
+
 For EACH test below, follow this cycle:
-1. **Launch a sub-agent** using the Task tool (use a fast model like haiku)
-2. **Wait for the sub-agent to complete**
-3. **Observe whether the hook fired automatically** - you should see a blocking prompt or command output
-4. **If no visible blocking occurred, check the queue**:
+1. **Launch a sub-agent** using the Task tool (set `model: "haiku"` and `max_turns: 5`)
+2. **Wait for the sub-agent to complete (or hit max_turns limit)**
+3. **Check the sub-agent's response for magic strings**:
+ - `HOOK_FIRED:` present = Hook fired successfully (test PASSED)
+ - `TASK_START:` present + no `HOOK_FIRED:` = Hook did NOT fire (test FAILED)
+ - Neither = Timeout/infinite block (test PASSED - confirms hook is blocking)
+4. **If inconclusive, check the queue as a fallback**:
  ```bash
  ls -la .deepwork/tmp/rules/queue/
  cat .deepwork/tmp/rules/queue/*.json 2>/dev/null
  ```
- - If queue entries exist with status "queued", the hook DID fire but blocking wasn't visible
+ - If queue entries exist with status "queued", the hook DID fire
  - If queue is empty, the hook did NOT fire at all
- - Record the queue status along with the result
-5. **Record the result** - pass if hook fired (visible block OR queue entry), fail if neither
+5. **Record the result** - pass if hook fired (magic string OR queue entry OR timeout), fail if `TASK_START` present without `HOOK_FIRED`
 6. **Revert changes and clear queue** (MANDATORY after each test):
  ```bash
  git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml
- rm -rf .deepwork/tmp/rules/queue/*.json 2>/dev/null || true
+ deepwork rules clear_queue
  ```
  **Why this command sequence**:
  - `git reset HEAD manual_tests/` - Unstages files from the index (rules_check uses `git add -A` which stages changes)
  - `git checkout -- manual_tests/` - Reverts working tree to match HEAD
  - `rm -f ...` - Removes any new files created during tests
- - The queue clear removes rules that have been shown (status=QUEUED) so they can fire again
+ - `deepwork rules clear_queue` - Clears the rules queue so rules can fire again
 7. **Check for early termination**: If **2 tests have now failed**, immediately:
  - Stop running any remaining tests
  - Report the results summary showing which tests passed/failed
@@ -139,43 +164,43 @@ For EACH test below, follow this cycle:
 ### Test Cases (run serially)
 **Test 1: Trigger/Safety**
-- Sub-agent prompt: "Edit ONLY `manual_tests/test_trigger_safety_mode/feature.py` to add a comment. Do NOT edit the `_doc.md` file."
-- Expected: Hook fires with prompt about updating documentation
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Trigger/Safety test`. Edit ONLY `manual_tests/test_trigger_safety_mode/test_trigger_safety_mode.py` to add a comment. Do NOT edit the `_doc.md` file. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires with prompt about updating documentation → sub-agent returns `HOOK_FIRED:`
 **Test 2: Set Mode**
-- Sub-agent prompt: "Edit ONLY `manual_tests/test_set_mode/module_source.py` to add a comment. Do NOT edit the `_test.py` file."
-- Expected: Hook fires with prompt about updating tests
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Set Mode test`. Edit ONLY `manual_tests/test_set_mode/test_set_mode_source.py` to add a comment. Do NOT edit the `_test.py` file. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires with prompt about updating tests → sub-agent returns `HOOK_FIRED:`
 **Test 3: Pair Mode**
-- Sub-agent prompt: "Edit ONLY `manual_tests/test_pair_mode/handler_trigger.py` to add a comment. Do NOT edit the `_expected.md` file."
-- Expected: Hook fires with prompt about updating expected output
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Pair Mode test`. Edit ONLY `manual_tests/test_pair_mode/test_pair_mode_trigger.py` to add a comment. Do NOT edit the `_expected.md` file. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires with prompt about updating expected output → sub-agent returns `HOOK_FIRED:`
 **Test 4: Command Action**
-- Sub-agent prompt: "Edit `manual_tests/test_command_action/input.txt` to add some text."
-- Expected: Command runs automatically, appending to the log file (this rule always runs, no safety condition)
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Command Action test`. Edit `manual_tests/test_command_action/test_command_action.txt` to add some text. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Command runs automatically, appending to the log file. NOTE: Command actions don't block, so sub-agent returns only `TASK_START:` - verify by checking the log file was appended to.
 **Test 5: Multi Safety**
-- Sub-agent prompt: "Edit ONLY `manual_tests/test_multi_safety/core.py` to add a comment. Do NOT edit any of the safety files (`_safety_a.md`, `_safety_b.md`, or `_safety_c.md`)."
-- Expected: Hook fires with prompt about updating safety documentation
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Multi Safety test`. Edit ONLY `manual_tests/test_multi_safety/test_multi_safety.py` to add a comment. Do NOT edit any of the safety files (`_changelog.md` or `_version.txt`). If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires with prompt about updating safety documentation → sub-agent returns `HOOK_FIRED:`
 **Test 6: Infinite Block Prompt**
-- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Do NOT include any promise tags."
-- Expected: Hook fires and BLOCKS with infinite prompt - sub-agent cannot complete until promise is provided
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Prompt test`. Edit `manual_tests/test_infinite_block_prompt/test_infinite_block_prompt.py` to add a comment. Do NOT include any promise tags. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires and BLOCKS with infinite prompt → sub-agent returns `HOOK_FIRED:` or hits timeout
 **Test 7: Infinite Block Command**
-- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Do NOT include any promise tags."
-- Expected: Hook fires and command fails - sub-agent cannot complete until promise is provided
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Command test`. Edit `manual_tests/test_infinite_block_command/test_infinite_block_command.py` to add a comment. Do NOT include any promise tags. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires and command fails → sub-agent returns `HOOK_FIRED:` or hits timeout
 **Test 8: Created Mode**
-- Sub-agent prompt: "Create a NEW file `manual_tests/test_created_mode/new_config.yml` with some YAML content. This must be a NEW file, not a modification."
-- Expected: Hook fires with prompt about new configuration files
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Created Mode test`. Create a NEW file `manual_tests/test_created_mode/new_config.yml` with some YAML content. This must be a NEW file, not a modification. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires with prompt about new configuration files → sub-agent returns `HOOK_FIRED:`
 ### Results Tracking
 Record the result after each test:
-| Test Case | Should Fire | Visible Block? | Queue Entry? | Result |
-|-----------|-------------|:--------------:|:------------:|:------:|
+| Test Case | Should Fire | Magic String | Queue Entry? | Result |
+|-----------|-------------|:------------:|:------------:|:------:|
 | Trigger/Safety | Edit .py only | | | |
 | Set Mode | Edit _source.py only | | | |
 | Pair Mode | Edit _trigger.py only | | | |
@@ -185,7 +210,12 @@ Record the result after each test:
 | Infinite Block Command | Edit .py (no promise) | | | |
 | Created Mode | Create NEW .yml | | | |
-**Queue Entry Status Guide:**
+**Magic String Guide:**
+- `HOOK_FIRED:` in response → Hook fired successfully (test PASSED)
+- `TASK_START:` present + no `HOOK_FIRED:` → Hook did NOT fire (test FAILED, except for Command Action)
+- Neither present (timeout) → Hook is blocking infinitely (test PASSED - confirms hook fired)
+
+**Queue Entry Status Guide (fallback):**
 - If queue has entry with status "queued" → Hook fired, rule was shown to agent
 - If queue has entry with status "passed" → Hook fired, rule was satisfied
 - If queue is empty → Hook did NOT fire
@@ -194,9 +224,9 @@ Record the result after each test:
 - **Sub-agents spawned**: Tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly
 - **Serial execution**: Sub-agents were launched ONE AT A TIME, not in parallel
-- **Git reverted and queue cleared between tests**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` was run after each test
-- **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check - hooks fired AUTOMATICALLY
-- **Blocking behavior verified**: For each test run, the appropriate blocking hook fired automatically when the sub-agent returned
+- **Git reverted and queue cleared between tests**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` was run after each test
+- **Magic string detection**: The main agent checked each sub-agent's response for `TASK_START:` and `HOOK_FIRED:` - did NOT manually run rules_check
+- **Hooks fired correctly**: For each test, sub-agent returned `HOOK_FIRED:` or timed out (indicating the rule was triggered)
 - **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported
 - **Results recorded**: Pass/fail status was recorded for each test run
 - When all criteria are met, include `✓ Quality Criteria Met` in your response
@@ -218,11 +248,27 @@ This job tests that rules fire when they should AND do not fire when they should
 Each test is run in a SUB-AGENT (not the main agent) because:
 1. Sub-agents run in isolated contexts where file changes can be detected
 2. The Stop hook automatically evaluates rules when each sub-agent completes
-3. The main agent can observe whether hooks fired without triggering them manually
+3. Sub-agents report results via MAGIC STRINGS that the main agent checks
+
+MAGIC STRING DETECTION: Sub-agents output:
+- "TASK_START: " - ALWAYS at the start of their response
+- "HOOK_FIRED: " - If a DeepWork hook blocks them
+Detection logic:
+- TASK_START present + no HOOK_FIRED = hook did NOT fire
+- HOOK_FIRED present = hook fired
+- Neither present = timeout (hook blocking infinitely)
+
+TIMEOUT PREVENTION: All sub-agent Task calls use max_turns: 5 to prevent
+infinite hangs. If a sub-agent hits the limit (e.g., stuck in infinite block),
+treat as timeout - PASSED for "should fire" tests, FAILED for "should NOT fire".
+
+TOKEN OVERHEAD: Each sub-agent uses ~16k input tokens (system prompt + tool
+definitions). This is unavoidable baseline overhead for agents with Edit access.
+Sub-agent prompts include efficiency instructions to minimize additional usage.
 CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the file
-edits itself - it spawns sub-agents to make edits, then observes whether the hooks
-fired automatically when those sub-agents returned.
+edits itself - it spawns sub-agents to make edits, then checks the returned magic
+strings to determine whether hooks fired.
 Steps:
 1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents
@@ -270,10 +316,11 @@ Stop hooks will automatically validate your work. The loop continues until all c
 **Criteria (all must be satisfied)**:
 1. **Sub-Agents Used**: Did the main agent spawn a sub-agent (using the Task tool) for EACH test? The main agent must NOT edit the test files directly.
 2. **Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination?
-3. **Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command.
-4. **Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` run after each test to revert files and prevent cross-contamination?
-5. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
-6. **Results Recorded**: Did the main agent track pass/fail status for each test case?
+3. **Task Parameters**: Did each Task call include `model: "haiku"` and `max_turns: 5`?
+4. **Magic String Detection**: Did the main agent check each sub-agent's response for `HOOK_FIRED:` (present) or timeout (neither TASK_START nor HOOK_FIRED)? The agent must NOT manually run rules_check.
+5. **Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` run after each test to revert files and prevent cross-contamination?
+6. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
+7. **Results Recorded**: Did the main agent track pass/fail status for each test case?
 **To complete**: Include `✓ Quality Criteria Met` in your final response only after verifying ALL criteria are satisfied.
diff --git a/.claude/skills/manual_tests.run_not_fire_tests/SKILL.md b/.claude/skills/manual_tests.run_not_fire_tests/SKILL.md
index 2597c0f3..e41120eb 100644
--- a/.claude/skills/manual_tests.run_not_fire_tests/SKILL.md
+++ b/.claude/skills/manual_tests.run_not_fire_tests/SKILL.md
@@ -13,9 +13,10 @@ hooks:
  1. **Sub-Agents Used**: Did the main agent spawn sub-agents (using the Task tool) to make the file edits? The main agent must NOT edit the test files directly.
  2. **Parallel Execution**: Were multiple sub-agents launched in parallel (in a single message with multiple Task tool calls)?
- 3. **Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command.
- 4. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
- 5. **Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json`?
+ 3. **Task Parameters**: Did each Task call include `model: "haiku"` and `max_turns: 5`?
+ 4. **Magic String Detection**: Did the main agent check each sub-agent's response for `TASK_START:` (present) and absence of `HOOK_FIRED:`? The agent must NOT manually run rules_check.
+ 5. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
+ 6. **Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue`?
 ## Instructions
@@ -37,9 +38,10 @@ hooks:
  1. **Sub-Agents Used**: Did the main agent spawn sub-agents (using the Task tool) to make the file edits? The main agent must NOT edit the test files directly.
  2. **Parallel Execution**: Were multiple sub-agents launched in parallel (in a single message with multiple Task tool calls)?
- 3. **Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command.
- 4. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
- 5. **Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json`?
+ 3. **Task Parameters**: Did each Task call include `model: "haiku"` and `max_turns: 5`?
+ 4. **Magic String Detection**: Did the main agent check each sub-agent's response for `TASK_START:` (present) and absence of `HOOK_FIRED:`? The agent must NOT manually run rules_check.
+ 5. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
+ 6. **Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue`?
 ## Instructions
@@ -75,48 +77,67 @@ Run all "should NOT fire" tests in parallel sub-agents to verify that rules do n
 **You MUST spawn sub-agents to make all file edits. DO NOT edit the test files yourself.**
 Why sub-agents are required:
-1. Sub-agents run in isolated contexts where file changes are detected
+1. Sub-agents run in isolated contexts where file changes can be detected
 2. When a sub-agent completes, the Stop hook **automatically** evaluates rules
-3. You (the main agent) observe whether hooks fired - you do NOT manually trigger them
+3. You (the main agent) check the sub-agent's returned text for **magic strings** to determine if a hook fired
 4. If you edit files directly, the hooks won't fire because you're not a completing sub-agent
 **NEVER manually run `echo '{}' | python -m deepwork.hooks.rules_check`** - this defeats the purpose of the test. Hooks must fire AUTOMATICALLY when sub-agents return.
 ## Task
-Run all 8 "should NOT fire" tests in **parallel** sub-agents, then verify no blocking hooks fired.
+Run all 8 "should NOT fire" tests in **parallel** sub-agents, then check each sub-agent's response for magic strings to determine pass/fail.
 ### Process
 1. **Launch parallel sub-agents for all "should NOT fire" tests**
- Use the Task tool to spawn **ALL of the following sub-agents in a SINGLE message** (parallel execution). Each sub-agent should use a fast model like haiku.
+ Use the Task tool to spawn **ALL of the following sub-agents in a SINGLE message** (parallel execution).
+
+ **CRITICAL: Task Tool Parameters**
+
+ Each Task tool call MUST include:
+ - `model: "haiku"` - Use the fast model to minimize cost and latency
+ - `max_turns: 5` - Prevent sub-agents from hanging indefinitely
+
+ This limits each sub-agent to ~5 API round-trips, which is plenty for these simple edit tasks. If a sub-agent hits the limit, treat it as a timeout/failure.
+
+ **CRITICAL: Magic String Instructions for Sub-Agents**
+
+ Every sub-agent prompt MUST include this instruction:
+ > "IMPORTANT: Start your response with exactly `TASK_START: `. Keep your response brief - just make the edit and confirm. If a DeepWork hook fires and blocks you with a rules message, also include `HOOK_FIRED: ` in your response."
+
+ **How detection works:**
+ - Sub-agent ALWAYS outputs `TASK_START:` at the beginning of their response
+ - If a hook fires and blocks them, they get another turn and can output `HOOK_FIRED:`
+ - Main agent checks: `TASK_START` present + no `HOOK_FIRED` = hook did NOT fire
  **Sub-agent prompts (launch all 8 in parallel):**
- a. **Trigger/Safety test** - "Edit `manual_tests/test_trigger_safety_mode/feature.py` to add a comment, AND edit `manual_tests/test_trigger_safety_mode/feature_doc.md` to add a note. Both files must be edited so the rule does NOT fire."
+ a. **Trigger/Safety test** - "IMPORTANT: Start your response with exactly `TASK_START: Trigger/Safety test`. Edit `manual_tests/test_trigger_safety_mode/test_trigger_safety_mode.py` to add a comment, AND edit `manual_tests/test_trigger_safety_mode/test_trigger_safety_mode_doc.md` to add a note. Both files must be edited so the rule does NOT fire. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- b. **Set Mode test** - "Edit `manual_tests/test_set_mode/module_source.py` to add a comment, AND edit `manual_tests/test_set_mode/module_test.py` to add a test comment. Both files must be edited so the rule does NOT fire."
+ b. **Set Mode test** - "IMPORTANT: Start your response with exactly `TASK_START: Set Mode test`. Edit `manual_tests/test_set_mode/test_set_mode_source.py` to add a comment, AND edit `manual_tests/test_set_mode/test_set_mode_test.py` to add a test comment. Both files must be edited so the rule does NOT fire. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- c. **Pair Mode (forward) test** - "Edit `manual_tests/test_pair_mode/handler_trigger.py` to add a comment, AND edit `manual_tests/test_pair_mode/handler_expected.md` to add a note. Both files must be edited so the rule does NOT fire."
+ c. **Pair Mode (forward) test** - "IMPORTANT: Start your response with exactly `TASK_START: Pair Mode forward test`. Edit `manual_tests/test_pair_mode/test_pair_mode_trigger.py` to add a comment, AND edit `manual_tests/test_pair_mode/test_pair_mode_expected.md` to add a note. Both files must be edited so the rule does NOT fire. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- d. **Pair Mode (reverse) test** - "Edit ONLY `manual_tests/test_pair_mode/handler_expected.md` to add a note. Only the expected file should be edited - this tests that the pair rule only fires in one direction."
+ d. **Pair Mode (reverse) test** - "IMPORTANT: Start your response with exactly `TASK_START: Pair Mode reverse test`. Edit ONLY `manual_tests/test_pair_mode/test_pair_mode_expected.md` to add a note. Only the expected file should be edited - this tests that the pair rule only fires in one direction. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- e. **Multi Safety test** - "Edit `manual_tests/test_multi_safety/core.py` to add a comment, AND edit `manual_tests/test_multi_safety/core_safety_a.md` to add a note. Both files must be edited so the rule does NOT fire."
+ e. **Multi Safety test** - "IMPORTANT: Start your response with exactly `TASK_START: Multi Safety test`. Edit `manual_tests/test_multi_safety/test_multi_safety.py` to add a comment, AND edit `manual_tests/test_multi_safety/test_multi_safety_changelog.md` to add a note. Both files must be edited so the rule does NOT fire. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- f. **Infinite Block Prompt test** - "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block."
+ f. **Infinite Block Prompt test** - "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Prompt test`. Edit `manual_tests/test_infinite_block_prompt/test_infinite_block_prompt.py` to add a comment. Include `Manual Test: Infinite Block Prompt` in your response to bypass the infinite block. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- g. **Infinite Block Command test** - "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block."
+ g. **Infinite Block Command test** - "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Command test`. Edit `manual_tests/test_infinite_block_command/test_infinite_block_command.py` to add a comment. Include `Manual Test: Infinite Block Command` in your response to bypass the infinite block. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- h. **Created Mode test** - "Modify the EXISTING file `manual_tests/test_created_mode/existing.yml` by adding a comment. Do NOT create a new file - only modify the existing one. The created mode rule should NOT fire for modifications."
+ h. **Created Mode test** - "IMPORTANT: Start your response with exactly `TASK_START: Created Mode test`. Modify the EXISTING file `manual_tests/test_created_mode/existing_file.yml` by adding a comment. Do NOT create a new file - only modify the existing one. The created mode rule should NOT fire for modifications. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
-2. **Observe the results**
+2. **Check the results using magic strings**
- When each sub-agent returns:
- - **If no blocking hook fired**: The test PASSED - the rule correctly did NOT fire
- - **If a blocking hook fired**: The test FAILED - investigate why the rule fired when it shouldn't have
+ When each sub-agent returns, check their response for magic strings:
+ - **If `TASK_START:` present AND no `HOOK_FIRED:`**: The test PASSED - the rule correctly did NOT fire
+ - **If `HOOK_FIRED:` present**: The test FAILED - investigate why the rule fired when it shouldn't have
+ - **If neither `TASK_START:` nor `HOOK_FIRED:`**: The test is INCONCLUSIVE (timeout or sub-agent didn't follow instructions)
- **Remember**: You are OBSERVING whether hooks fired automatically. Do NOT run any verification commands manually.
+ **Remember**: You determine pass/fail by checking for magic strings in the sub-agent's response. Do NOT run any verification commands manually.
 3. **Record the results and check for early termination**
@@ -146,22 +167,22 @@ Run all 8 "should NOT fire" tests in **parallel** sub-agents, then verify no blo
  Run these commands to clean up:
  ```bash
  git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml
- rm -rf .deepwork/tmp/rules/queue/*.json 2>/dev/null || true
+ deepwork rules clear_queue
  ```
  **Why this command sequence**:
  - `git reset HEAD manual_tests/` - Unstages files from the index (rules_check uses `git add -A` which stages changes)
  - `git checkout -- manual_tests/` - Reverts working tree to match HEAD
  - `rm -f manual_tests/test_created_mode/new_config.yml` - Removes any new files created during tests
- - The queue clear removes rules that have been shown (status=QUEUED) so they can fire again
+ - `deepwork rules clear_queue` - Clears the rules queue so rules can fire again
 ## Quality Criteria
 - **Sub-agents spawned**: All 8 tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly
 - **Parallel execution**: All 8 sub-agents were launched in a single message (parallel)
-- **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check
+- **Magic string detection**: The main agent checked each sub-agent's response for `TASK_START:` (present) and `HOOK_FIRED:` (absent) - did NOT manually run rules_check
 - **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported
-- **Changes reverted and queue cleared**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` was run after tests completed (regardless of pass/fail)
+- **Changes reverted and queue cleared**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` was run after tests completed (regardless of pass/fail)
 - When all criteria are met, include `✓ Quality Criteria Met` in your response
 ## Reference
@@ -181,11 +202,27 @@ This job tests that rules fire when they should AND do not fire when they should
 Each test is run in a SUB-AGENT (not the main agent) because:
 1. Sub-agents run in isolated contexts where file changes can be detected
 2. The Stop hook automatically evaluates rules when each sub-agent completes
-3. The main agent can observe whether hooks fired without triggering them manually
+3. Sub-agents report results via MAGIC STRINGS that the main agent checks
+
+MAGIC STRING DETECTION: Sub-agents output:
+- "TASK_START: " - ALWAYS at the start of their response
+- "HOOK_FIRED: " - If a DeepWork hook blocks them
+Detection logic:
+- TASK_START present + no HOOK_FIRED = hook did NOT fire
+- HOOK_FIRED present = hook fired
+- Neither present = timeout (hook blocking infinitely)
+
+TIMEOUT PREVENTION: All sub-agent Task calls use max_turns: 5 to prevent
+infinite hangs. If a sub-agent hits the limit (e.g., stuck in infinite block),
+treat as timeout - PASSED for "should fire" tests, FAILED for "should NOT fire".
+
+TOKEN OVERHEAD: Each sub-agent uses ~16k input tokens (system prompt + tool
+definitions). This is unavoidable baseline overhead for agents with Edit access.
+Sub-agent prompts include efficiency instructions to minimize additional usage.
 CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the file
-edits itself - it spawns sub-agents to make edits, then observes whether the hooks
-fired automatically when those sub-agents returned.
+edits itself - it spawns sub-agents to make edits, then checks the returned magic
+strings to determine whether hooks fired.
 Steps:
 1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents
@@ -228,9 +265,10 @@ Stop hooks will automatically validate your work. The loop continues until all c
 **Criteria (all must be satisfied)**:
 1. **Sub-Agents Used**: Did the main agent spawn sub-agents (using the Task tool) to make the file edits? The main agent must NOT edit the test files directly.
 2. **Parallel Execution**: Were multiple sub-agents launched in parallel (in a single message with multiple Task tool calls)?
-3. **Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command.
-4. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
-5. **Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json`?
+3. **Task Parameters**: Did each Task call include `model: "haiku"` and `max_turns: 5`?
+4. **Magic String Detection**: Did the main agent check each sub-agent's response for `TASK_START:` (present) and absence of `HOOK_FIRED:`?
The agent must NOT manually run rules_check. +5. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? +6. **Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue`? **To complete**: Include `✓ Quality Criteria Met` in your final response only after verifying ALL criteria are satisfied. diff --git a/.claude/skills/manual_tests/SKILL.md b/.claude/skills/manual_tests/SKILL.md index bf97b88a..1907cf4f 100644 --- a/.claude/skills/manual_tests/SKILL.md +++ b/.claude/skills/manual_tests/SKILL.md @@ -15,11 +15,27 @@ This job tests that rules fire when they should AND do not fire when they should Each test is run in a SUB-AGENT (not the main agent) because: 1. Sub-agents run in isolated contexts where file changes can be detected 2. The Stop hook automatically evaluates rules when each sub-agent completes -3. The main agent can observe whether hooks fired without triggering them manually +3. Sub-agents report results via MAGIC STRINGS that the main agent checks + +MAGIC STRING DETECTION: Sub-agents output: +- "TASK_START: " - ALWAYS at the start of their response +- "HOOK_FIRED: " - If a DeepWork hook blocks them +Detection logic: +- TASK_START present + no HOOK_FIRED = hook did NOT fire +- HOOK_FIRED present = hook fired +- Neither present = timeout (hook blocking infinitely) + +TIMEOUT PREVENTION: All sub-agent Task calls use max_turns: 5 to prevent +infinite hangs. If a sub-agent hits the limit (e.g., stuck in infinite block), +treat as timeout - PASSED for "should fire" tests, FAILED for "should NOT fire". + +TOKEN OVERHEAD: Each sub-agent uses ~16k input tokens (system prompt + tool +definitions). This is unavoidable baseline overhead for agents with Edit access. +Sub-agent prompts include efficiency instructions to minimize additional usage. 
CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the file -edits itself - it spawns sub-agents to make edits, then observes whether the hooks -fired automatically when those sub-agents returned. +edits itself - it spawns sub-agents to make edits, then checks the returned magic +strings to determine whether hooks fired. Steps: 1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents diff --git a/.deepwork/jobs/manual_tests/job.yml b/.deepwork/jobs/manual_tests/job.yml index b2662c2c..04c02e21 100644 --- a/.deepwork/jobs/manual_tests/job.yml +++ b/.deepwork/jobs/manual_tests/job.yml @@ -1,5 +1,5 @@ name: manual_tests -version: "1.2.1" +version: "1.3.1" summary: "Runs all manual hook/rule tests using sub-agents. Use when validating that DeepWork rules fire correctly." description: | A workflow for running manual tests that validate DeepWork rules/hooks fire correctly. @@ -8,11 +8,27 @@ description: | Each test is run in a SUB-AGENT (not the main agent) because: 1. Sub-agents run in isolated contexts where file changes can be detected 2. The Stop hook automatically evaluates rules when each sub-agent completes - 3. The main agent can observe whether hooks fired without triggering them manually + 3. Sub-agents report results via MAGIC STRINGS that the main agent checks + + MAGIC STRING DETECTION: Sub-agents output: + - "TASK_START: " - ALWAYS at the start of their response + - "HOOK_FIRED: " - If a DeepWork hook blocks them + Detection logic: + - TASK_START present + no HOOK_FIRED = hook did NOT fire + - HOOK_FIRED present = hook fired + - Neither present = timeout (hook blocking infinitely) + + TIMEOUT PREVENTION: All sub-agent Task calls use max_turns: 5 to prevent + infinite hangs. If a sub-agent hits the limit (e.g., stuck in infinite block), + treat as timeout - PASSED for "should fire" tests, FAILED for "should NOT fire". + + TOKEN OVERHEAD: Each sub-agent uses ~16k input tokens (system prompt + tool + definitions). 
This is unavoidable baseline overhead for agents with Edit access. + Sub-agent prompts include efficiency instructions to minimize additional usage. CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the file - edits itself - it spawns sub-agents to make edits, then observes whether the hooks - fired automatically when those sub-agents returned. + edits itself - it spawns sub-agents to make edits, then checks the returned magic + strings to determine whether hooks fired. Steps: 1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents @@ -28,6 +44,10 @@ description: | - Created mode (new files only) changelog: + - version: "1.3.1" + changes: "Added TOKEN OVERHEAD note explaining ~16k baseline cost; added 'Keep your response brief' efficiency instruction to sub-agent prompts" + - version: "1.3.0" + changes: "Major overhaul: Added TASK_START/HOOK_FIRED magic string detection; fixed all file names in prompts; added max_turns: 5 timeout; use deepwork rules clear_queue CLI" - version: "1.2.1" changes: "Fixed incomplete revert - now uses git reset HEAD to unstage files (rules_check stages with git add -A)" - version: "1.2.0" @@ -49,9 +69,10 @@ steps: quality_criteria: - "**Sub-Agents Used**: Did the main agent spawn sub-agents (using the Task tool) to make the file edits? The main agent must NOT edit the test files directly." - "**Parallel Execution**: Were multiple sub-agents launched in parallel (in a single message with multiple Task tool calls)?" - - "**Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command." + - "**Task Parameters**: Did each Task call include `model: \"haiku\"` and `max_turns: 5`?" + - "**Magic String Detection**: Did the main agent check each sub-agent's response for `TASK_START:` (present) and absence of `HOOK_FIRED:`? The agent must NOT manually run rules_check." 
- "**Early Termination**: If 2 tests failed, did testing halt immediately with results reported?" - - "**Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json`?" + - "**Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue`?" - id: run_fire_tests name: "Run Should-Fire Tests" @@ -67,7 +88,8 @@ steps: quality_criteria: - "**Sub-Agents Used**: Did the main agent spawn a sub-agent (using the Task tool) for EACH test? The main agent must NOT edit the test files directly." - "**Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination?" - - "**Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command." - - "**Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` run after each test to revert files and prevent cross-contamination?" + - "**Task Parameters**: Did each Task call include `model: \"haiku\"` and `max_turns: 5`?" + - "**Magic String Detection**: Did the main agent check each sub-agent's response for `HOOK_FIRED:` (present) or timeout (neither TASK_START nor HOOK_FIRED)? The agent must NOT manually run rules_check." + - "**Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` run after each test to revert files and prevent cross-contamination?" - "**Early Termination**: If 2 tests failed, did testing halt immediately with results reported?" 
- "**Results Recorded**: Did the main agent track pass/fail status for each test case?" diff --git a/.deepwork/jobs/manual_tests/steps/run_fire_tests.md b/.deepwork/jobs/manual_tests/steps/run_fire_tests.md index 27f3cfc8..8193b6cb 100644 --- a/.deepwork/jobs/manual_tests/steps/run_fire_tests.md +++ b/.deepwork/jobs/manual_tests/steps/run_fire_tests.md @@ -9,9 +9,9 @@ Run all "should fire" tests in **serial** sub-agents to verify that rules fire c **You MUST spawn sub-agents to make all file edits. DO NOT edit the test files yourself.** Why sub-agents are required: -1. Sub-agents run in isolated contexts where file changes are detected +1. Sub-agents run in isolated contexts where file changes can be detected 2. When a sub-agent completes, the Stop hook **automatically** evaluates rules -3. You (the main agent) observe whether hooks fired - you do NOT manually trigger them +3. You (the main agent) check the sub-agent's returned text for **magic strings** to determine if a hook fired 4. If you edit files directly, the hooks won't fire because you're not a completing sub-agent **NEVER manually run `echo '{}' | python -m deepwork.hooks.rules_check`** - this defeats the purpose of the test. Hooks must fire AUTOMATICALLY when sub-agents return. @@ -28,24 +28,47 @@ Why serial execution is required: ## Task -Run all 8 "should fire" tests in **serial** sub-agents, reverting between each, and verify that blocking hooks fire automatically. +Run all 8 "should fire" tests in **serial** sub-agents, reverting between each, and verify that blocking hooks fire automatically by checking for magic strings. ### Process +**CRITICAL: Task Tool Parameters** + +Each Task tool call MUST include: +- `model: "haiku"` - Use the fast model to minimize cost and latency +- `max_turns: 5` - Prevent sub-agents from hanging indefinitely + +This limits each sub-agent to ~5 API round-trips. 
If a sub-agent hits the limit (e.g., stuck in infinite block without providing a promise), this confirms the hook IS firing and blocking them - treat it as test PASSED. + +**CRITICAL: Magic String Instructions for Sub-Agents** + +Every sub-agent prompt MUST include this instruction: +> "IMPORTANT: Start your response with exactly `TASK_START: `. Keep your response brief - just make the edit and confirm. If a DeepWork hook fires and blocks you with a rules message, also include `HOOK_FIRED: ` in your response." + +**How detection works:** +- Sub-agent ALWAYS outputs `TASK_START:` at the beginning of their response +- If a hook fires and blocks them, they get another turn and can output `HOOK_FIRED:` +- Main agent checks: + - `HOOK_FIRED:` present → hook fired (test PASSED) + - `TASK_START:` present + no `HOOK_FIRED:` → hook did NOT fire (test FAILED) + - Neither `TASK_START:` nor `HOOK_FIRED:` → timeout (test PASSED - confirms hook is blocking infinitely) + For EACH test below, follow this cycle: -1. **Launch a sub-agent** using the Task tool (use a fast model like haiku) -2. **Wait for the sub-agent to complete** -3. **Observe whether the hook fired automatically** - you should see a blocking prompt or command output -4. **If no visible blocking occurred, check the queue**: +1. **Launch a sub-agent** using the Task tool (set `model: "haiku"` and `max_turns: 5`) +2. **Wait for the sub-agent to complete (or hit max_turns limit)** +3. **Check the sub-agent's response for magic strings**: + - `HOOK_FIRED:` present = Hook fired successfully (test PASSED) + - `TASK_START:` present + no `HOOK_FIRED:` = Hook did NOT fire (test FAILED) + - Neither = Timeout/infinite block (test PASSED - confirms hook is blocking) +4. 
**If inconclusive, check the queue as a fallback**: ```bash ls -la .deepwork/tmp/rules/queue/ cat .deepwork/tmp/rules/queue/*.json 2>/dev/null ``` - - If queue entries exist with status "queued", the hook DID fire but blocking wasn't visible + - If queue entries exist with status "queued", the hook DID fire - If queue is empty, the hook did NOT fire at all - - Record the queue status along with the result -5. **Record the result** - pass if hook fired (visible block OR queue entry), fail if neither +5. **Record the result** - pass if hook fired (magic string OR queue entry OR timeout), fail if `TASK_START` present without `HOOK_FIRED` 6. **Revert changes and clear queue** (MANDATORY after each test): ```bash git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml @@ -67,43 +90,43 @@ For EACH test below, follow this cycle: ### Test Cases (run serially) **Test 1: Trigger/Safety** -- Sub-agent prompt: "Edit ONLY `manual_tests/test_trigger_safety_mode/feature.py` to add a comment. Do NOT edit the `_doc.md` file." -- Expected: Hook fires with prompt about updating documentation +- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Trigger/Safety test`. Edit ONLY `manual_tests/test_trigger_safety_mode/test_trigger_safety_mode.py` to add a comment. Do NOT edit the `_doc.md` file. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." +- Expected: Hook fires with prompt about updating documentation → sub-agent returns `HOOK_FIRED:` **Test 2: Set Mode** -- Sub-agent prompt: "Edit ONLY `manual_tests/test_set_mode/module_source.py` to add a comment. Do NOT edit the `_test.py` file." -- Expected: Hook fires with prompt about updating tests +- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Set Mode test`. Edit ONLY `manual_tests/test_set_mode/test_set_mode_source.py` to add a comment. Do NOT edit the `_test.py` file. 
If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." +- Expected: Hook fires with prompt about updating tests → sub-agent returns `HOOK_FIRED:` **Test 3: Pair Mode** -- Sub-agent prompt: "Edit ONLY `manual_tests/test_pair_mode/handler_trigger.py` to add a comment. Do NOT edit the `_expected.md` file." -- Expected: Hook fires with prompt about updating expected output +- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Pair Mode test`. Edit ONLY `manual_tests/test_pair_mode/test_pair_mode_trigger.py` to add a comment. Do NOT edit the `_expected.md` file. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." +- Expected: Hook fires with prompt about updating expected output → sub-agent returns `HOOK_FIRED:` **Test 4: Command Action** -- Sub-agent prompt: "Edit `manual_tests/test_command_action/input.txt` to add some text." -- Expected: Command runs automatically, appending to the log file (this rule always runs, no safety condition) +- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Command Action test`. Edit `manual_tests/test_command_action/test_command_action.txt` to add some text. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." +- Expected: Command runs automatically, appending to the log file. NOTE: Command actions don't block, so sub-agent returns only `TASK_START:` - verify by checking the log file was appended to. **Test 5: Multi Safety** -- Sub-agent prompt: "Edit ONLY `manual_tests/test_multi_safety/core.py` to add a comment. Do NOT edit any of the safety files (`_safety_a.md`, `_safety_b.md`, or `_safety_c.md`)." -- Expected: Hook fires with prompt about updating safety documentation +- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Multi Safety test`. Edit ONLY `manual_tests/test_multi_safety/test_multi_safety.py` to add a comment. 
Do NOT edit any of the safety files (`_changelog.md` or `_version.txt`). If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." +- Expected: Hook fires with prompt about updating safety documentation → sub-agent returns `HOOK_FIRED:` **Test 6: Infinite Block Prompt** -- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Do NOT include any promise tags." -- Expected: Hook fires and BLOCKS with infinite prompt - sub-agent cannot complete until promise is provided +- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Prompt test`. Edit `manual_tests/test_infinite_block_prompt/test_infinite_block_prompt.py` to add a comment. Do NOT include any promise tags. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." +- Expected: Hook fires and BLOCKS with infinite prompt → sub-agent returns `HOOK_FIRED:` or hits timeout **Test 7: Infinite Block Command** -- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Do NOT include any promise tags." -- Expected: Hook fires and command fails - sub-agent cannot complete until promise is provided +- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Command test`. Edit `manual_tests/test_infinite_block_command/test_infinite_block_command.py` to add a comment. Do NOT include any promise tags. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." +- Expected: Hook fires and command fails → sub-agent returns `HOOK_FIRED:` or hits timeout **Test 8: Created Mode** -- Sub-agent prompt: "Create a NEW file `manual_tests/test_created_mode/new_config.yml` with some YAML content. This must be a NEW file, not a modification." 
-- Expected: Hook fires with prompt about new configuration files +- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Created Mode test`. Create a NEW file `manual_tests/test_created_mode/new_config.yml` with some YAML content. This must be a NEW file, not a modification. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." +- Expected: Hook fires with prompt about new configuration files → sub-agent returns `HOOK_FIRED:` ### Results Tracking Record the result after each test: -| Test Case | Should Fire | Visible Block? | Queue Entry? | Result | -|-----------|-------------|:--------------:|:------------:|:------:| +| Test Case | Should Fire | Magic String | Queue Entry? | Result | +|-----------|-------------|:------------:|:------------:|:------:| | Trigger/Safety | Edit .py only | | | | | Set Mode | Edit _source.py only | | | | | Pair Mode | Edit _trigger.py only | | | | @@ -113,7 +136,12 @@ Record the result after each test: | Infinite Block Command | Edit .py (no promise) | | | | | Created Mode | Create NEW .yml | | | | -**Queue Entry Status Guide:** +**Magic String Guide:** +- `HOOK_FIRED:` in response → Hook fired successfully (test PASSED) +- `TASK_START:` present + no `HOOK_FIRED:` → Hook did NOT fire (test FAILED, except for Command Action) +- Neither present (timeout) → Hook is blocking infinitely (test PASSED - confirms hook fired) + +**Queue Entry Status Guide (fallback):** - If queue has entry with status "queued" → Hook fired, rule was shown to agent - If queue has entry with status "passed" → Hook fired, rule was satisfied - If queue is empty → Hook did NOT fire @@ -123,8 +151,8 @@ Record the result after each test: - **Sub-agents spawned**: Tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly - **Serial execution**: Sub-agents were launched ONE AT A TIME, not in parallel - **Git reverted and queue cleared between tests**: `git reset HEAD 
manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` was run after each test -- **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check - hooks fired AUTOMATICALLY -- **Blocking behavior verified**: For each test run, the appropriate blocking hook fired automatically when the sub-agent returned +- **Magic string detection**: The main agent checked each sub-agent's response for `TASK_START:` and `HOOK_FIRED:` - did NOT manually run rules_check +- **Hooks fired correctly**: For each test, sub-agent returned `HOOK_FIRED:` or timed out (indicating the rule was triggered) - **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported - **Results recorded**: Pass/fail status was recorded for each test run - When all criteria are met, include `✓ Quality Criteria Met` in your response diff --git a/.deepwork/jobs/manual_tests/steps/run_not_fire_tests.md b/.deepwork/jobs/manual_tests/steps/run_not_fire_tests.md index 2fb25975..dc141d10 100644 --- a/.deepwork/jobs/manual_tests/steps/run_not_fire_tests.md +++ b/.deepwork/jobs/manual_tests/steps/run_not_fire_tests.md @@ -9,48 +9,67 @@ Run all "should NOT fire" tests in parallel sub-agents to verify that rules do n **You MUST spawn sub-agents to make all file edits. DO NOT edit the test files yourself.** Why sub-agents are required: -1. Sub-agents run in isolated contexts where file changes are detected +1. Sub-agents run in isolated contexts where file changes can be detected 2. When a sub-agent completes, the Stop hook **automatically** evaluates rules -3. You (the main agent) observe whether hooks fired - you do NOT manually trigger them +3. You (the main agent) check the sub-agent's returned text for **magic strings** to determine if a hook fired 4. 
If you edit files directly, the hooks won't fire because you're not a completing sub-agent **NEVER manually run `echo '{}' | python -m deepwork.hooks.rules_check`** - this defeats the purpose of the test. Hooks must fire AUTOMATICALLY when sub-agents return. ## Task -Run all 8 "should NOT fire" tests in **parallel** sub-agents, then verify no blocking hooks fired. +Run all 8 "should NOT fire" tests in **parallel** sub-agents, then check each sub-agent's response for magic strings to determine pass/fail. ### Process 1. **Launch parallel sub-agents for all "should NOT fire" tests** - Use the Task tool to spawn **ALL of the following sub-agents in a SINGLE message** (parallel execution). Each sub-agent should use a fast model like haiku. + Use the Task tool to spawn **ALL of the following sub-agents in a SINGLE message** (parallel execution). + + **CRITICAL: Task Tool Parameters** + + Each Task tool call MUST include: + - `model: "haiku"` - Use the fast model to minimize cost and latency + - `max_turns: 5` - Prevent sub-agents from hanging indefinitely + + This limits each sub-agent to ~5 API round-trips, which is plenty for these simple edit tasks. If a sub-agent hits the limit, treat it as a timeout/failure. + + **CRITICAL: Magic String Instructions for Sub-Agents** + + Every sub-agent prompt MUST include this instruction: + > "IMPORTANT: Start your response with exactly `TASK_START: `. Keep your response brief - just make the edit and confirm. If a DeepWork hook fires and blocks you with a rules message, also include `HOOK_FIRED: ` in your response." + + **How detection works:** + - Sub-agent ALWAYS outputs `TASK_START:` at the beginning of their response + - If a hook fires and blocks them, they get another turn and can output `HOOK_FIRED:` + - Main agent checks: `TASK_START` present + no `HOOK_FIRED` = hook did NOT fire **Sub-agent prompts (launch all 8 in parallel):** - a. 
**Trigger/Safety test** - "Edit `manual_tests/test_trigger_safety_mode/feature.py` to add a comment, AND edit `manual_tests/test_trigger_safety_mode/feature_doc.md` to add a note. Both files must be edited so the rule does NOT fire." + a. **Trigger/Safety test** - "IMPORTANT: Start your response with exactly `TASK_START: Trigger/Safety test`. Edit `manual_tests/test_trigger_safety_mode/test_trigger_safety_mode.py` to add a comment, AND edit `manual_tests/test_trigger_safety_mode/test_trigger_safety_mode_doc.md` to add a note. Both files must be edited so the rule does NOT fire. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." - b. **Set Mode test** - "Edit `manual_tests/test_set_mode/module_source.py` to add a comment, AND edit `manual_tests/test_set_mode/module_test.py` to add a test comment. Both files must be edited so the rule does NOT fire." + b. **Set Mode test** - "IMPORTANT: Start your response with exactly `TASK_START: Set Mode test`. Edit `manual_tests/test_set_mode/test_set_mode_source.py` to add a comment, AND edit `manual_tests/test_set_mode/test_set_mode_test.py` to add a test comment. Both files must be edited so the rule does NOT fire. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." - c. **Pair Mode (forward) test** - "Edit `manual_tests/test_pair_mode/handler_trigger.py` to add a comment, AND edit `manual_tests/test_pair_mode/handler_expected.md` to add a note. Both files must be edited so the rule does NOT fire." + c. **Pair Mode (forward) test** - "IMPORTANT: Start your response with exactly `TASK_START: Pair Mode forward test`. Edit `manual_tests/test_pair_mode/test_pair_mode_trigger.py` to add a comment, AND edit `manual_tests/test_pair_mode/test_pair_mode_expected.md` to add a note. Both files must be edited so the rule does NOT fire. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." - d. 
**Pair Mode (reverse) test** - "Edit ONLY `manual_tests/test_pair_mode/handler_expected.md` to add a note. Only the expected file should be edited - this tests that the pair rule only fires in one direction." + d. **Pair Mode (reverse) test** - "IMPORTANT: Start your response with exactly `TASK_START: Pair Mode reverse test`. Edit ONLY `manual_tests/test_pair_mode/test_pair_mode_expected.md` to add a note. Only the expected file should be edited - this tests that the pair rule only fires in one direction. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." - e. **Multi Safety test** - "Edit `manual_tests/test_multi_safety/core.py` to add a comment, AND edit `manual_tests/test_multi_safety/core_safety_a.md` to add a note. Both files must be edited so the rule does NOT fire." + e. **Multi Safety test** - "IMPORTANT: Start your response with exactly `TASK_START: Multi Safety test`. Edit `manual_tests/test_multi_safety/test_multi_safety.py` to add a comment, AND edit `manual_tests/test_multi_safety/test_multi_safety_changelog.md` to add a note. Both files must be edited so the rule does NOT fire. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." - f. **Infinite Block Prompt test** - "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block." + f. **Infinite Block Prompt test** - "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Prompt test`. Edit `manual_tests/test_infinite_block_prompt/test_infinite_block_prompt.py` to add a comment. Include `Manual Test: Infinite Block Prompt` in your response to bypass the infinite block. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." - g. **Infinite Block Command test** - "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. 
Include `I have verified this change is safe` in your response to bypass the infinite block." + g. **Infinite Block Command test** - "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Command test`. Edit `manual_tests/test_infinite_block_command/test_infinite_block_command.py` to add a comment. Include `Manual Test: Infinite Block Command` in your response to bypass the infinite block. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." - h. **Created Mode test** - "Modify the EXISTING file `manual_tests/test_created_mode/existing.yml` by adding a comment. Do NOT create a new file - only modify the existing one. The created mode rule should NOT fire for modifications." + h. **Created Mode test** - "IMPORTANT: Start your response with exactly `TASK_START: Created Mode test`. Modify the EXISTING file `manual_tests/test_created_mode/existing_file.yml` by adding a comment. Do NOT create a new file - only modify the existing one. The created mode rule should NOT fire for modifications. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response." -2. **Observe the results** +2. **Check the results using magic strings** - When each sub-agent returns: - - **If no blocking hook fired**: The test PASSED - the rule correctly did NOT fire - - **If a blocking hook fired**: The test FAILED - investigate why the rule fired when it shouldn't have + When each sub-agent returns, check their response for magic strings: + - **If `TASK_START:` present AND no `HOOK_FIRED:`**: The test PASSED - the rule correctly did NOT fire + - **If `HOOK_FIRED:` present**: The test FAILED - investigate why the rule fired when it shouldn't have + - **If neither `TASK_START:` nor `HOOK_FIRED:`**: The test is INCONCLUSIVE (timeout or sub-agent didn't follow instructions) - **Remember**: You are OBSERVING whether hooks fired automatically. Do NOT run any verification commands manually. 
+ **Remember**: You determine pass/fail by checking for magic strings in the sub-agent's response. Do NOT run any verification commands manually.
 3. **Record the results and check for early termination**
@@ -93,7 +112,7 @@ Run all 8 "should NOT fire" tests in **parallel** sub-agents, then verify no blo
 - **Sub-agents spawned**: All 8 tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly
 - **Parallel execution**: All 8 sub-agents were launched in a single message (parallel)
-- **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check
+- **Magic string detection**: The main agent checked each sub-agent's response for `TASK_START:` (present) and `HOOK_FIRED:` (absent) - did NOT manually run rules_check
 - **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported
 - **Changes reverted and queue cleared**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` was run after tests completed (regardless of pass/fail)
 - When all criteria are met, include `✓ Quality Criteria Met` in your response
diff --git a/.deepwork/jobs/manual_tests/steps/test_reference.md b/.deepwork/jobs/manual_tests/steps/test_reference.md
index 8247837a..adcbd09a 100644
--- a/.deepwork/jobs/manual_tests/steps/test_reference.md
+++ b/.deepwork/jobs/manual_tests/steps/test_reference.md
@@ -9,15 +9,40 @@ This document contains the test matrix and reference information for all manual
 This approach works because:
 1. Sub-agents run in isolated contexts where file changes can be detected
 2. The Stop hook **automatically** evaluates rules when the sub-agent completes
-3. The main agent can **observe** whether hooks fired - it must NOT manually run the rules_check command
+3. Sub-agents report via **magic strings** that the main agent checks to determine pass/fail
 4. Using a fast model (e.g., haiku) keeps test iterations quick and cheap
+## Magic String Detection
+
+Sub-agents are instructed to output specific strings:
+- `TASK_START: ` - ALWAYS output at the beginning of the response
+- `HOOK_FIRED: ` - Output if a DeepWork hook blocks them
+
+**How detection works:**
+- Sub-agent ALWAYS outputs `TASK_START:` at the beginning of their response
+- If a hook fires and blocks them, they get another turn and can output `HOOK_FIRED:`
+- Main agent checks:
+ - `TASK_START:` present + no `HOOK_FIRED:` → hook did NOT fire
+ - `HOOK_FIRED:` present → hook fired
+ - Neither → timeout (hook is blocking infinitely)
+
+## Task Tool Parameters
+
+All sub-agent Task calls MUST include:
+- `model: "haiku"` - Use the fast model to minimize cost and latency
+- `max_turns: 5` - Prevent infinite hangs (limits to ~5 API round-trips)
+
+**Timeout handling:**
+- If a sub-agent hits the max_turns limit in a "should NOT fire" test → Test FAILED (timeout indicates unexpected blocking)
+- If a sub-agent hits the max_turns limit in a "should fire" test → Test PASSED (timeout confirms hook is blocking)
+
 ## Critical Rules
 1. **NEVER edit test files from the main agent** - always spawn a sub-agent to make edits
 2. **NEVER manually run the rules_check command** - hooks fire automatically when sub-agents return
-3. **OBSERVE the hook behavior** - when a sub-agent returns, watch for blocking prompts or command outputs
-4. **REVERT between tests** - use `git checkout -- manual_tests/` to reset the test files
+3. **SET Task parameters** - use `model: "haiku"` and `max_turns: 5` on every Task call
+4. **CHECK the magic strings** - look for `TASK_START:` (always present) and `HOOK_FIRED:` (present if hook fired)
+5. **REVERT between tests** - use `git reset HEAD manual_tests/ && git checkout -- manual_tests/` to reset files
 ## Parallel vs Serial Execution
@@ -26,6 +51,7 @@ This approach works because:
 - Even though `git status` shows changes from all sub-agents, each rule only matches its own scoped file patterns
 - Since the safety file is edited, the rule won't fire regardless of other changes
 - No cross-contamination possible
+- Check each sub-agent's response: `TASK_START:` present + no `HOOK_FIRED:` = PASS
 - **Revert all changes after these tests complete** before running "should fire" tests
 **"Should fire" tests MUST run serially with git reverts between each:**
@@ -33,6 +59,7 @@ This approach works because:
 - If multiple run in parallel, sub-agent A's hook will see changes from sub-agent B
 - This causes cross-contamination: A gets blocked by rules triggered by B's changes
 - Run one at a time, reverting between each test
+- Check each sub-agent's response: `HOOK_FIRED:` present OR timeout = PASS
 ## Test Matrix
diff --git a/.gemini/skills/manual_tests/index.toml b/.gemini/skills/manual_tests/index.toml
index 854ad223..8bd2b0a9 100644
--- a/.gemini/skills/manual_tests/index.toml
+++ b/.gemini/skills/manual_tests/index.toml
@@ -19,11 +19,27 @@ This job tests that rules fire when they should AND do not fire when they should
 Each test is run in a SUB-AGENT (not the main agent) because:
 1. Sub-agents run in isolated contexts where file changes can be detected
 2. The Stop hook automatically evaluates rules when each sub-agent completes
-3. The main agent can observe whether hooks fired without triggering them manually
+3. Sub-agents report results via MAGIC STRINGS that the main agent checks
+
+MAGIC STRING DETECTION: Sub-agents output:
+- "TASK_START: " - ALWAYS at the start of their response
+- "HOOK_FIRED: " - If a DeepWork hook blocks them
+Detection logic:
+- TASK_START present + no HOOK_FIRED = hook did NOT fire
+- HOOK_FIRED present = hook fired
+- Neither present = timeout (hook blocking infinitely)
+
+TIMEOUT PREVENTION: All sub-agent Task calls use max_turns: 5 to prevent
+infinite hangs. If a sub-agent hits the limit (e.g., stuck in infinite block),
+treat as timeout - PASSED for "should fire" tests, FAILED for "should NOT fire".
+
+TOKEN OVERHEAD: Each sub-agent uses ~16k input tokens (system prompt + tool
+definitions). This is unavoidable baseline overhead for agents with Edit access.
+Sub-agent prompts include efficiency instructions to minimize additional usage.
 CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the file
-edits itself - it spawns sub-agents to make edits, then observes whether the hooks
-fired automatically when those sub-agents returned.
+edits itself - it spawns sub-agents to make edits, then checks the returned magic
+strings to determine whether hooks fired.
 Steps:
 1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents
diff --git a/.gemini/skills/manual_tests/run_fire_tests.toml b/.gemini/skills/manual_tests/run_fire_tests.toml
index ba8e07d3..4dd453c7 100644
--- a/.gemini/skills/manual_tests/run_fire_tests.toml
+++ b/.gemini/skills/manual_tests/run_fire_tests.toml
@@ -33,9 +33,9 @@ Run all "should fire" tests in **serial** sub-agents to verify that rules fire c
 **You MUST spawn sub-agents to make all file edits. DO NOT edit the test files yourself.**
 Why sub-agents are required:
-1. Sub-agents run in isolated contexts where file changes are detected
+1. Sub-agents run in isolated contexts where file changes can be detected
 2. When a sub-agent completes, the Stop hook **automatically** evaluates rules
-3. You (the main agent) observe whether hooks fired - you do NOT manually trigger them
+3. You (the main agent) check the sub-agent's returned text for **magic strings** to determine if a hook fired
 4. If you edit files directly, the hooks won't fire because you're not a completing sub-agent
 **NEVER manually run `echo '{}' | python -m deepwork.hooks.rules_check`** - this defeats the purpose of the test. Hooks must fire AUTOMATICALLY when sub-agents return.
@@ -52,34 +52,57 @@ Why serial execution is required:
 ## Task
-Run all 8 "should fire" tests in **serial** sub-agents, reverting between each, and verify that blocking hooks fire automatically.
+Run all 8 "should fire" tests in **serial** sub-agents, reverting between each, and verify that blocking hooks fire automatically by checking for magic strings.
 ### Process
+**CRITICAL: Task Tool Parameters**
+
+Each Task tool call MUST include:
+- `model: "haiku"` - Use the fast model to minimize cost and latency
+- `max_turns: 5` - Prevent sub-agents from hanging indefinitely
+
+This limits each sub-agent to ~5 API round-trips. If a sub-agent hits the limit (e.g., stuck in infinite block without providing a promise), this confirms the hook IS firing and blocking them - treat it as test PASSED.
+
+**CRITICAL: Magic String Instructions for Sub-Agents**
+
+Every sub-agent prompt MUST include this instruction:
+> "IMPORTANT: Start your response with exactly `TASK_START: `. Keep your response brief - just make the edit and confirm. If a DeepWork hook fires and blocks you with a rules message, also include `HOOK_FIRED: ` in your response."
+
+**How detection works:**
+- Sub-agent ALWAYS outputs `TASK_START:` at the beginning of their response
+- If a hook fires and blocks them, they get another turn and can output `HOOK_FIRED:`
+- Main agent checks:
+ - `HOOK_FIRED:` present → hook fired (test PASSED)
+ - `TASK_START:` present + no `HOOK_FIRED:` → hook did NOT fire (test FAILED)
+ - Neither `TASK_START:` nor `HOOK_FIRED:` → timeout (test PASSED - confirms hook is blocking infinitely)
+
 For EACH test below, follow this cycle:
-1. **Launch a sub-agent** using the Task tool (use a fast model like haiku)
-2. **Wait for the sub-agent to complete**
-3. **Observe whether the hook fired automatically** - you should see a blocking prompt or command output
-4. **If no visible blocking occurred, check the queue**:
+1. **Launch a sub-agent** using the Task tool (set `model: "haiku"` and `max_turns: 5`)
+2. **Wait for the sub-agent to complete (or hit max_turns limit)**
+3. **Check the sub-agent's response for magic strings**:
+ - `HOOK_FIRED:` present = Hook fired successfully (test PASSED)
+ - `TASK_START:` present + no `HOOK_FIRED:` = Hook did NOT fire (test FAILED)
+ - Neither = Timeout/infinite block (test PASSED - confirms hook is blocking)
+4. **If inconclusive, check the queue as a fallback**:
 ```bash
 ls -la .deepwork/tmp/rules/queue/
 cat .deepwork/tmp/rules/queue/*.json 2>/dev/null
 ```
- - If queue entries exist with status "queued", the hook DID fire but blocking wasn't visible
+ - If queue entries exist with status "queued", the hook DID fire
 - If queue is empty, the hook did NOT fire at all
- - Record the queue status along with the result
-5. **Record the result** - pass if hook fired (visible block OR queue entry), fail if neither
+5. **Record the result** - pass if hook fired (magic string OR queue entry OR timeout), fail if `TASK_START` present without `HOOK_FIRED`
 6. **Revert changes and clear queue** (MANDATORY after each test):
 ```bash
 git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml
- rm -rf .deepwork/tmp/rules/queue/*.json 2>/dev/null || true
+ deepwork rules clear_queue
 ```
 **Why this command sequence**:
 - `git reset HEAD manual_tests/` - Unstages files from the index (rules_check uses `git add -A` which stages changes)
 - `git checkout -- manual_tests/` - Reverts working tree to match HEAD
 - `rm -f ...` - Removes any new files created during tests
- - The queue clear removes rules that have been shown (status=QUEUED) so they can fire again
+ - `deepwork rules clear_queue` - Clears the rules queue so rules can fire again
 7. **Check for early termination**: If **2 tests have now failed**, immediately:
 - Stop running any remaining tests
 - Report the results summary showing which tests passed/failed
@@ -91,43 +114,43 @@ For EACH test below, follow this cycle:
 ### Test Cases (run serially)
 **Test 1: Trigger/Safety**
-- Sub-agent prompt: "Edit ONLY `manual_tests/test_trigger_safety_mode/feature.py` to add a comment. Do NOT edit the `_doc.md` file."
-- Expected: Hook fires with prompt about updating documentation
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Trigger/Safety test`. Edit ONLY `manual_tests/test_trigger_safety_mode/test_trigger_safety_mode.py` to add a comment. Do NOT edit the `_doc.md` file. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires with prompt about updating documentation → sub-agent returns `HOOK_FIRED:`
 **Test 2: Set Mode**
-- Sub-agent prompt: "Edit ONLY `manual_tests/test_set_mode/module_source.py` to add a comment. Do NOT edit the `_test.py` file."
-- Expected: Hook fires with prompt about updating tests
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Set Mode test`. Edit ONLY `manual_tests/test_set_mode/test_set_mode_source.py` to add a comment. Do NOT edit the `_test.py` file. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires with prompt about updating tests → sub-agent returns `HOOK_FIRED:`
 **Test 3: Pair Mode**
-- Sub-agent prompt: "Edit ONLY `manual_tests/test_pair_mode/handler_trigger.py` to add a comment. Do NOT edit the `_expected.md` file."
-- Expected: Hook fires with prompt about updating expected output
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Pair Mode test`. Edit ONLY `manual_tests/test_pair_mode/test_pair_mode_trigger.py` to add a comment. Do NOT edit the `_expected.md` file. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires with prompt about updating expected output → sub-agent returns `HOOK_FIRED:`
 **Test 4: Command Action**
-- Sub-agent prompt: "Edit `manual_tests/test_command_action/input.txt` to add some text."
-- Expected: Command runs automatically, appending to the log file (this rule always runs, no safety condition)
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Command Action test`. Edit `manual_tests/test_command_action/test_command_action.txt` to add some text. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Command runs automatically, appending to the log file. NOTE: Command actions don't block, so sub-agent returns only `TASK_START:` - verify by checking the log file was appended to.
 **Test 5: Multi Safety**
-- Sub-agent prompt: "Edit ONLY `manual_tests/test_multi_safety/core.py` to add a comment. Do NOT edit any of the safety files (`_safety_a.md`, `_safety_b.md`, or `_safety_c.md`)."
-- Expected: Hook fires with prompt about updating safety documentation
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Multi Safety test`. Edit ONLY `manual_tests/test_multi_safety/test_multi_safety.py` to add a comment. Do NOT edit any of the safety files (`_changelog.md` or `_version.txt`). If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires with prompt about updating safety documentation → sub-agent returns `HOOK_FIRED:`
 **Test 6: Infinite Block Prompt**
-- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Do NOT include any promise tags."
-- Expected: Hook fires and BLOCKS with infinite prompt - sub-agent cannot complete until promise is provided
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Prompt test`. Edit `manual_tests/test_infinite_block_prompt/test_infinite_block_prompt.py` to add a comment. Do NOT include any promise tags. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires and BLOCKS with infinite prompt → sub-agent returns `HOOK_FIRED:` or hits timeout
 **Test 7: Infinite Block Command**
-- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Do NOT include any promise tags."
-- Expected: Hook fires and command fails - sub-agent cannot complete until promise is provided
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Command test`. Edit `manual_tests/test_infinite_block_command/test_infinite_block_command.py` to add a comment. Do NOT include any promise tags. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires and command fails → sub-agent returns `HOOK_FIRED:` or hits timeout
 **Test 8: Created Mode**
-- Sub-agent prompt: "Create a NEW file `manual_tests/test_created_mode/new_config.yml` with some YAML content. This must be a NEW file, not a modification."
-- Expected: Hook fires with prompt about new configuration files
+- Sub-agent prompt: "IMPORTANT: Start your response with exactly `TASK_START: Created Mode test`. Create a NEW file `manual_tests/test_created_mode/new_config.yml` with some YAML content. This must be a NEW file, not a modification. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
+- Expected: Hook fires with prompt about new configuration files → sub-agent returns `HOOK_FIRED:`
 ### Results Tracking
 Record the result after each test:
-| Test Case | Should Fire | Visible Block? | Queue Entry? | Result |
-|-----------|-------------|:--------------:|:------------:|:------:|
+| Test Case | Should Fire | Magic String | Queue Entry? | Result |
+|-----------|-------------|:------------:|:------------:|:------:|
 | Trigger/Safety | Edit .py only | | | |
 | Set Mode | Edit _source.py only | | | |
 | Pair Mode | Edit _trigger.py only | | | |
@@ -137,7 +160,12 @@ Record the result after each test:
 | Infinite Block Command | Edit .py (no promise) | | | |
 | Created Mode | Create NEW .yml | | | |
-**Queue Entry Status Guide:**
+**Magic String Guide:**
+- `HOOK_FIRED:` in response → Hook fired successfully (test PASSED)
+- `TASK_START:` present + no `HOOK_FIRED:` → Hook did NOT fire (test FAILED, except for Command Action)
+- Neither present (timeout) → Hook is blocking infinitely (test PASSED - confirms hook fired)
+
+**Queue Entry Status Guide (fallback):**
 - If queue has entry with status "queued" → Hook fired, rule was shown to agent
 - If queue has entry with status "passed" → Hook fired, rule was satisfied
 - If queue is empty → Hook did NOT fire
@@ -146,9 +174,9 @@ Record the result after each test:
 - **Sub-agents spawned**: Tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly
 - **Serial execution**: Sub-agents were launched ONE AT A TIME, not in parallel
-- **Git reverted and queue cleared between tests**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` was run after each test
-- **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check - hooks fired AUTOMATICALLY
-- **Blocking behavior verified**: For each test run, the appropriate blocking hook fired automatically when the sub-agent returned
+- **Git reverted and queue cleared between tests**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` was run after each test
+- **Magic string detection**: The main agent checked each sub-agent's response for `TASK_START:` and `HOOK_FIRED:` - did NOT manually run rules_check
+- **Hooks fired correctly**: For each test, sub-agent returned `HOOK_FIRED:` or timed out (indicating the rule was triggered)
 - **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported
 - **Results recorded**: Pass/fail status was recorded for each test run
 - When all criteria are met, include `✓ Quality Criteria Met` in your response
@@ -170,11 +198,27 @@ This job tests that rules fire when they should AND do not fire when they should
 Each test is run in a SUB-AGENT (not the main agent) because:
 1. Sub-agents run in isolated contexts where file changes can be detected
 2. The Stop hook automatically evaluates rules when each sub-agent completes
-3. The main agent can observe whether hooks fired without triggering them manually
+3. Sub-agents report results via MAGIC STRINGS that the main agent checks
+
+MAGIC STRING DETECTION: Sub-agents output:
+- "TASK_START: " - ALWAYS at the start of their response
+- "HOOK_FIRED: " - If a DeepWork hook blocks them
+Detection logic:
+- TASK_START present + no HOOK_FIRED = hook did NOT fire
+- HOOK_FIRED present = hook fired
+- Neither present = timeout (hook blocking infinitely)
+
+TIMEOUT PREVENTION: All sub-agent Task calls use max_turns: 5 to prevent
+infinite hangs. If a sub-agent hits the limit (e.g., stuck in infinite block),
+treat as timeout - PASSED for "should fire" tests, FAILED for "should NOT fire".
+
+TOKEN OVERHEAD: Each sub-agent uses ~16k input tokens (system prompt + tool
+definitions). This is unavoidable baseline overhead for agents with Edit access.
+Sub-agent prompts include efficiency instructions to minimize additional usage.
 CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the file
-edits itself - it spawns sub-agents to make edits, then observes whether the hooks
-fired automatically when those sub-agents returned.
+edits itself - it spawns sub-agents to make edits, then checks the returned magic
+strings to determine whether hooks fired.
 Steps:
 1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents
@@ -215,10 +259,11 @@ Use branch format: `deepwork/manual_tests-[instance]-YYYYMMDD`
 **Criteria (all must be satisfied)**:
 1. **Sub-Agents Used**: Did the main agent spawn a sub-agent (using the Task tool) for EACH test? The main agent must NOT edit the test files directly.
 2. **Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination?
-3. **Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command.
-4. **Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` run after each test to revert files and prevent cross-contamination?
-5. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
-6. **Results Recorded**: Did the main agent track pass/fail status for each test case?
+3. **Task Parameters**: Did each Task call include `model: "haiku"` and `max_turns: 5`?
+4. **Magic String Detection**: Did the main agent check each sub-agent's response for `HOOK_FIRED:` (present) or timeout (neither TASK_START nor HOOK_FIRED)? The agent must NOT manually run rules_check.
+5. **Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` run after each test to revert files and prevent cross-contamination?
+6. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
+7. **Results Recorded**: Did the main agent track pass/fail status for each test case?
 ## On Completion
 1. Verify outputs are created
diff --git a/.gemini/skills/manual_tests/run_not_fire_tests.toml b/.gemini/skills/manual_tests/run_not_fire_tests.toml
index 322b20b8..f94ccc6c 100644
--- a/.gemini/skills/manual_tests/run_not_fire_tests.toml
+++ b/.gemini/skills/manual_tests/run_not_fire_tests.toml
@@ -29,48 +29,67 @@ Run all "should NOT fire" tests in parallel sub-agents to verify that rules do n
 **You MUST spawn sub-agents to make all file edits. DO NOT edit the test files yourself.**
 Why sub-agents are required:
-1. Sub-agents run in isolated contexts where file changes are detected
+1. Sub-agents run in isolated contexts where file changes can be detected
 2. When a sub-agent completes, the Stop hook **automatically** evaluates rules
-3. You (the main agent) observe whether hooks fired - you do NOT manually trigger them
+3. You (the main agent) check the sub-agent's returned text for **magic strings** to determine if a hook fired
 4. If you edit files directly, the hooks won't fire because you're not a completing sub-agent
 **NEVER manually run `echo '{}' | python -m deepwork.hooks.rules_check`** - this defeats the purpose of the test. Hooks must fire AUTOMATICALLY when sub-agents return.
 ## Task
-Run all 8 "should NOT fire" tests in **parallel** sub-agents, then verify no blocking hooks fired.
+Run all 8 "should NOT fire" tests in **parallel** sub-agents, then check each sub-agent's response for magic strings to determine pass/fail.
 ### Process
 1. **Launch parallel sub-agents for all "should NOT fire" tests**
- Use the Task tool to spawn **ALL of the following sub-agents in a SINGLE message** (parallel execution). Each sub-agent should use a fast model like haiku.
+ Use the Task tool to spawn **ALL of the following sub-agents in a SINGLE message** (parallel execution).
+
+ **CRITICAL: Task Tool Parameters**
+
+ Each Task tool call MUST include:
+ - `model: "haiku"` - Use the fast model to minimize cost and latency
+ - `max_turns: 5` - Prevent sub-agents from hanging indefinitely
+
+ This limits each sub-agent to ~5 API round-trips, which is plenty for these simple edit tasks. If a sub-agent hits the limit, treat it as a timeout/failure.
+
+ **CRITICAL: Magic String Instructions for Sub-Agents**
+
+ Every sub-agent prompt MUST include this instruction:
+ > "IMPORTANT: Start your response with exactly `TASK_START: `. Keep your response brief - just make the edit and confirm. If a DeepWork hook fires and blocks you with a rules message, also include `HOOK_FIRED: ` in your response."
+
+ **How detection works:**
+ - Sub-agent ALWAYS outputs `TASK_START:` at the beginning of their response
+ - If a hook fires and blocks them, they get another turn and can output `HOOK_FIRED:`
+ - Main agent checks: `TASK_START` present + no `HOOK_FIRED` = hook did NOT fire
 **Sub-agent prompts (launch all 8 in parallel):**
- a. **Trigger/Safety test** - "Edit `manual_tests/test_trigger_safety_mode/feature.py` to add a comment, AND edit `manual_tests/test_trigger_safety_mode/feature_doc.md` to add a note. Both files must be edited so the rule does NOT fire."
+ a. **Trigger/Safety test** - "IMPORTANT: Start your response with exactly `TASK_START: Trigger/Safety test`. Edit `manual_tests/test_trigger_safety_mode/test_trigger_safety_mode.py` to add a comment, AND edit `manual_tests/test_trigger_safety_mode/test_trigger_safety_mode_doc.md` to add a note. Both files must be edited so the rule does NOT fire. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- b. **Set Mode test** - "Edit `manual_tests/test_set_mode/module_source.py` to add a comment, AND edit `manual_tests/test_set_mode/module_test.py` to add a test comment. Both files must be edited so the rule does NOT fire."
+ b. **Set Mode test** - "IMPORTANT: Start your response with exactly `TASK_START: Set Mode test`. Edit `manual_tests/test_set_mode/test_set_mode_source.py` to add a comment, AND edit `manual_tests/test_set_mode/test_set_mode_test.py` to add a test comment. Both files must be edited so the rule does NOT fire. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- c. **Pair Mode (forward) test** - "Edit `manual_tests/test_pair_mode/handler_trigger.py` to add a comment, AND edit `manual_tests/test_pair_mode/handler_expected.md` to add a note. Both files must be edited so the rule does NOT fire."
+ c. **Pair Mode (forward) test** - "IMPORTANT: Start your response with exactly `TASK_START: Pair Mode forward test`. Edit `manual_tests/test_pair_mode/test_pair_mode_trigger.py` to add a comment, AND edit `manual_tests/test_pair_mode/test_pair_mode_expected.md` to add a note. Both files must be edited so the rule does NOT fire. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- d. **Pair Mode (reverse) test** - "Edit ONLY `manual_tests/test_pair_mode/handler_expected.md` to add a note. Only the expected file should be edited - this tests that the pair rule only fires in one direction."
+ d. **Pair Mode (reverse) test** - "IMPORTANT: Start your response with exactly `TASK_START: Pair Mode reverse test`. Edit ONLY `manual_tests/test_pair_mode/test_pair_mode_expected.md` to add a note. Only the expected file should be edited - this tests that the pair rule only fires in one direction. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- e. **Multi Safety test** - "Edit `manual_tests/test_multi_safety/core.py` to add a comment, AND edit `manual_tests/test_multi_safety/core_safety_a.md` to add a note. Both files must be edited so the rule does NOT fire."
+ e. **Multi Safety test** - "IMPORTANT: Start your response with exactly `TASK_START: Multi Safety test`. Edit `manual_tests/test_multi_safety/test_multi_safety.py` to add a comment, AND edit `manual_tests/test_multi_safety/test_multi_safety_changelog.md` to add a note. Both files must be edited so the rule does NOT fire. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- f. **Infinite Block Prompt test** - "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block."
+ f. **Infinite Block Prompt test** - "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Prompt test`. Edit `manual_tests/test_infinite_block_prompt/test_infinite_block_prompt.py` to add a comment. Include `Manual Test: Infinite Block Prompt` in your response to bypass the infinite block. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- g. **Infinite Block Command test** - "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block."
+ g. **Infinite Block Command test** - "IMPORTANT: Start your response with exactly `TASK_START: Infinite Block Command test`. Edit `manual_tests/test_infinite_block_command/test_infinite_block_command.py` to add a comment. Include `Manual Test: Infinite Block Command` in your response to bypass the infinite block. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
- h. **Created Mode test** - "Modify the EXISTING file `manual_tests/test_created_mode/existing.yml` by adding a comment. Do NOT create a new file - only modify the existing one. The created mode rule should NOT fire for modifications."
+ h. **Created Mode test** - "IMPORTANT: Start your response with exactly `TASK_START: Created Mode test`. Modify the EXISTING file `manual_tests/test_created_mode/existing_file.yml` by adding a comment. Do NOT create a new file - only modify the existing one. The created mode rule should NOT fire for modifications. If a DeepWork hook fires and blocks you, also include `HOOK_FIRED: ` in your response."
-2. **Observe the results**
+2. **Check the results using magic strings**
- When each sub-agent returns:
- - **If no blocking hook fired**: The test PASSED - the rule correctly did NOT fire
- - **If a blocking hook fired**: The test FAILED - investigate why the rule fired when it shouldn't have
+ When each sub-agent returns, check their response for magic strings:
+ - **If `TASK_START:` present AND no `HOOK_FIRED:`**: The test PASSED - the rule correctly did NOT fire
+ - **If `HOOK_FIRED:` present**: The test FAILED - investigate why the rule fired when it shouldn't have
+ - **If neither `TASK_START:` nor `HOOK_FIRED:`**: The test is INCONCLUSIVE (timeout or sub-agent didn't follow instructions)
- **Remember**: You are OBSERVING whether hooks fired automatically. Do NOT run any verification commands manually.
+ **Remember**: You determine pass/fail by checking for magic strings in the sub-agent's response. Do NOT run any verification commands manually.
 3. **Record the results and check for early termination**
@@ -100,22 +119,22 @@ Run all 8 "should NOT fire" tests in **parallel** sub-agents, then verify no blo
 Run these commands to clean up:
 ```bash
 git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml
- rm -rf .deepwork/tmp/rules/queue/*.json 2>/dev/null || true
+ deepwork rules clear_queue
 ```
 **Why this command sequence**:
 - `git reset HEAD manual_tests/` - Unstages files from the index (rules_check uses `git add -A` which stages changes)
 - `git checkout -- manual_tests/` - Reverts working tree to match HEAD
 - `rm -f manual_tests/test_created_mode/new_config.yml` - Removes any new files created during tests
- - The queue clear removes rules that have been shown (status=QUEUED) so they can fire again
+ - `deepwork rules clear_queue` - Clears the rules queue so rules can fire again
 ## Quality Criteria
 - **Sub-agents spawned**: All 8 tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly
 - **Parallel execution**: All 8 sub-agents were launched in a single message (parallel)
-- **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check
+- **Magic string detection**: The main agent checked each sub-agent's response for `TASK_START:` (present) and `HOOK_FIRED:` (absent) - did NOT manually run rules_check
 - **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported
-- **Changes reverted and queue cleared**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` was run after tests completed (regardless of pass/fail)
+- **Changes reverted and queue cleared**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` was run after tests completed (regardless of pass/fail)
 - When all criteria are met, include `✓ Quality Criteria Met` in your response
 ## Reference
@@ -135,11 +154,27 @@ This job tests that rules fire when they should AND do not fire when they should
 Each test is run in a SUB-AGENT (not the main agent) because:
 1. Sub-agents run in isolated contexts where file changes can be detected
 2. The Stop hook automatically evaluates rules when each sub-agent completes
-3. The main agent can observe whether hooks fired without triggering them manually
+3. Sub-agents report results via MAGIC STRINGS that the main agent checks
+
+MAGIC STRING DETECTION: Sub-agents output:
+- "TASK_START: " - ALWAYS at the start of their response
+- "HOOK_FIRED: " - If a DeepWork hook blocks them
+Detection logic:
+- TASK_START present + no HOOK_FIRED = hook did NOT fire
+- HOOK_FIRED present = hook fired
+- Neither present = timeout (hook blocking infinitely)
+
+TIMEOUT PREVENTION: All sub-agent Task calls use max_turns: 5 to prevent
+infinite hangs. If a sub-agent hits the limit (e.g., stuck in infinite block),
+treat as timeout - PASSED for "should fire" tests, FAILED for "should NOT fire".
+
+TOKEN OVERHEAD: Each sub-agent uses ~16k input tokens (system prompt + tool
+definitions). This is unavoidable baseline overhead for agents with Edit access.
+Sub-agent prompts include efficiency instructions to minimize additional usage.
 CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the file
-edits itself - it spawns sub-agents to make edits, then observes whether the hooks
-fired automatically when those sub-agents returned.
+edits itself - it spawns sub-agents to make edits, then checks the returned magic
+strings to determine whether hooks fired.
 Steps:
 1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents
@@ -175,9 +210,10 @@ Use branch format: `deepwork/manual_tests-[instance]-YYYYMMDD`
 **Criteria (all must be satisfied)**:
 1. **Sub-Agents Used**: Did the main agent spawn sub-agents (using the Task tool) to make the file edits?
The main agent must NOT edit the test files directly. 2. **Parallel Execution**: Were multiple sub-agents launched in parallel (in a single message with multiple Task tool calls)? -3. **Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command. -4. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? -5. **Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json`? +3. **Task Parameters**: Did each Task call include `model: "haiku"` and `max_turns: 5`? +4. **Magic String Detection**: Did the main agent check each sub-agent's response for `TASK_START:` (present) and absence of `HOOK_FIRED:`? The agent must NOT manually run rules_check. +5. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? +6. **Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue`? ## On Completion 1. Verify outputs are created