diff --git a/.claude/settings.json b/.claude/settings.json index 33ef2b48..ac649fc3 100644 --- a/.claude/settings.json +++ b/.claude/settings.json @@ -120,7 +120,10 @@ "Skill(manual_tests.run_fire_tests)", "Skill(deepwork_rules)", "Skill(deepwork_rules.define)", - "Bash(deepwork rules clear_queue)" + "Bash(deepwork rules clear_queue)", + "Bash(rm -rf .deepwork/tmp/rules/queue/*.json)", + "Skill(manual_tests.reset)", + "Skill(manual_tests.infinite_block_tests)" ] }, "hooks": { diff --git a/.claude/skills/manual_tests.infinite_block_tests/SKILL.md b/.claude/skills/manual_tests.infinite_block_tests/SKILL.md new file mode 100644 index 00000000..a3e5f926 --- /dev/null +++ b/.claude/skills/manual_tests.infinite_block_tests/SKILL.md @@ -0,0 +1,292 @@ +--- +name: manual_tests.infinite_block_tests +description: "Runs all 4 infinite block tests serially. Tests both 'should fire' (no promise) and 'should NOT fire' (with promise) scenarios." +user-invocable: false +hooks: + Stop: + - hooks: + - type: prompt + prompt: | + You must evaluate whether Claude has met all the below quality criteria for the request. + + ## Quality Criteria + + 1. **Sub-Agents Used**: Each test run via Task tool with `model: "haiku"` and `max_turns: 5` + 2. **Serial Execution**: Sub-agents launched ONE AT A TIME with reset between each + 3. **Promise Tests**: Completed WITHOUT blocking (promise bypassed the rule) + 4. **No-Promise Tests**: Hook fired AND sub-agent returned in reasonable time (not hung) + + ## Instructions + + Review the conversation and determine if ALL quality criteria above have been satisfied. + Look for evidence that each criterion has been addressed. 
+ + If the agent has included `✓ Quality Criteria Met` in their response OR + all criteria appear to be met, respond with: {"ok": true} + + If criteria are NOT met AND the promise tag is missing, respond with: + {"ok": false, "reason": "**AGENT: TAKE ACTION** - [which criteria failed and why]"} + SubagentStop: + - hooks: + - type: prompt + prompt: | + You must evaluate whether Claude has met all the below quality criteria for the request. + + ## Quality Criteria + + 1. **Sub-Agents Used**: Each test run via Task tool with `model: "haiku"` and `max_turns: 5` + 2. **Serial Execution**: Sub-agents launched ONE AT A TIME with reset between each + 3. **Promise Tests**: Completed WITHOUT blocking (promise bypassed the rule) + 4. **No-Promise Tests**: Hook fired AND sub-agent returned in reasonable time (not hung) + + ## Instructions + + Review the conversation and determine if ALL quality criteria above have been satisfied. + Look for evidence that each criterion has been addressed. + + If the agent has included `✓ Quality Criteria Met` in their response OR + all criteria appear to be met, respond with: {"ok": true} + + If criteria are NOT met AND the promise tag is missing, respond with: + {"ok": false, "reason": "**AGENT: TAKE ACTION** - [which criteria failed and why]"} +--- + +# manual_tests.infinite_block_tests + +**Step 4/4** in **manual_tests** workflow + +> Runs all manual hook/rule tests using sub-agents. Use when validating that DeepWork rules fire correctly. + +## Prerequisites (Verify First) + +Before proceeding, confirm these steps are complete: +- `/manual_tests.run_fire_tests` + +## Instructions + +**Goal**: Runs all 4 infinite block tests serially. Tests both 'should fire' (no promise) and 'should NOT fire' (with promise) scenarios. + +# Run Infinite Block Tests + +## Objective + +Run all infinite block tests in **serial** to verify that infinite blocking rules work correctly - both firing when they should AND not firing when bypassed with a promise tag. 
+ +## CRITICAL: Sub-Agent Requirement + +**You MUST spawn sub-agents to make all file edits. DO NOT edit the test files yourself.** + +Why sub-agents are required: +1. Sub-agents run in isolated contexts where file changes are detected +2. When a sub-agent completes, the Stop hook **automatically** evaluates rules +3. You (the main agent) observe whether hooks fired - you do NOT manually trigger them +4. If you edit files directly, the hooks won't fire because you're not a completing sub-agent + +**NEVER manually run `echo '{}' | python -m deepwork.hooks.rules_check`** - this defeats the purpose of the test. Hooks must fire AUTOMATICALLY when sub-agents return. + +## CRITICAL: Serial Execution + +**These tests MUST run ONE AT A TIME, with resets between each.** + +Why serial execution is required for infinite block tests: +- Infinite block tests can block indefinitely without a promise tag +- Running them in parallel would cause unpredictable blocking behavior +- Serial execution allows controlled observation of each test + +## Task + +Run all 4 infinite block tests in **serial**, resetting between each, and verify correct blocking behavior. + +### Process + +For EACH test below, follow this cycle: + +1. **Launch a sub-agent** using the Task tool with: + - `model: "haiku"` - Use the fast model to minimize cost and latency + - `max_turns: 5` - **Critical safeguard**: Limits API round-trips to prevent infinite hanging. The Task tool does not support a direct timeout, so max_turns is our only protection against runaway sub-agents. +2. **Wait for the sub-agent to complete** +3. **Observe whether the hook fired automatically** - you should see a blocking prompt or command output +4. 
**If no visible blocking occurred, check the queue**: + ```bash + ls -la .deepwork/tmp/rules/queue/ + cat .deepwork/tmp/rules/queue/*.json 2>/dev/null + ``` + - If queue entries exist with status "queued", the hook DID fire but blocking wasn't visible + - If queue is empty, the hook did NOT fire at all + - Record the queue status along with the result +5. **Record the result** - see expected outcomes for each test +6. **Reset** (MANDATORY after each test) - follow the reset step instructions: + ```bash + git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml + deepwork rules clear_queue + ``` +7. **Check for early termination**: If **2 tests have now failed**, immediately: + - Stop running any remaining tests + - Report the results summary showing which tests passed/failed + - The job halts here - do NOT proceed with remaining tests +8. **Proceed to the next test** (only if fewer than 2 failures) + +**IMPORTANT**: Only launch ONE sub-agent at a time. Wait for it to complete and reset before launching the next. + +### Test Cases (run serially) + +**Test 1: Infinite Block Prompt - Should NOT Fire (with promise)** +- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` +- Expected: Sub-agent completes WITHOUT blocking - the promise tag bypasses the infinite block +- Result: PASS if no blocking, FAIL if blocked + +**Test 2: Infinite Block Command - Should NOT Fire (with promise)** +- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block." 
+- Sub-agent config: `model: "haiku"`, `max_turns: 5` +- Expected: Sub-agent completes WITHOUT blocking - the promise tag bypasses the infinite block +- Result: PASS if no blocking, FAIL if blocked + +**Test 3: Infinite Block Prompt - Should Fire (no promise)** +- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Do NOT include any promise tags." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` +- Expected behavior: + 1. **Should fire**: Hook fires and shows blocking prompt + 2. **Should return in reasonable time**: Sub-agent hits max_turns limit and returns (not stuck forever) +- Result criteria: + - PASS if: Hook fired (visible block OR queue entry) AND sub-agent returned within reasonable time + - FAIL if: Hook did not fire, OR sub-agent hung indefinitely + +**Test 4: Infinite Block Command - Should Fire (no promise)** +- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Do NOT include any promise tags." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` +- Expected behavior: + 1. **Should fire**: Hook fires and command fails (exit code 1) + 2. **Should return in reasonable time**: Sub-agent hits max_turns limit and returns (not stuck forever) +- Result criteria: + - PASS if: Hook fired (visible block OR queue entry) AND sub-agent returned within reasonable time + - FAIL if: Hook did not fire, OR sub-agent hung indefinitely + +### Results Tracking + +Record the result after each test: + +| Test Case | Scenario | Should Fire? | Returned in Time? | Visible Block? | Queue Entry? 
| Result | +|-----------|----------|:------------:|:-----------------:|:--------------:|:------------:|:------:| +| Infinite Block Prompt | With promise | No | Yes | | | | +| Infinite Block Command | With promise | No | Yes | | | | +| Infinite Block Prompt | No promise | Yes | Yes | | | | +| Infinite Block Command | No promise | Yes | Yes | | | | + +**Result criteria:** +- **"Should NOT fire" tests (with promise)**: PASS if no blocking AND no queue entry AND returned quickly +- **"Should fire" tests (no promise)**: PASS if hook fired (visible block OR queue entry) AND returned in reasonable time (max_turns limit) + +**Queue Entry Status Guide:** +- If queue has entry with status "queued" -> Hook fired, rule was shown to agent +- If queue has entry with status "passed" -> Hook fired, rule was satisfied +- If queue is empty -> Hook did NOT fire + +## Quality Criteria + +- **Sub-agents spawned**: Tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly +- **Correct sub-agent config**: All sub-agents used `model: "haiku"` and `max_turns: 5` +- **Serial execution**: Sub-agents were launched ONE AT A TIME, not in parallel +- **Reset between tests**: Reset step was followed after each test +- **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check - hooks fired AUTOMATICALLY +- **"Should NOT fire" tests verified**: Promise tests completed without blocking and no queue entries +- **"Should fire" tests verified**: Non-promise tests fired (visible block OR queue entry) AND returned in reasonable time (not hung indefinitely) +- **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported +- **Results recorded**: Pass/fail status was recorded for each test run +- When all criteria are met, include `Quality Criteria Met` in your response + +## Reference + +See [test_reference.md](test_reference.md) for the complete test matrix and 
rule descriptions. + +## Context + +This step runs after both the "should NOT fire" and "should fire" test steps. It specifically tests infinite blocking behavior which requires serial execution due to the blocking nature of these rules. + + +### Job Context + +A workflow for running manual tests that validate DeepWork rules/hooks fire correctly. + +This job tests that rules fire when they should AND do not fire when they shouldn't. +Each test is run in a SUB-AGENT (not the main agent) because: +1. Sub-agents run in isolated contexts where file changes can be detected +2. The Stop hook automatically evaluates rules when each sub-agent completes +3. The main agent can observe whether hooks fired without triggering them manually + +CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the file +edits itself - it spawns sub-agents to make edits, then observes whether the hooks +fired automatically when those sub-agents returned. + +Sub-agent configuration: +- All sub-agents should use `model: "haiku"` to minimize cost and latency +- All sub-agents should use `max_turns: 5` to prevent hanging indefinitely + +Steps: +1. reset - Ensure clean environment before testing (clears queue, reverts files) +2. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents (6 tests) +3. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with resets between (6 tests) +4. 
infinite_block_tests - Run infinite block tests in SERIAL (4 tests - both fire and not-fire) + +Reset procedure (see steps/reset.md): +- Reset runs FIRST to ensure a clean environment before any tests +- Each step also calls reset internally when needed (between tests, after completion) +- Reset reverts git changes, removes created files, and clears the rules queue + +Test types covered: +- Trigger/Safety mode +- Set mode (bidirectional) +- Pair mode (directional) +- Command action +- Multi safety +- Infinite block (prompt and command) - in dedicated step +- Created mode (new files only) + + +## Required Inputs + + +**Files from Previous Steps** - Read these first: +- `fire_results` (from `run_fire_tests`) + +## Work Branch + +Use branch format: `deepwork/manual_tests-[instance]-YYYYMMDD` + +- If on a matching work branch: continue using it +- If on main/master: create new branch with `git checkout -b deepwork/manual_tests-[instance]-$(date +%Y%m%d)` + +## Outputs + +**Required outputs**: +- `infinite_block_results` + +## Guardrails + +- Do NOT skip prerequisite verification if this step has dependencies +- Do NOT produce partial outputs; complete all required outputs before finishing +- Do NOT proceed without required inputs; ask the user if any are missing +- Do NOT modify files outside the scope of this step's defined outputs + +## Quality Validation + +Stop hooks will automatically validate your work. The loop continues until all criteria pass. + +**Criteria (all must be satisfied)**: +1. **Sub-Agents Used**: Each test run via Task tool with `model: "haiku"` and `max_turns: 5` +2. **Serial Execution**: Sub-agents launched ONE AT A TIME with reset between each +3. **Promise Tests**: Completed WITHOUT blocking (promise bypassed the rule) +4. **No-Promise Tests**: Hook fired AND sub-agent returned in reasonable time (not hung) + + +**To complete**: Include `✓ Quality Criteria Met` in your final response only after verifying ALL criteria are satisfied. 
+ +## On Completion + +1. Verify outputs are created +2. Inform user: "Step 4/4 complete, outputs: infinite_block_results" +3. **Workflow complete**: All steps finished. Consider creating a PR to merge the work branch. + +--- + +**Reference files**: `.deepwork/jobs/manual_tests/job.yml`, `.deepwork/jobs/manual_tests/steps/infinite_block_tests.md` \ No newline at end of file diff --git a/.claude/skills/manual_tests.reset/SKILL.md b/.claude/skills/manual_tests.reset/SKILL.md new file mode 100644 index 00000000..b7b30058 --- /dev/null +++ b/.claude/skills/manual_tests.reset/SKILL.md @@ -0,0 +1,176 @@ +--- +name: manual_tests.reset +description: "Runs FIRST to ensure clean environment. Also called internally by other steps when they need to revert changes and clear the queue." +user-invocable: false +hooks: + Stop: + - hooks: + - type: prompt + prompt: | + You must evaluate whether Claude has met all the below quality criteria for the request. + + ## Quality Criteria + + 1. **Environment Clean**: Git changes reverted, created files removed, and rules queue cleared + + ## Instructions + + Review the conversation and determine if ALL quality criteria above have been satisfied. + Look for evidence that each criterion has been addressed. + + If the agent has included `✓ Quality Criteria Met` in their response OR + all criteria appear to be met, respond with: {"ok": true} + + If criteria are NOT met AND the promise tag is missing, respond with: + {"ok": false, "reason": "**AGENT: TAKE ACTION** - [which criteria failed and why]"} + SubagentStop: + - hooks: + - type: prompt + prompt: | + You must evaluate whether Claude has met all the below quality criteria for the request. + + ## Quality Criteria + + 1. **Environment Clean**: Git changes reverted, created files removed, and rules queue cleared + + ## Instructions + + Review the conversation and determine if ALL quality criteria above have been satisfied. + Look for evidence that each criterion has been addressed. 
+ + If the agent has included `✓ Quality Criteria Met` in their response OR + all criteria appear to be met, respond with: {"ok": true} + + If criteria are NOT met AND the promise tag is missing, respond with: + {"ok": false, "reason": "**AGENT: TAKE ACTION** - [which criteria failed and why]"} +--- + +# manual_tests.reset + +**Step 1/4** in **manual_tests** workflow + +> Runs all manual hook/rule tests using sub-agents. Use when validating that DeepWork rules fire correctly. + + +## Instructions + +**Goal**: Runs FIRST to ensure clean environment. Also called internally by other steps when they need to revert changes and clear the queue. + +# Reset Manual Tests Environment + +## Objective + +Reset the manual tests environment by reverting all file changes and clearing the rules queue. + +## Purpose + +This step contains all the reset logic that other steps can call when they need to clean up between or after tests. It ensures consistent cleanup across all test steps. + +## Reset Commands + +Run these commands to reset the environment: + +```bash +git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml +deepwork rules clear_queue +``` + +## Command Explanation + +- `git reset HEAD manual_tests/` - Unstages files from the index (rules_check uses `git add -A` which stages changes) +- `git checkout -- manual_tests/` - Reverts working tree to match HEAD +- `rm -f manual_tests/test_created_mode/new_config.yml` - Removes any new files created during tests (the created mode test creates this file) +- `deepwork rules clear_queue` - Clears the rules queue so rules can fire again (prevents anti-infinite-loop mechanism from blocking subsequent tests) + +## When to Reset + +- **After each serial test**: Reset immediately after observing the result to prevent cross-contamination +- **After parallel tests complete**: Reset once all parallel sub-agents have returned +- **On early termination**: Reset before reporting 
failure results +- **Before starting a new test step**: Ensure clean state + +## Quality Criteria + +- **All changes reverted**: `git status` shows no changes in `manual_tests/` +- **Queue cleared**: `.deepwork/tmp/rules/queue/` is empty +- **New files removed**: `manual_tests/test_created_mode/new_config.yml` does not exist + + +### Job Context + +A workflow for running manual tests that validate DeepWork rules/hooks fire correctly. + +This job tests that rules fire when they should AND do not fire when they shouldn't. +Each test is run in a SUB-AGENT (not the main agent) because: +1. Sub-agents run in isolated contexts where file changes can be detected +2. The Stop hook automatically evaluates rules when each sub-agent completes +3. The main agent can observe whether hooks fired without triggering them manually + +CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the file +edits itself - it spawns sub-agents to make edits, then observes whether the hooks +fired automatically when those sub-agents returned. + +Sub-agent configuration: +- All sub-agents should use `model: "haiku"` to minimize cost and latency +- All sub-agents should use `max_turns: 5` to prevent hanging indefinitely + +Steps: +1. reset - Ensure clean environment before testing (clears queue, reverts files) +2. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents (6 tests) +3. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with resets between (6 tests) +4. 
infinite_block_tests - Run infinite block tests in SERIAL (4 tests - both fire and not-fire) + +Reset procedure (see steps/reset.md): +- Reset runs FIRST to ensure a clean environment before any tests +- Each step also calls reset internally when needed (between tests, after completion) +- Reset reverts git changes, removes created files, and clears the rules queue + +Test types covered: +- Trigger/Safety mode +- Set mode (bidirectional) +- Pair mode (directional) +- Command action +- Multi safety +- Infinite block (prompt and command) - in dedicated step +- Created mode (new files only) + + + +## Work Branch + +Use branch format: `deepwork/manual_tests-[instance]-YYYYMMDD` + +- If on a matching work branch: continue using it +- If on main/master: create new branch with `git checkout -b deepwork/manual_tests-[instance]-$(date +%Y%m%d)` + +## Outputs + +**Required outputs**: +- `clean_environment` + +## Guardrails + +- Do NOT skip prerequisite verification if this step has dependencies +- Do NOT produce partial outputs; complete all required outputs before finishing +- Do NOT proceed without required inputs; ask the user if any are missing +- Do NOT modify files outside the scope of this step's defined outputs + +## Quality Validation + +Stop hooks will automatically validate your work. The loop continues until all criteria pass. + +**Criteria (all must be satisfied)**: +1. **Environment Clean**: Git changes reverted, created files removed, and rules queue cleared + + +**To complete**: Include `✓ Quality Criteria Met` in your final response only after verifying ALL criteria are satisfied. + +## On Completion + +1. Verify outputs are created +2. Inform user: "Step 1/4 complete, outputs: clean_environment" +3. 
**Continue workflow**: Use Skill tool to invoke `/manual_tests.run_not_fire_tests` + +--- + +**Reference files**: `.deepwork/jobs/manual_tests/job.yml`, `.deepwork/jobs/manual_tests/steps/reset.md` \ No newline at end of file diff --git a/.claude/skills/manual_tests.run_fire_tests/SKILL.md b/.claude/skills/manual_tests.run_fire_tests/SKILL.md index 86edc039..9139f46b 100644 --- a/.claude/skills/manual_tests.run_fire_tests/SKILL.md +++ b/.claude/skills/manual_tests.run_fire_tests/SKILL.md @@ -1,6 +1,6 @@ --- name: manual_tests.run_fire_tests -description: "Runs all 'should fire' tests serially with git reverts between each. Use after NOT-fire tests to verify rules fire correctly." +description: "Runs all 6 'should fire' tests serially with resets between each. Use after NOT-fire tests to verify rules fire correctly." user-invocable: false hooks: Stop: @@ -12,11 +12,12 @@ hooks: ## Quality Criteria 1. **Sub-Agents Used**: Did the main agent spawn a sub-agent (using the Task tool) for EACH test? The main agent must NOT edit the test files directly. - 2. **Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination? - 3. **Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command. - 4. **Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` run after each test to revert files and prevent cross-contamination? - 5. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? - 6. **Results Recorded**: Did the main agent track pass/fail status for each test case? + 2. **Sub-Agent Config**: Did all sub-agents use `model: "haiku"` and `max_turns: 5`? + 3. **Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination? + 4. 
**Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command. + 5. **Reset Between Tests**: Was the reset step called internally after each test to revert files and prevent cross-contamination? + 6. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? + 7. **Results Recorded**: Did the main agent track pass/fail status for each test case? ## Instructions @@ -37,11 +38,12 @@ hooks: ## Quality Criteria 1. **Sub-Agents Used**: Did the main agent spawn a sub-agent (using the Task tool) for EACH test? The main agent must NOT edit the test files directly. - 2. **Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination? - 3. **Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command. - 4. **Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` run after each test to revert files and prevent cross-contamination? - 5. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? - 6. **Results Recorded**: Did the main agent track pass/fail status for each test case? + 2. **Sub-Agent Config**: Did all sub-agents use `model: "haiku"` and `max_turns: 5`? + 3. **Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination? + 4. **Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command. + 5. **Reset Between Tests**: Was the reset step called internally after each test to revert files and prevent cross-contamination? + 6. 
**Early Termination**: If 2 tests failed, did testing halt immediately with results reported? + 7. **Results Recorded**: Did the main agent track pass/fail status for each test case? ## Instructions @@ -57,7 +59,7 @@ hooks: # manual_tests.run_fire_tests -**Step 2/2** in **manual_tests** workflow +**Step 3/4** in **manual_tests** workflow > Runs all manual hook/rule tests using sub-agents. Use when validating that DeepWork rules fire correctly. @@ -68,7 +70,7 @@ Before proceeding, confirm these steps are complete: ## Instructions -**Goal**: Runs all 'should fire' tests serially with git reverts between each. Use after NOT-fire tests to verify rules fire correctly. +**Goal**: Runs all 6 'should fire' tests serially with resets between each. Use after NOT-fire tests to verify rules fire correctly. # Run Should-Fire Tests @@ -90,23 +92,25 @@ Why sub-agents are required: ## CRITICAL: Serial Execution -**These tests MUST run ONE AT A TIME, with git reverts between each.** +**These tests MUST run ONE AT A TIME, with resets between each.** Why serial execution is required: - These tests edit ONLY the trigger file (not the safety) - If multiple sub-agents run in parallel, sub-agent A's hook will see changes from sub-agent B - This causes cross-contamination: A gets blocked by rules triggered by B's changes -- Run one test, observe the hook, revert, then run the next +- Run one test, observe the hook, reset, then run the next ## Task -Run all 8 "should fire" tests in **serial** sub-agents, reverting between each, and verify that blocking hooks fire automatically. +Run all 6 "should fire" tests in **serial** sub-agents, resetting between each, and verify that blocking hooks fire automatically. ### Process For EACH test below, follow this cycle: -1. **Launch a sub-agent** using the Task tool (use a fast model like haiku) +1. 
**Launch a sub-agent** using the Task tool with: + - `model: "haiku"` - Use the fast model to minimize cost and latency + - `max_turns: 5` - Prevent sub-agents from hanging indefinitely 2. **Wait for the sub-agent to complete** 3. **Observe whether the hook fired automatically** - you should see a blocking prompt or command output 4. **If no visible blocking occurred, check the queue**: @@ -118,56 +122,50 @@ For EACH test below, follow this cycle: - If queue is empty, the hook did NOT fire at all - Record the queue status along with the result 5. **Record the result** - pass if hook fired (visible block OR queue entry), fail if neither -6. **Revert changes and clear queue** (MANDATORY after each test): +6. **Reset** (MANDATORY after each test) - follow the reset step instructions: ```bash git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml - rm -rf .deepwork/tmp/rules/queue/*.json 2>/dev/null || true + deepwork rules clear_queue ``` - **Why this command sequence**: - - `git reset HEAD manual_tests/` - Unstages files from the index (rules_check uses `git add -A` which stages changes) - - `git checkout -- manual_tests/` - Reverts working tree to match HEAD - - `rm -f ...` - Removes any new files created during tests - - The queue clear removes rules that have been shown (status=QUEUED) so they can fire again + See [reset.md](reset.md) for detailed explanation of these commands. 7. **Check for early termination**: If **2 tests have now failed**, immediately: - Stop running any remaining tests - Report the results summary showing which tests passed/failed - The job halts here - do NOT proceed with remaining tests 8. **Proceed to the next test** (only if fewer than 2 failures) -**IMPORTANT**: Only launch ONE sub-agent at a time. Wait for it to complete and revert before launching the next. +**IMPORTANT**: Only launch ONE sub-agent at a time. Wait for it to complete and reset before launching the next. 
### Test Cases (run serially) **Test 1: Trigger/Safety** - Sub-agent prompt: "Edit ONLY `manual_tests/test_trigger_safety_mode/feature.py` to add a comment. Do NOT edit the `_doc.md` file." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about updating documentation **Test 2: Set Mode** - Sub-agent prompt: "Edit ONLY `manual_tests/test_set_mode/module_source.py` to add a comment. Do NOT edit the `_test.py` file." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about updating tests **Test 3: Pair Mode** - Sub-agent prompt: "Edit ONLY `manual_tests/test_pair_mode/handler_trigger.py` to add a comment. Do NOT edit the `_expected.md` file." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about updating expected output **Test 4: Command Action** - Sub-agent prompt: "Edit `manual_tests/test_command_action/input.txt` to add some text." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Command runs automatically, appending to the log file (this rule always runs, no safety condition) **Test 5: Multi Safety** - Sub-agent prompt: "Edit ONLY `manual_tests/test_multi_safety/core.py` to add a comment. Do NOT edit any of the safety files (`_safety_a.md`, `_safety_b.md`, or `_safety_c.md`)." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about updating safety documentation -**Test 6: Infinite Block Prompt** -- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Do NOT include any promise tags." -- Expected: Hook fires and BLOCKS with infinite prompt - sub-agent cannot complete until promise is provided - -**Test 7: Infinite Block Command** -- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Do NOT include any promise tags." 
-- Expected: Hook fires and command fails - sub-agent cannot complete until promise is provided - -**Test 8: Created Mode** +**Test 6: Created Mode** - Sub-agent prompt: "Create a NEW file `manual_tests/test_created_mode/new_config.yml` with some YAML content. This must be a NEW file, not a modification." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about new configuration files ### Results Tracking @@ -181,25 +179,23 @@ Record the result after each test: | Pair Mode | Edit _trigger.py only | | | | | Command Action | Edit .txt | | | | | Multi Safety | Edit .py only | | | | -| Infinite Block Prompt | Edit .py (no promise) | | | | -| Infinite Block Command | Edit .py (no promise) | | | | | Created Mode | Create NEW .yml | | | | **Queue Entry Status Guide:** -- If queue has entry with status "queued" → Hook fired, rule was shown to agent -- If queue has entry with status "passed" → Hook fired, rule was satisfied -- If queue is empty → Hook did NOT fire +- If queue has entry with status "queued" -> Hook fired, rule was shown to agent +- If queue has entry with status "passed" -> Hook fired, rule was satisfied +- If queue is empty -> Hook did NOT fire ## Quality Criteria - **Sub-agents spawned**: Tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly +- **Correct sub-agent config**: All sub-agents used `model: "haiku"` and `max_turns: 5` - **Serial execution**: Sub-agents were launched ONE AT A TIME, not in parallel -- **Git reverted and queue cleared between tests**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` was run after each test -- **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check - hooks fired AUTOMATICALLY -- **Blocking behavior verified**: For each test run, the appropriate blocking hook fired automatically when the sub-agent returned +- **Reset 
between tests**: Reset step was followed after each test +- **Hooks fired automatically**: The main agent observed the blocking hooks firing automatically when each sub-agent returned - the agent did NOT manually run rules_check - **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported -- **Results recorded**: Pass/fail status was recorded for each test run -- When all criteria are met, include `✓ Quality Criteria Met` in your response +- **Results recorded**: Pass/fail status was recorded for each test case +- When all criteria are met, include `Quality Criteria Met` in your response ## Reference @@ -207,7 +203,7 @@ See [test_reference.md](test_reference.md) for the complete test matrix and rule ## Context -This step runs after the "should NOT fire" tests. These tests verify that rules correctly fire when trigger conditions are met without safety conditions. The serial execution with reverts is essential to prevent cross-contamination between tests. +This step runs after the "should NOT fire" tests. These tests verify that rules correctly fire when trigger conditions are met without safety conditions. The serial execution with resets is essential to prevent cross-contamination between tests. Infinite block tests are handled in a separate step. ### Job Context @@ -224,9 +220,20 @@ CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the fil edits itself - it spawns sub-agents to make edits, then observes whether the hooks fired automatically when those sub-agents returned. +Sub-agent configuration: +- All sub-agents should use `model: "haiku"` to minimize cost and latency +- All sub-agents should use `max_turns: 5` to prevent hanging indefinitely + Steps: -1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents -2. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with reverts between +1. 
reset - Ensure clean environment before testing (clears queue, reverts files) +2. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents (6 tests) +3. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with resets between (6 tests) +4. infinite_block_tests - Run infinite block tests in SERIAL (4 tests - both fire and not-fire) + +Reset procedure (see steps/reset.md): +- Reset runs FIRST to ensure a clean environment before any tests +- Each step also calls reset internally when needed (between tests, after completion) +- Reset reverts git changes, removes created files, and clears the rules queue Test types covered: - Trigger/Safety mode @@ -234,7 +241,7 @@ Test types covered: - Pair mode (directional) - Command action - Multi safety -- Infinite block (prompt and command) +- Infinite block (prompt and command) - in dedicated step - Created mode (new files only) @@ -269,11 +276,12 @@ Stop hooks will automatically validate your work. The loop continues until all c **Criteria (all must be satisfied)**: 1. **Sub-Agents Used**: Did the main agent spawn a sub-agent (using the Task tool) for EACH test? The main agent must NOT edit the test files directly. -2. **Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination? -3. **Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command. -4. **Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` run after each test to revert files and prevent cross-contamination? -5. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? -6. **Results Recorded**: Did the main agent track pass/fail status for each test case? +2. **Sub-Agent Config**: Did all sub-agents use `model: "haiku"` and `max_turns: 5`? 
+3. **Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination? +4. **Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command. +5. **Reset Between Tests**: Was the reset step called internally after each test to revert files and prevent cross-contamination? +6. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? +7. **Results Recorded**: Did the main agent track pass/fail status for each test case? **To complete**: Include `✓ Quality Criteria Met` in your final response only after verifying ALL criteria are satisfied. @@ -281,8 +289,8 @@ Stop hooks will automatically validate your work. The loop continues until all c ## On Completion 1. Verify outputs are created -2. Inform user: "Step 2/2 complete, outputs: fire_results" -3. **Workflow complete**: All steps finished. Consider creating a PR to merge the work branch. +2. Inform user: "Step 3/4 complete, outputs: fire_results" +3. **Continue workflow**: Use Skill tool to invoke `/manual_tests.infinite_block_tests` --- diff --git a/.claude/skills/manual_tests.run_not_fire_tests/SKILL.md b/.claude/skills/manual_tests.run_not_fire_tests/SKILL.md index 2597c0f3..4ea21e41 100644 --- a/.claude/skills/manual_tests.run_not_fire_tests/SKILL.md +++ b/.claude/skills/manual_tests.run_not_fire_tests/SKILL.md @@ -1,6 +1,6 @@ --- name: manual_tests.run_not_fire_tests -description: "Runs all 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met." +description: "Runs all 6 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met." user-invocable: false hooks: Stop: @@ -12,10 +12,12 @@ hooks: ## Quality Criteria 1. 
**Sub-Agents Used**: Did the main agent spawn sub-agents (using the Task tool) to make the file edits? The main agent must NOT edit the test files directly. - 2. **Parallel Execution**: Were multiple sub-agents launched in parallel (in a single message with multiple Task tool calls)? - 3. **Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command. - 4. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? - 5. **Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json`? + 2. **Sub-Agent Config**: Did all sub-agents use `model: "haiku"` and `max_turns: 5`? + 3. **Parallel Execution**: Were all 6 sub-agents launched in parallel (in a single message with multiple Task tool calls)? + 4. **Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command. + 5. **Queue Verified Empty**: After all sub-agents completed, was the rules queue checked and confirmed empty (no entries = rules did not fire)? + 6. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? + 7. **Reset Performed**: Was the reset step called internally after tests completed (or after early termination)? ## Instructions @@ -36,10 +38,12 @@ hooks: ## Quality Criteria 1. **Sub-Agents Used**: Did the main agent spawn sub-agents (using the Task tool) to make the file edits? The main agent must NOT edit the test files directly. - 2. **Parallel Execution**: Were multiple sub-agents launched in parallel (in a single message with multiple Task tool calls)? - 3. 
**Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command. - 4. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? - 5. **Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json`? + 2. **Sub-Agent Config**: Did all sub-agents use `model: "haiku"` and `max_turns: 5`? + 3. **Parallel Execution**: Were all 6 sub-agents launched in parallel (in a single message with multiple Task tool calls)? + 4. **Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command. + 5. **Queue Verified Empty**: After all sub-agents completed, was the rules queue checked and confirmed empty (no entries = rules did not fire)? + 6. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? + 7. **Reset Performed**: Was the reset step called internally after tests completed (or after early termination)? ## Instructions @@ -55,14 +59,18 @@ hooks: # manual_tests.run_not_fire_tests -**Step 1/2** in **manual_tests** workflow +**Step 2/4** in **manual_tests** workflow > Runs all manual hook/rule tests using sub-agents. Use when validating that DeepWork rules fire correctly. +## Prerequisites (Verify First) + +Before proceeding, confirm these steps are complete: +- `/manual_tests.reset` ## Instructions -**Goal**: Runs all 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met. +**Goal**: Runs all 6 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met. 
# Run Should-NOT-Fire Tests @@ -84,15 +92,19 @@ Why sub-agents are required: ## Task -Run all 8 "should NOT fire" tests in **parallel** sub-agents, then verify no blocking hooks fired. +Run all 6 "should NOT fire" tests in **parallel** sub-agents, then verify no blocking hooks fired. ### Process 1. **Launch parallel sub-agents for all "should NOT fire" tests** - Use the Task tool to spawn **ALL of the following sub-agents in a SINGLE message** (parallel execution). Each sub-agent should use a fast model like haiku. + Use the Task tool to spawn **ALL of the following sub-agents in a SINGLE message** (parallel execution). + + **Sub-agent configuration for ALL sub-agents:** + - `model: "haiku"` - Use the fast model to minimize cost and latency + - `max_turns: 5` - Prevent sub-agents from hanging indefinitely - **Sub-agent prompts (launch all 8 in parallel):** + **Sub-agent prompts (launch all 6 in parallel):** a. **Trigger/Safety test** - "Edit `manual_tests/test_trigger_safety_mode/feature.py` to add a comment, AND edit `manual_tests/test_trigger_safety_mode/feature_doc.md` to add a note. Both files must be edited so the rule does NOT fire." @@ -104,65 +116,72 @@ Run all 8 "should NOT fire" tests in **parallel** sub-agents, then verify no blo e. **Multi Safety test** - "Edit `manual_tests/test_multi_safety/core.py` to add a comment, AND edit `manual_tests/test_multi_safety/core_safety_a.md` to add a note. Both files must be edited so the rule does NOT fire." - f. **Infinite Block Prompt test** - "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block." - - g. **Infinite Block Command test** - "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block." - - h. 
**Created Mode test** - "Modify the EXISTING file `manual_tests/test_created_mode/existing.yml` by adding a comment. Do NOT create a new file - only modify the existing one. The created mode rule should NOT fire for modifications." + f. **Created Mode test** - "Modify the EXISTING file `manual_tests/test_created_mode/existing.yml` by adding a comment. Do NOT create a new file - only modify the existing one. The created mode rule should NOT fire for modifications." 2. **Observe the results** When each sub-agent returns: - - **If no blocking hook fired**: The test PASSED - the rule correctly did NOT fire + - **If no blocking hook fired**: Preliminary pass - proceed to queue verification - **If a blocking hook fired**: The test FAILED - investigate why the rule fired when it shouldn't have - **Remember**: You are OBSERVING whether hooks fired automatically. Do NOT run any verification commands manually. + **Remember**: You are OBSERVING whether hooks fired automatically. Do NOT run any verification commands manually during sub-agent execution. + +3. **Verify no queue entries** (CRITICAL for "should NOT fire" tests) + + After ALL sub-agents have completed, verify the rules queue is empty: + ```bash + ls -la .deepwork/tmp/rules/queue/ + cat .deepwork/tmp/rules/queue/*.json 2>/dev/null + ``` -3. **Record the results and check for early termination** + - **If queue is empty**: All tests PASSED - rules correctly did not fire + - **If queue has entries**: Tests FAILED - rules fired when they shouldn't have. Check which rule fired and investigate. + + This verification is essential because some rules may fire without visible blocking but still create queue entries. + +4. 
**Record the results and check for early termination** Track which tests passed and which failed: - | Test Case | Should NOT Fire | Result | - |-----------|:---------------:|:------:| - | Trigger/Safety | Edit both files | | - | Set Mode | Edit both files | | - | Pair Mode (forward) | Edit both files | | - | Pair Mode (reverse) | Edit expected only | | - | Multi Safety | Edit both files | | - | Infinite Block Prompt | Promise tag | | - | Infinite Block Command | Promise tag | | - | Created Mode | Modify existing | | + | Test Case | Should NOT Fire | Visible Block? | Queue Entry? | Result | + |-----------|:---------------:|:--------------:|:------------:|:------:| + | Trigger/Safety | Edit both files | | | | + | Set Mode | Edit both files | | | | + | Pair Mode (forward) | Edit both files | | | | + | Pair Mode (reverse) | Edit expected only | | | | + | Multi Safety | Edit both files | | | | + | Created Mode | Modify existing | | | | + + **Result criteria**: PASS only if NO visible block AND NO queue entry. FAIL if either occurred. **EARLY TERMINATION**: If **2 tests have failed**, immediately: 1. Stop running any remaining tests - 2. Revert all changes and clear queue (see step 4) + 2. Reset (see step 5) 3. Report the results summary showing which tests passed/failed 4. Do NOT proceed to the next step - the job halts here -4. **Revert all changes and clear queue** +5. **Reset** (MANDATORY - call the reset step internally) **IMPORTANT**: This step is MANDATORY and must run regardless of whether tests passed or failed. - Run these commands to clean up: + Follow the reset step instructions. 
Run these commands to clean up: ```bash git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml - rm -rf .deepwork/tmp/rules/queue/*.json 2>/dev/null || true + deepwork rules clear_queue ``` - **Why this command sequence**: - - `git reset HEAD manual_tests/` - Unstages files from the index (rules_check uses `git add -A` which stages changes) - - `git checkout -- manual_tests/` - Reverts working tree to match HEAD - - `rm -f manual_tests/test_created_mode/new_config.yml` - Removes any new files created during tests - - The queue clear removes rules that have been shown (status=QUEUED) so they can fire again + See [reset.md](reset.md) for detailed explanation of these commands. ## Quality Criteria -- **Sub-agents spawned**: All 8 tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly -- **Parallel execution**: All 8 sub-agents were launched in a single message (parallel) +- **Sub-agents spawned**: All 6 tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly +- **Correct sub-agent config**: All sub-agents used `model: "haiku"` and `max_turns: 5` +- **Parallel execution**: All 6 sub-agents were launched in a single message (parallel) - **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check +- **Queue verified empty**: After all sub-agents completed, the rules queue was checked and confirmed empty (no queue entries = rules did not fire) - **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported -- **Changes reverted and queue cleared**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` was run after tests completed (regardless of pass/fail) -- When all criteria are met, include `✓ Quality Criteria Met` in your response +- **Reset performed**: Reset step was 
followed after tests completed (regardless of pass/fail) +- When all criteria are met, include `Quality Criteria Met` in your response ## Reference @@ -170,7 +189,7 @@ See [test_reference.md](test_reference.md) for the complete test matrix and rule ## Context -This step runs first and tests that rules correctly do NOT fire when safety conditions are met. The "should fire" tests run after these complete and the working directory is reverted. +This step runs after the reset step (which ensures a clean environment) and tests that rules correctly do NOT fire when safety conditions are met. The "should fire" tests run after these complete. Infinite block tests are handled in a separate step. ### Job Context @@ -187,9 +206,20 @@ CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the fil edits itself - it spawns sub-agents to make edits, then observes whether the hooks fired automatically when those sub-agents returned. +Sub-agent configuration: +- All sub-agents should use `model: "haiku"` to minimize cost and latency +- All sub-agents should use `max_turns: 5` to prevent hanging indefinitely + Steps: -1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents -2. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with reverts between +1. reset - Ensure clean environment before testing (clears queue, reverts files) +2. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents (6 tests) +3. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with resets between (6 tests) +4. 
infinite_block_tests - Run infinite block tests in SERIAL (4 tests - both fire and not-fire) + +Reset procedure (see steps/reset.md): +- Reset runs FIRST to ensure a clean environment before any tests +- Each step also calls reset internally when needed (between tests, after completion) +- Reset reverts git changes, removes created files, and clears the rules queue Test types covered: - Trigger/Safety mode @@ -197,10 +227,15 @@ Test types covered: - Pair mode (directional) - Command action - Multi safety -- Infinite block (prompt and command) +- Infinite block (prompt and command) - in dedicated step - Created mode (new files only) +## Required Inputs + + +**Files from Previous Steps** - Read these first: +- `clean_environment` (from `reset`) ## Work Branch @@ -227,10 +262,12 @@ Stop hooks will automatically validate your work. The loop continues until all c **Criteria (all must be satisfied)**: 1. **Sub-Agents Used**: Did the main agent spawn sub-agents (using the Task tool) to make the file edits? The main agent must NOT edit the test files directly. -2. **Parallel Execution**: Were multiple sub-agents launched in parallel (in a single message with multiple Task tool calls)? -3. **Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command. -4. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? -5. **Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json`? +2. **Sub-Agent Config**: Did all sub-agents use `model: "haiku"` and `max_turns: 5`? +3. **Parallel Execution**: Were all 6 sub-agents launched in parallel (in a single message with multiple Task tool calls)? +4. 
**Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command. +5. **Queue Verified Empty**: After all sub-agents completed, was the rules queue checked and confirmed empty (no entries = rules did not fire)? +6. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? +7. **Reset Performed**: Was the reset step called internally after tests completed (or after early termination)? **To complete**: Include `✓ Quality Criteria Met` in your final response only after verifying ALL criteria are satisfied. @@ -238,7 +275,7 @@ Stop hooks will automatically validate your work. The loop continues until all c ## On Completion 1. Verify outputs are created -2. Inform user: "Step 1/2 complete, outputs: not_fire_results" +2. Inform user: "Step 2/4 complete, outputs: not_fire_results" 3. **Continue workflow**: Use Skill tool to invoke `/manual_tests.run_fire_tests` --- diff --git a/.claude/skills/manual_tests/SKILL.md b/.claude/skills/manual_tests/SKILL.md index bf97b88a..94e3b0ad 100644 --- a/.claude/skills/manual_tests/SKILL.md +++ b/.claude/skills/manual_tests/SKILL.md @@ -21,9 +21,20 @@ CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the fil edits itself - it spawns sub-agents to make edits, then observes whether the hooks fired automatically when those sub-agents returned. +Sub-agent configuration: +- All sub-agents should use `model: "haiku"` to minimize cost and latency +- All sub-agents should use `max_turns: 5` to prevent hanging indefinitely + Steps: -1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents -2. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with reverts between +1. reset - Ensure clean environment before testing (clears queue, reverts files) +2. 
run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents (6 tests) +3. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with resets between (6 tests) +4. infinite_block_tests - Run infinite block tests in SERIAL (4 tests - both fire and not-fire) + +Reset procedure (see steps/reset.md): +- Reset runs FIRST to ensure a clean environment before any tests +- Each step also calls reset internally when needed (between tests, after completion) +- Reset reverts git changes, removes created files, and clears the rules queue Test types covered: - Trigger/Safety mode @@ -31,28 +42,32 @@ Test types covered: - Pair mode (directional) - Command action - Multi safety -- Infinite block (prompt and command) +- Infinite block (prompt and command) - in dedicated step - Created mode (new files only) ## Available Steps -1. **run_not_fire_tests** - Runs all 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met. -2. **run_fire_tests** - Runs all 'should fire' tests serially with git reverts between each. Use after NOT-fire tests to verify rules fire correctly. (requires: run_not_fire_tests) +1. **reset** - Runs FIRST to ensure clean environment. Also called internally by other steps when they need to revert changes and clear the queue. +2. **run_not_fire_tests** - Runs all 6 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met. (requires: reset) +3. **run_fire_tests** - Runs all 6 'should fire' tests serially with resets between each. Use after NOT-fire tests to verify rules fire correctly. (requires: run_not_fire_tests) +4. **infinite_block_tests** - Runs all 4 infinite block tests serially. Tests both 'should fire' (no promise) and 'should NOT fire' (with promise) scenarios. 
(requires: run_fire_tests) ## Execution Instructions ### Step 1: Analyze Intent Parse any text following `/manual_tests` to determine user intent: +- "reset" or related terms → start at `manual_tests.reset` - "run_not_fire_tests" or related terms → start at `manual_tests.run_not_fire_tests` - "run_fire_tests" or related terms → start at `manual_tests.run_fire_tests` +- "infinite_block_tests" or related terms → start at `manual_tests.infinite_block_tests` ### Step 2: Invoke Starting Step Use the Skill tool to invoke the identified starting step: ``` -Skill tool: manual_tests.run_not_fire_tests +Skill tool: manual_tests.reset ``` ### Step 3: Continue Workflow Automatically diff --git a/.deepwork/jobs/manual_tests/job.yml b/.deepwork/jobs/manual_tests/job.yml index b2662c2c..1751a96d 100644 --- a/.deepwork/jobs/manual_tests/job.yml +++ b/.deepwork/jobs/manual_tests/job.yml @@ -1,5 +1,5 @@ name: manual_tests -version: "1.2.1" +version: "1.3.0" summary: "Runs all manual hook/rule tests using sub-agents. Use when validating that DeepWork rules fire correctly." description: | A workflow for running manual tests that validate DeepWork rules/hooks fire correctly. @@ -14,9 +14,20 @@ description: | edits itself - it spawns sub-agents to make edits, then observes whether the hooks fired automatically when those sub-agents returned. + Sub-agent configuration: + - All sub-agents should use `model: "haiku"` to minimize cost and latency + - All sub-agents should use `max_turns: 5` to prevent hanging indefinitely + Steps: - 1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents - 2. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with reverts between + 1. reset - Ensure clean environment before testing (clears queue, reverts files) + 2. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents (6 tests) + 3. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with resets between (6 tests) + 4. 
infinite_block_tests - Run infinite block tests in SERIAL (4 tests - both fire and not-fire) + + Reset procedure (see steps/reset.md): + - Reset runs FIRST to ensure a clean environment before any tests + - Each step also calls reset internally when needed (between tests, after completion) + - Reset reverts git changes, removes created files, and clears the rules queue Test types covered: - Trigger/Safety mode @@ -24,10 +35,12 @@ description: | - Pair mode (directional) - Command action - Multi safety - - Infinite block (prompt and command) + - Infinite block (prompt and command) - in dedicated step - Created mode (new files only) changelog: + - version: "1.3.0" + changes: "Add model/max_turns config for sub-agents; move infinite block tests to dedicated serial step; add reset step that runs first; verify queue empty for 'should NOT fire' tests" - version: "1.2.1" changes: "Fixed incomplete revert - now uses git reset HEAD to unstage files (rules_check stages with git add -A)" - version: "1.2.0" @@ -38,36 +51,70 @@ changelog: changes: "Initial job creation - tests run in sub-agents to observe automatic hook firing" steps: + - id: reset + name: "Reset Manual Tests Environment" + description: "Runs FIRST to ensure clean environment. Also called internally by other steps when they need to revert changes and clear the queue." + instructions_file: steps/reset.md + inputs: [] + outputs: + - clean_environment + dependencies: [] + quality_criteria: + - "**Environment Clean**: Git changes reverted, created files removed, and rules queue cleared" + - id: run_not_fire_tests name: "Run Should-NOT-Fire Tests" - description: "Runs all 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met." + description: "Runs all 6 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met." 
instructions_file: steps/run_not_fire_tests.md - inputs: [] + inputs: + - file: clean_environment + from_step: reset outputs: - - not_fire_results # implicit state: all "should NOT fire" tests passed - dependencies: [] + - not_fire_results + dependencies: + - reset quality_criteria: - "**Sub-Agents Used**: Did the main agent spawn sub-agents (using the Task tool) to make the file edits? The main agent must NOT edit the test files directly." - - "**Parallel Execution**: Were multiple sub-agents launched in parallel (in a single message with multiple Task tool calls)?" + - "**Sub-Agent Config**: Did all sub-agents use `model: \"haiku\"` and `max_turns: 5`?" + - "**Parallel Execution**: Were all 6 sub-agents launched in parallel (in a single message with multiple Task tool calls)?" - "**Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command." + - "**Queue Verified Empty**: After all sub-agents completed, was the rules queue checked and confirmed empty (no entries = rules did not fire)?" - "**Early Termination**: If 2 tests failed, did testing halt immediately with results reported?" - - "**Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json`?" + - "**Reset Performed**: Was the reset step called internally after tests completed (or after early termination)?" - id: run_fire_tests name: "Run Should-Fire Tests" - description: "Runs all 'should fire' tests serially with git reverts between each. Use after NOT-fire tests to verify rules fire correctly." + description: "Runs all 6 'should fire' tests serially with resets between each. Use after NOT-fire tests to verify rules fire correctly." 
instructions_file: steps/run_fire_tests.md inputs: - file: not_fire_results from_step: run_not_fire_tests outputs: - - fire_results # implicit state: all "should fire" tests passed + - fire_results dependencies: - run_not_fire_tests quality_criteria: - "**Sub-Agents Used**: Did the main agent spawn a sub-agent (using the Task tool) for EACH test? The main agent must NOT edit the test files directly." + - "**Sub-Agent Config**: Did all sub-agents use `model: \"haiku\"` and `max_turns: 5`?" - "**Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination?" - "**Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command." - - "**Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` run after each test to revert files and prevent cross-contamination?" + - "**Reset Between Tests**: Was the reset step called internally after each test to revert files and prevent cross-contamination?" - "**Early Termination**: If 2 tests failed, did testing halt immediately with results reported?" - "**Results Recorded**: Did the main agent track pass/fail status for each test case?" + + - id: infinite_block_tests + name: "Run Infinite Block Tests" + description: "Runs all 4 infinite block tests serially. Tests both 'should fire' (no promise) and 'should NOT fire' (with promise) scenarios." 
+    instructions_file: steps/infinite_block_tests.md
+    inputs:
+      - file: fire_results
+        from_step: run_fire_tests
+    outputs:
+      - infinite_block_results
+    dependencies:
+      - run_fire_tests
+    quality_criteria:
+      - "**Sub-Agents Used**: Each test run via Task tool with `model: \"haiku\"` and `max_turns: 5`"
+      - "**Serial Execution**: Sub-agents launched ONE AT A TIME with reset between each"
+      - "**Promise Tests**: Completed WITHOUT blocking (promise bypassed the rule)"
+      - "**No-Promise Tests**: Hook fired AND sub-agent returned in reasonable time (not hung)"
diff --git a/.deepwork/jobs/manual_tests/steps/infinite_block_tests.md b/.deepwork/jobs/manual_tests/steps/infinite_block_tests.md
new file mode 100644
index 00000000..5932c9e2
--- /dev/null
+++ b/.deepwork/jobs/manual_tests/steps/infinite_block_tests.md
@@ -0,0 +1,136 @@
+# Run Infinite Block Tests
+
+## Objective
+
+Run all infinite block tests in **serial** to verify that infinite blocking rules work correctly - both firing when they should AND not firing when bypassed with a promise tag.
+
+## CRITICAL: Sub-Agent Requirement
+
+**You MUST spawn sub-agents to make all file edits. DO NOT edit the test files yourself.**
+
+Why sub-agents are required:
+1. Sub-agents run in isolated contexts where file changes are detected
+2. When a sub-agent completes, the Stop hook **automatically** evaluates rules
+3. You (the main agent) observe whether hooks fired - you do NOT manually trigger them
+4. If you edit files directly, the hooks won't fire because you're not a completing sub-agent
+
+**NEVER manually run `echo '{}' | python -m deepwork.hooks.rules_check`** - this defeats the purpose of the test. Hooks must fire AUTOMATICALLY when sub-agents return.
+ +## CRITICAL: Serial Execution + +**These tests MUST run ONE AT A TIME, with resets between each.** + +Why serial execution is required for infinite block tests: +- Infinite block tests can block indefinitely without a promise tag +- Running them in parallel would cause unpredictable blocking behavior +- Serial execution allows controlled observation of each test + +## Task + +Run all 4 infinite block tests in **serial**, resetting between each, and verify correct blocking behavior. + +### Process + +For EACH test below, follow this cycle: + +1. **Launch a sub-agent** using the Task tool with: + - `model: "haiku"` - Use the fast model to minimize cost and latency + - `max_turns: 5` - **Critical safeguard**: Limits API round-trips to prevent infinite hanging. The Task tool does not support a direct timeout, so max_turns is our only protection against runaway sub-agents. +2. **Wait for the sub-agent to complete** +3. **Observe whether the hook fired automatically** - you should see a blocking prompt or command output +4. **If no visible blocking occurred, check the queue**: + ```bash + ls -la .deepwork/tmp/rules/queue/ + cat .deepwork/tmp/rules/queue/*.json 2>/dev/null + ``` + - If queue entries exist with status "queued", the hook DID fire but blocking wasn't visible + - If queue is empty, the hook did NOT fire at all + - Record the queue status along with the result +5. **Record the result** - see expected outcomes for each test +6. **Reset** (MANDATORY after each test) - follow the reset step instructions: + ```bash + git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml + deepwork rules clear_queue + ``` +7. **Check for early termination**: If **2 tests have now failed**, immediately: + - Stop running any remaining tests + - Report the results summary showing which tests passed/failed + - The job halts here - do NOT proceed with remaining tests +8. 
**Proceed to the next test** (only if fewer than 2 failures) + +**IMPORTANT**: Only launch ONE sub-agent at a time. Wait for it to complete and reset before launching the next. + +### Test Cases (run serially) + +**Test 1: Infinite Block Prompt - Should NOT Fire (with promise)** +- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` +- Expected: Sub-agent completes WITHOUT blocking - the promise tag bypasses the infinite block +- Result: PASS if no blocking, FAIL if blocked + +**Test 2: Infinite Block Command - Should NOT Fire (with promise)** +- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` +- Expected: Sub-agent completes WITHOUT blocking - the promise tag bypasses the infinite block +- Result: PASS if no blocking, FAIL if blocked + +**Test 3: Infinite Block Prompt - Should Fire (no promise)** +- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Do NOT include any promise tags." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` +- Expected behavior: + 1. **Should fire**: Hook fires and shows blocking prompt + 2. **Should return in reasonable time**: Sub-agent hits max_turns limit and returns (not stuck forever) +- Result criteria: + - PASS if: Hook fired (visible block OR queue entry) AND sub-agent returned within reasonable time + - FAIL if: Hook did not fire, OR sub-agent hung indefinitely + +**Test 4: Infinite Block Command - Should Fire (no promise)** +- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Do NOT include any promise tags." 
+- Sub-agent config: `model: "haiku"`, `max_turns: 5` +- Expected behavior: + 1. **Should fire**: Hook fires and command fails (exit code 1) + 2. **Should return in reasonable time**: Sub-agent hits max_turns limit and returns (not stuck forever) +- Result criteria: + - PASS if: Hook fired (visible block OR queue entry) AND sub-agent returned within reasonable time + - FAIL if: Hook did not fire, OR sub-agent hung indefinitely + +### Results Tracking + +Record the result after each test: + +| Test Case | Scenario | Should Fire? | Returned in Time? | Visible Block? | Queue Entry? | Result | +|-----------|----------|:------------:|:-----------------:|:--------------:|:------------:|:------:| +| Infinite Block Prompt | With promise | No | Yes | | | | +| Infinite Block Command | With promise | No | Yes | | | | +| Infinite Block Prompt | No promise | Yes | Yes | | | | +| Infinite Block Command | No promise | Yes | Yes | | | | + +**Result criteria:** +- **"Should NOT fire" tests (with promise)**: PASS if no blocking AND no queue entry AND returned quickly +- **"Should fire" tests (no promise)**: PASS if hook fired (visible block OR queue entry) AND returned in reasonable time (max_turns limit) + +**Queue Entry Status Guide:** +- If queue has entry with status "queued" -> Hook fired, rule was shown to agent +- If queue has entry with status "passed" -> Hook fired, rule was satisfied +- If queue is empty -> Hook did NOT fire + +## Quality Criteria + +- **Sub-agents spawned**: Tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly +- **Correct sub-agent config**: All sub-agents used `model: "haiku"` and `max_turns: 5` +- **Serial execution**: Sub-agents were launched ONE AT A TIME, not in parallel +- **Reset between tests**: Reset step was followed after each test +- **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check - hooks fired AUTOMATICALLY +- **"Should NOT fire" 
tests verified**: Promise tests completed without blocking and no queue entries +- **"Should fire" tests verified**: Non-promise tests fired (visible block OR queue entry) AND returned in reasonable time (not hung indefinitely) +- **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported +- **Results recorded**: Pass/fail status was recorded for each test run +- When all criteria are met, include `Quality Criteria Met` in your response + +## Reference + +See [test_reference.md](test_reference.md) for the complete test matrix and rule descriptions. + +## Context + +This step runs after both the "should NOT fire" and "should fire" test steps. It specifically tests infinite blocking behavior which requires serial execution due to the blocking nature of these rules. diff --git a/.deepwork/jobs/manual_tests/steps/reset.md b/.deepwork/jobs/manual_tests/steps/reset.md new file mode 100644 index 00000000..b6eb4fb7 --- /dev/null +++ b/.deepwork/jobs/manual_tests/steps/reset.md @@ -0,0 +1,38 @@ +# Reset Manual Tests Environment + +## Objective + +Reset the manual tests environment by reverting all file changes and clearing the rules queue. + +## Purpose + +This step contains all the reset logic that other steps can call when they need to clean up between or after tests. It ensures consistent cleanup across all test steps. 
+ +## Reset Commands + +Run these commands to reset the environment: + +```bash +git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml +deepwork rules clear_queue +``` + +## Command Explanation + +- `git reset HEAD manual_tests/` - Unstages files from the index (rules_check uses `git add -A` which stages changes) +- `git checkout -- manual_tests/` - Reverts working tree to match HEAD +- `rm -f manual_tests/test_created_mode/new_config.yml` - Removes any new files created during tests (the created mode test creates this file) +- `deepwork rules clear_queue` - Clears the rules queue so rules can fire again (prevents anti-infinite-loop mechanism from blocking subsequent tests) + +## When to Reset + +- **After each serial test**: Reset immediately after observing the result to prevent cross-contamination +- **After parallel tests complete**: Reset once all parallel sub-agents have returned +- **On early termination**: Reset before reporting failure results +- **Before starting a new test step**: Ensure clean state + +## Quality Criteria + +- **All changes reverted**: `git status` shows no changes in `manual_tests/` +- **Queue cleared**: `.deepwork/tmp/rules/queue/` is empty +- **New files removed**: `manual_tests/test_created_mode/new_config.yml` does not exist diff --git a/.deepwork/jobs/manual_tests/steps/run_fire_tests.md b/.deepwork/jobs/manual_tests/steps/run_fire_tests.md index 27f3cfc8..787dc3ef 100644 --- a/.deepwork/jobs/manual_tests/steps/run_fire_tests.md +++ b/.deepwork/jobs/manual_tests/steps/run_fire_tests.md @@ -18,23 +18,25 @@ Why sub-agents are required: ## CRITICAL: Serial Execution -**These tests MUST run ONE AT A TIME, with git reverts between each.** +**These tests MUST run ONE AT A TIME, with resets between each.** Why serial execution is required: - These tests edit ONLY the trigger file (not the safety) - If multiple sub-agents run in parallel, sub-agent A's hook will see changes 
from sub-agent B - This causes cross-contamination: A gets blocked by rules triggered by B's changes -- Run one test, observe the hook, revert, then run the next +- Run one test, observe the hook, reset, then run the next ## Task -Run all 8 "should fire" tests in **serial** sub-agents, reverting between each, and verify that blocking hooks fire automatically. +Run all 6 "should fire" tests in **serial** sub-agents, resetting between each, and verify that blocking hooks fire automatically. ### Process For EACH test below, follow this cycle: -1. **Launch a sub-agent** using the Task tool (use a fast model like haiku) +1. **Launch a sub-agent** using the Task tool with: + - `model: "haiku"` - Use the fast model to minimize cost and latency + - `max_turns: 5` - Prevent sub-agents from hanging indefinitely 2. **Wait for the sub-agent to complete** 3. **Observe whether the hook fired automatically** - you should see a blocking prompt or command output 4. **If no visible blocking occurred, check the queue**: @@ -46,56 +48,50 @@ For EACH test below, follow this cycle: - If queue is empty, the hook did NOT fire at all - Record the queue status along with the result 5. **Record the result** - pass if hook fired (visible block OR queue entry), fail if neither -6. **Revert changes and clear queue** (MANDATORY after each test): +6. 
**Reset** (MANDATORY after each test) - follow the reset step instructions: ```bash git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml deepwork rules clear_queue ``` - **Why this command sequence**: - - `git reset HEAD manual_tests/` - Unstages files from the index (rules_check uses `git add -A` which stages changes) - - `git checkout -- manual_tests/` - Reverts working tree to match HEAD - - `rm -f ...` - Removes any new files created during tests - - `deepwork rules clear_queue` - Clears the rules queue so rules can fire again + See [reset.md](reset.md) for detailed explanation of these commands. 7. **Check for early termination**: If **2 tests have now failed**, immediately: - Stop running any remaining tests - Report the results summary showing which tests passed/failed - The job halts here - do NOT proceed with remaining tests 8. **Proceed to the next test** (only if fewer than 2 failures) -**IMPORTANT**: Only launch ONE sub-agent at a time. Wait for it to complete and revert before launching the next. +**IMPORTANT**: Only launch ONE sub-agent at a time. Wait for it to complete and reset before launching the next. ### Test Cases (run serially) **Test 1: Trigger/Safety** - Sub-agent prompt: "Edit ONLY `manual_tests/test_trigger_safety_mode/feature.py` to add a comment. Do NOT edit the `_doc.md` file." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about updating documentation **Test 2: Set Mode** - Sub-agent prompt: "Edit ONLY `manual_tests/test_set_mode/module_source.py` to add a comment. Do NOT edit the `_test.py` file." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about updating tests **Test 3: Pair Mode** - Sub-agent prompt: "Edit ONLY `manual_tests/test_pair_mode/handler_trigger.py` to add a comment. Do NOT edit the `_expected.md` file." 
+- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about updating expected output **Test 4: Command Action** - Sub-agent prompt: "Edit `manual_tests/test_command_action/input.txt` to add some text." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Command runs automatically, appending to the log file (this rule always runs, no safety condition) **Test 5: Multi Safety** - Sub-agent prompt: "Edit ONLY `manual_tests/test_multi_safety/core.py` to add a comment. Do NOT edit any of the safety files (`_safety_a.md`, `_safety_b.md`, or `_safety_c.md`)." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about updating safety documentation -**Test 6: Infinite Block Prompt** -- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Do NOT include any promise tags." -- Expected: Hook fires and BLOCKS with infinite prompt - sub-agent cannot complete until promise is provided - -**Test 7: Infinite Block Command** -- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Do NOT include any promise tags." -- Expected: Hook fires and command fails - sub-agent cannot complete until promise is provided - -**Test 8: Created Mode** +**Test 6: Created Mode** - Sub-agent prompt: "Create a NEW file `manual_tests/test_created_mode/new_config.yml` with some YAML content. This must be a NEW file, not a modification." 
+- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about new configuration files ### Results Tracking @@ -109,25 +105,23 @@ Record the result after each test: | Pair Mode | Edit _trigger.py only | | | | | Command Action | Edit .txt | | | | | Multi Safety | Edit .py only | | | | -| Infinite Block Prompt | Edit .py (no promise) | | | | -| Infinite Block Command | Edit .py (no promise) | | | | | Created Mode | Create NEW .yml | | | | **Queue Entry Status Guide:** -- If queue has entry with status "queued" → Hook fired, rule was shown to agent -- If queue has entry with status "passed" → Hook fired, rule was satisfied -- If queue is empty → Hook did NOT fire +- If queue has entry with status "queued" -> Hook fired, rule was shown to agent +- If queue has entry with status "passed" -> Hook fired, rule was satisfied +- If queue is empty -> Hook did NOT fire ## Quality Criteria - **Sub-agents spawned**: Tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly +- **Correct sub-agent config**: All sub-agents used `model: "haiku"` and `max_turns: 5` - **Serial execution**: Sub-agents were launched ONE AT A TIME, not in parallel -- **Git reverted and queue cleared between tests**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` was run after each test -- **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check - hooks fired AUTOMATICALLY -- **Blocking behavior verified**: For each test run, the appropriate blocking hook fired automatically when the sub-agent returned +- **Reset between tests**: Reset step was followed after each test +- **Hooks fired automatically**: The main agent observed the blocking hooks firing automatically when each sub-agent returned - the agent did NOT manually run rules_check - **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and 
results were reported -- **Results recorded**: Pass/fail status was recorded for each test run -- When all criteria are met, include `✓ Quality Criteria Met` in your response +- **Results recorded**: Pass/fail status was recorded for each test case +- When all criteria are met, include `Quality Criteria Met` in your response ## Reference @@ -135,4 +129,4 @@ See [test_reference.md](test_reference.md) for the complete test matrix and rule ## Context -This step runs after the "should NOT fire" tests. These tests verify that rules correctly fire when trigger conditions are met without safety conditions. The serial execution with reverts is essential to prevent cross-contamination between tests. +This step runs after the "should NOT fire" tests. These tests verify that rules correctly fire when trigger conditions are met without safety conditions. The serial execution with resets is essential to prevent cross-contamination between tests. Infinite block tests are handled in a separate step. diff --git a/.deepwork/jobs/manual_tests/steps/run_not_fire_tests.md b/.deepwork/jobs/manual_tests/steps/run_not_fire_tests.md index 2fb25975..2982c69b 100644 --- a/.deepwork/jobs/manual_tests/steps/run_not_fire_tests.md +++ b/.deepwork/jobs/manual_tests/steps/run_not_fire_tests.md @@ -18,15 +18,19 @@ Why sub-agents are required: ## Task -Run all 8 "should NOT fire" tests in **parallel** sub-agents, then verify no blocking hooks fired. +Run all 6 "should NOT fire" tests in **parallel** sub-agents, then verify no blocking hooks fired. ### Process 1. **Launch parallel sub-agents for all "should NOT fire" tests** - Use the Task tool to spawn **ALL of the following sub-agents in a SINGLE message** (parallel execution). Each sub-agent should use a fast model like haiku. + Use the Task tool to spawn **ALL of the following sub-agents in a SINGLE message** (parallel execution). 
- **Sub-agent prompts (launch all 8 in parallel):** + **Sub-agent configuration for ALL sub-agents:** + - `model: "haiku"` - Use the fast model to minimize cost and latency + - `max_turns: 5` - Prevent sub-agents from hanging indefinitely + + **Sub-agent prompts (launch all 6 in parallel):** a. **Trigger/Safety test** - "Edit `manual_tests/test_trigger_safety_mode/feature.py` to add a comment, AND edit `manual_tests/test_trigger_safety_mode/feature_doc.md` to add a note. Both files must be edited so the rule does NOT fire." @@ -38,65 +42,72 @@ Run all 8 "should NOT fire" tests in **parallel** sub-agents, then verify no blo e. **Multi Safety test** - "Edit `manual_tests/test_multi_safety/core.py` to add a comment, AND edit `manual_tests/test_multi_safety/core_safety_a.md` to add a note. Both files must be edited so the rule does NOT fire." - f. **Infinite Block Prompt test** - "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block." - - g. **Infinite Block Command test** - "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block." - - h. **Created Mode test** - "Modify the EXISTING file `manual_tests/test_created_mode/existing.yml` by adding a comment. Do NOT create a new file - only modify the existing one. The created mode rule should NOT fire for modifications." + f. **Created Mode test** - "Modify the EXISTING file `manual_tests/test_created_mode/existing.yml` by adding a comment. Do NOT create a new file - only modify the existing one. The created mode rule should NOT fire for modifications." 2. 
**Observe the results** When each sub-agent returns: - - **If no blocking hook fired**: The test PASSED - the rule correctly did NOT fire + - **If no blocking hook fired**: Preliminary pass - proceed to queue verification - **If a blocking hook fired**: The test FAILED - investigate why the rule fired when it shouldn't have - **Remember**: You are OBSERVING whether hooks fired automatically. Do NOT run any verification commands manually. + **Remember**: You are OBSERVING whether hooks fired automatically. Do NOT run any verification commands manually during sub-agent execution. + +3. **Verify no queue entries** (CRITICAL for "should NOT fire" tests) + + After ALL sub-agents have completed, verify the rules queue is empty: + ```bash + ls -la .deepwork/tmp/rules/queue/ + cat .deepwork/tmp/rules/queue/*.json 2>/dev/null + ``` + + - **If queue is empty**: All tests PASSED - rules correctly did not fire + - **If queue has entries**: Tests FAILED - rules fired when they shouldn't have. Check which rule fired and investigate. -3. **Record the results and check for early termination** + This verification is essential because some rules may fire without visible blocking but still create queue entries. + +4. **Record the results and check for early termination** Track which tests passed and which failed: - | Test Case | Should NOT Fire | Result | - |-----------|:---------------:|:------:| - | Trigger/Safety | Edit both files | | - | Set Mode | Edit both files | | - | Pair Mode (forward) | Edit both files | | - | Pair Mode (reverse) | Edit expected only | | - | Multi Safety | Edit both files | | - | Infinite Block Prompt | Promise tag | | - | Infinite Block Command | Promise tag | | - | Created Mode | Modify existing | | + | Test Case | Should NOT Fire | Visible Block? | Queue Entry? 
| Result | + |-----------|:---------------:|:--------------:|:------------:|:------:| + | Trigger/Safety | Edit both files | | | | + | Set Mode | Edit both files | | | | + | Pair Mode (forward) | Edit both files | | | | + | Pair Mode (reverse) | Edit expected only | | | | + | Multi Safety | Edit both files | | | | + | Created Mode | Modify existing | | | | + + **Result criteria**: PASS only if NO visible block AND NO queue entry. FAIL if either occurred. **EARLY TERMINATION**: If **2 tests have failed**, immediately: 1. Stop running any remaining tests - 2. Revert all changes and clear queue (see step 4) + 2. Reset (see step 5) 3. Report the results summary showing which tests passed/failed 4. Do NOT proceed to the next step - the job halts here -4. **Revert all changes and clear queue** +5. **Reset** (MANDATORY - call the reset step internally) **IMPORTANT**: This step is MANDATORY and must run regardless of whether tests passed or failed. - Run these commands to clean up: + Follow the reset step instructions. Run these commands to clean up: ```bash git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml deepwork rules clear_queue ``` - **Why this command sequence**: - - `git reset HEAD manual_tests/` - Unstages files from the index (rules_check uses `git add -A` which stages changes) - - `git checkout -- manual_tests/` - Reverts working tree to match HEAD - - `rm -f manual_tests/test_created_mode/new_config.yml` - Removes any new files created during tests - - `deepwork rules clear_queue` - Clears the rules queue so rules can fire again + See [reset.md](reset.md) for detailed explanation of these commands. 
## Quality Criteria -- **Sub-agents spawned**: All 8 tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly -- **Parallel execution**: All 8 sub-agents were launched in a single message (parallel) +- **Sub-agents spawned**: All 6 tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly +- **Correct sub-agent config**: All sub-agents used `model: "haiku"` and `max_turns: 5` +- **Parallel execution**: All 6 sub-agents were launched in a single message (parallel) - **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check +- **Queue verified empty**: After all sub-agents completed, the rules queue was checked and confirmed empty (no queue entries = rules did not fire) - **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported -- **Changes reverted and queue cleared**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `deepwork rules clear_queue` was run after tests completed (regardless of pass/fail) -- When all criteria are met, include `✓ Quality Criteria Met` in your response +- **Reset performed**: Reset step was followed after tests completed (regardless of pass/fail) +- When all criteria are met, include `Quality Criteria Met` in your response ## Reference @@ -104,4 +115,4 @@ See [test_reference.md](test_reference.md) for the complete test matrix and rule ## Context -This step runs first and tests that rules correctly do NOT fire when safety conditions are met. The "should fire" tests run after these complete and the working directory is reverted. +This step runs after the reset step (which ensures a clean environment) and tests that rules correctly do NOT fire when safety conditions are met. The "should fire" tests run after these complete. Infinite block tests are handled in a separate step. 
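Reviewer note (not part of the patch): the pass/fail rule these step files repeat ("PASS only if NO visible block AND NO queue entry" for should-NOT-fire tests; "visible block OR queue entry" for should-fire tests) can be sketched as a small helper. This is a minimal sketch under stated assumptions, not DeepWork's implementation: the function names `queue_statuses` and `evaluate` are hypothetical, and the assumption that each queue file is a JSON object with a top-level `status` field is inferred only from the "Queue Entry Status Guide" wording in the diff.

```python
import json
from pathlib import Path

# Queue location as written in the step files above.
DEFAULT_QUEUE_DIR = ".deepwork/tmp/rules/queue"


def queue_statuses(queue_dir=DEFAULT_QUEUE_DIR):
    """Return one status string per *.json file in the queue directory.

    ASSUMPTION: each entry is a JSON object with a top-level "status" key
    (e.g. "queued" or "passed"). The real schema is not shown in the diff.
    """
    queue = Path(queue_dir)
    if not queue.is_dir():
        return []  # no queue directory is treated the same as an empty queue

    statuses = []
    for path in sorted(queue.glob("*.json")):
        try:
            entry = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            statuses.append("unreadable")  # still counts as a queue entry
            continue
        if isinstance(entry, dict):
            statuses.append(str(entry.get("status", "unknown")))
        else:
            statuses.append("unknown")
    return statuses


def evaluate(should_fire, visible_block, statuses):
    """Apply the result criteria from the step files.

    A hook counts as fired if a block was visible OR any queue entry exists.
    A test passes when "fired" matches the expectation for that test.
    """
    fired = visible_block or bool(statuses)
    return "PASS" if fired == should_fire else "FAIL"


if __name__ == "__main__":
    # Example: a should-NOT-fire test where the agent saw no visible block.
    print(evaluate(should_fire=False, visible_block=False,
                   statuses=queue_statuses()))
```

The sketch covers only the "did the hook fire" half of the criteria. The "returned in reasonable time" condition used by the infinite block tests still has to be judged by the main agent, since it depends on the sub-agent hitting its `max_turns` limit rather than on queue state.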
diff --git a/.gemini/skills/manual_tests/index.toml b/.gemini/skills/manual_tests/index.toml index 854ad223..775a7fab 100644 --- a/.gemini/skills/manual_tests/index.toml +++ b/.gemini/skills/manual_tests/index.toml @@ -25,9 +25,20 @@ CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the fil edits itself - it spawns sub-agents to make edits, then observes whether the hooks fired automatically when those sub-agents returned. +Sub-agent configuration: +- All sub-agents should use `model: "haiku"` to minimize cost and latency +- All sub-agents should use `max_turns: 5` to prevent hanging indefinitely + Steps: -1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents -2. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with reverts between +1. reset - Ensure clean environment before testing (clears queue, reverts files) +2. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents (6 tests) +3. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with resets between (6 tests) +4. infinite_block_tests - Run infinite block tests in SERIAL (4 tests - both fire and not-fire) + +Reset procedure (see steps/reset.md): +- Reset runs FIRST to ensure a clean environment before any tests +- Each step also calls reset internally when needed (between tests, after completion) +- Reset reverts git changes, removes created files, and clears the rules queue Test types covered: - Trigger/Safety mode @@ -35,30 +46,36 @@ Test types covered: - Pair mode (directional) - Command action - Multi safety -- Infinite block (prompt and command) +- Infinite block (prompt and command) - in dedicated step - Created mode (new files only) ## Available Steps -1. **run_not_fire_tests** - Runs all 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met. +1. **reset** - Runs FIRST to ensure clean environment. 
Also called internally by other steps when they need to revert changes and clear the queue. + Command: `/manual_tests:reset` +2. **run_not_fire_tests** - Runs all 6 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met. (requires: reset) Command: `/manual_tests:run_not_fire_tests` -2. **run_fire_tests** - Runs all 'should fire' tests serially with git reverts between each. Use after NOT-fire tests to verify rules fire correctly. (requires: run_not_fire_tests) +3. **run_fire_tests** - Runs all 6 'should fire' tests serially with resets between each. Use after NOT-fire tests to verify rules fire correctly. (requires: run_not_fire_tests) Command: `/manual_tests:run_fire_tests` +4. **infinite_block_tests** - Runs all 4 infinite block tests serially. Tests both 'should fire' (no promise) and 'should NOT fire' (with promise) scenarios. (requires: run_fire_tests) + Command: `/manual_tests:infinite_block_tests` ## Execution Instructions ### Step 1: Analyze Intent Parse any text following `/manual_tests` to determine user intent: +- "reset" or related terms → start at `/manual_tests:reset` - "run_not_fire_tests" or related terms → start at `/manual_tests:run_not_fire_tests` - "run_fire_tests" or related terms → start at `/manual_tests:run_fire_tests` +- "infinite_block_tests" or related terms → start at `/manual_tests:infinite_block_tests` ### Step 2: Direct User to Starting Step Tell the user which command to run: ``` -/manual_tests:run_not_fire_tests +/manual_tests:reset ``` ### Step 3: Guide Through Workflow diff --git a/.gemini/skills/manual_tests/infinite_block_tests.toml b/.gemini/skills/manual_tests/infinite_block_tests.toml new file mode 100644 index 00000000..929a2601 --- /dev/null +++ b/.gemini/skills/manual_tests/infinite_block_tests.toml @@ -0,0 +1,238 @@ +# manual_tests:infinite_block_tests +# +# Runs all 4 infinite block tests serially. 
Tests both 'should fire' (no promise) and 'should NOT fire' (with promise) scenarios. +# +# Generated by DeepWork - do not edit manually + +description = "Runs all 4 infinite block tests serially. Tests both 'should fire' (no promise) and 'should NOT fire' (with promise) scenarios." + +prompt = """ +# manual_tests:infinite_block_tests + +**Step 4/4** in **manual_tests** workflow + +> Runs all manual hook/rule tests using sub-agents. Use when validating that DeepWork rules fire correctly. + +## Prerequisites (Verify First) + +Before proceeding, confirm these steps are complete: +- `/manual_tests:run_fire_tests` + +## Instructions + +**Goal**: Runs all 4 infinite block tests serially. Tests both 'should fire' (no promise) and 'should NOT fire' (with promise) scenarios. + +# Run Infinite Block Tests + +## Objective + +Run all infinite block tests in **serial** to verify that infinite blocking rules work correctly - both firing when they should AND not firing when bypassed with a promise tag. + +## CRITICAL: Sub-Agent Requirement + +**You MUST spawn sub-agents to make all file edits. DO NOT edit the test files yourself.** + +Why sub-agents are required: +1. Sub-agents run in isolated contexts where file changes are detected +2. When a sub-agent completes, the Stop hook **automatically** evaluates rules +3. You (the main agent) observe whether hooks fired - you do NOT manually trigger them +4. If you edit files directly, the hooks won't fire because you're not a completing sub-agent + +**NEVER manually run `echo '{}' | python -m deepwork.hooks.rules_check`** - this defeats the purpose of the test. Hooks must fire AUTOMATICALLY when sub-agents return. 
+ +## CRITICAL: Serial Execution + +**These tests MUST run ONE AT A TIME, with resets between each.** + +Why serial execution is required for infinite block tests: +- Infinite block tests can block indefinitely without a promise tag +- Running them in parallel would cause unpredictable blocking behavior +- Serial execution allows controlled observation of each test + +## Task + +Run all 4 infinite block tests in **serial**, resetting between each, and verify correct blocking behavior. + +### Process + +For EACH test below, follow this cycle: + +1. **Launch a sub-agent** using the Task tool with: + - `model: "haiku"` - Use the fast model to minimize cost and latency + - `max_turns: 5` - **Critical safeguard**: Limits API round-trips to prevent infinite hanging. The Task tool does not support a direct timeout, so max_turns is our only protection against runaway sub-agents. +2. **Wait for the sub-agent to complete** +3. **Observe whether the hook fired automatically** - you should see a blocking prompt or command output +4. **If no visible blocking occurred, check the queue**: + ```bash + ls -la .deepwork/tmp/rules/queue/ + cat .deepwork/tmp/rules/queue/*.json 2>/dev/null + ``` + - If queue entries exist with status "queued", the hook DID fire but blocking wasn't visible + - If queue is empty, the hook did NOT fire at all + - Record the queue status along with the result +5. **Record the result** - see expected outcomes for each test +6. **Reset** (MANDATORY after each test) - follow the reset step instructions: + ```bash + git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml + deepwork rules clear_queue + ``` +7. **Check for early termination**: If **2 tests have now failed**, immediately: + - Stop running any remaining tests + - Report the results summary showing which tests passed/failed + - The job halts here - do NOT proceed with remaining tests +8. 
**Proceed to the next test** (only if fewer than 2 failures) + +**IMPORTANT**: Only launch ONE sub-agent at a time. Wait for it to complete and reset before launching the next. + +### Test Cases (run serially) + +**Test 1: Infinite Block Prompt - Should NOT Fire (with promise)** +- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` +- Expected: Sub-agent completes WITHOUT blocking - the promise tag bypasses the infinite block +- Result: PASS if no blocking, FAIL if blocked + +**Test 2: Infinite Block Command - Should NOT Fire (with promise)** +- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` +- Expected: Sub-agent completes WITHOUT blocking - the promise tag bypasses the infinite block +- Result: PASS if no blocking, FAIL if blocked + +**Test 3: Infinite Block Prompt - Should Fire (no promise)** +- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Do NOT include any promise tags." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` +- Expected behavior: + 1. **Should fire**: Hook fires and shows blocking prompt + 2. **Should return in reasonable time**: Sub-agent hits max_turns limit and returns (not stuck forever) +- Result criteria: + - PASS if: Hook fired (visible block OR queue entry) AND sub-agent returned within reasonable time + - FAIL if: Hook did not fire, OR sub-agent hung indefinitely + +**Test 4: Infinite Block Command - Should Fire (no promise)** +- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Do NOT include any promise tags." 
+- Sub-agent config: `model: "haiku"`, `max_turns: 5` +- Expected behavior: + 1. **Should fire**: Hook fires and command fails (exit code 1) + 2. **Should return in reasonable time**: Sub-agent hits max_turns limit and returns (not stuck forever) +- Result criteria: + - PASS if: Hook fired (visible block OR queue entry) AND sub-agent returned within reasonable time + - FAIL if: Hook did not fire, OR sub-agent hung indefinitely + +### Results Tracking + +Record the result after each test: + +| Test Case | Scenario | Should Fire? | Returned in Time? | Visible Block? | Queue Entry? | Result | +|-----------|----------|:------------:|:-----------------:|:--------------:|:------------:|:------:| +| Infinite Block Prompt | With promise | No | Yes | | | | +| Infinite Block Command | With promise | No | Yes | | | | +| Infinite Block Prompt | No promise | Yes | Yes | | | | +| Infinite Block Command | No promise | Yes | Yes | | | | + +**Result criteria:** +- **"Should NOT fire" tests (with promise)**: PASS if no blocking AND no queue entry AND returned quickly +- **"Should fire" tests (no promise)**: PASS if hook fired (visible block OR queue entry) AND returned in reasonable time (max_turns limit) + +**Queue Entry Status Guide:** +- If queue has entry with status "queued" -> Hook fired, rule was shown to agent +- If queue has entry with status "passed" -> Hook fired, rule was satisfied +- If queue is empty -> Hook did NOT fire + +## Quality Criteria + +- **Sub-agents spawned**: Tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly +- **Correct sub-agent config**: All sub-agents used `model: "haiku"` and `max_turns: 5` +- **Serial execution**: Sub-agents were launched ONE AT A TIME, not in parallel +- **Reset between tests**: Reset step was followed after each test +- **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check - hooks fired AUTOMATICALLY +- **"Should NOT fire" 
tests verified**: Promise tests completed without blocking and no queue entries +- **"Should fire" tests verified**: Non-promise tests fired (visible block OR queue entry) AND returned in reasonable time (not hung indefinitely) +- **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported +- **Results recorded**: Pass/fail status was recorded for each test run +- When all criteria are met, include `Quality Criteria Met` in your response + +## Reference + +See [test_reference.md](test_reference.md) for the complete test matrix and rule descriptions. + +## Context + +This step runs after both the "should NOT fire" and "should fire" test steps. It specifically tests infinite blocking behavior which requires serial execution due to the blocking nature of these rules. + + +### Job Context + +A workflow for running manual tests that validate DeepWork rules/hooks fire correctly. + +This job tests that rules fire when they should AND do not fire when they shouldn't. +Each test is run in a SUB-AGENT (not the main agent) because: +1. Sub-agents run in isolated contexts where file changes can be detected +2. The Stop hook automatically evaluates rules when each sub-agent completes +3. The main agent can observe whether hooks fired without triggering them manually + +CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the file +edits itself - it spawns sub-agents to make edits, then observes whether the hooks +fired automatically when those sub-agents returned. + +Sub-agent configuration: +- All sub-agents should use `model: "haiku"` to minimize cost and latency +- All sub-agents should use `max_turns: 5` to prevent hanging indefinitely + +Steps: +1. reset - Ensure clean environment before testing (clears queue, reverts files) +2. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents (6 tests) +3. 
run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with resets between (6 tests) +4. infinite_block_tests - Run infinite block tests in SERIAL (4 tests - both fire and not-fire) + +Reset procedure (see steps/reset.md): +- Reset runs FIRST to ensure a clean environment before any tests +- Each step also calls reset internally when needed (between tests, after completion) +- Reset reverts git changes, removes created files, and clears the rules queue + +Test types covered: +- Trigger/Safety mode +- Set mode (bidirectional) +- Pair mode (directional) +- Command action +- Multi safety +- Infinite block (prompt and command) - in dedicated step +- Created mode (new files only) + + +## Required Inputs + + +**Files from Previous Steps** - Read these first: +- `fire_results` (from `run_fire_tests`) + +## Work Branch + +Use branch format: `deepwork/manual_tests-[instance]-YYYYMMDD` + +- If on a matching work branch: continue using it +- If on main/master: create new branch with `git checkout -b deepwork/manual_tests-[instance]-$(date +%Y%m%d)` + +## Outputs + +**Required outputs**: +- `infinite_block_results` + +## Quality Validation (Manual) + +**NOTE**: Gemini CLI does not support automated validation. Manually verify criteria before completing. + +**Criteria (all must be satisfied)**: +1. **Sub-Agents Used**: Each test run via Task tool with `model: "haiku"` and `max_turns: 5` +2. **Serial Execution**: Sub-agents launched ONE AT A TIME with reset between each +3. **Promise Tests**: Completed WITHOUT blocking (promise bypassed the rule) +4. **No-Promise Tests**: Hook fired AND sub-agent returned in reasonable time (not hung) +## On Completion + +1. Verify outputs are created +2. Inform user: "Step 4/4 complete, outputs: infinite_block_results" +3. **Workflow complete**: All steps finished. Consider creating a PR to merge the work branch. 
+ +--- + +**Reference files**: `.deepwork/jobs/manual_tests/job.yml`, `.deepwork/jobs/manual_tests/steps/infinite_block_tests.md` +""" \ No newline at end of file diff --git a/.gemini/skills/manual_tests/reset.toml b/.gemini/skills/manual_tests/reset.toml new file mode 100644 index 00000000..692c9f20 --- /dev/null +++ b/.gemini/skills/manual_tests/reset.toml @@ -0,0 +1,128 @@ +# manual_tests:reset +# +# Runs FIRST to ensure clean environment. Also called internally by other steps when they need to revert changes and clear the queue. +# +# Generated by DeepWork - do not edit manually + +description = "Runs FIRST to ensure clean environment. Also called internally by other steps when they need to revert changes and clear the queue." + +prompt = """ +# manual_tests:reset + +**Step 1/4** in **manual_tests** workflow + +> Runs all manual hook/rule tests using sub-agents. Use when validating that DeepWork rules fire correctly. + + +## Instructions + +**Goal**: Runs FIRST to ensure clean environment. Also called internally by other steps when they need to revert changes and clear the queue. + +# Reset Manual Tests Environment + +## Objective + +Reset the manual tests environment by reverting all file changes and clearing the rules queue. + +## Purpose + +This step contains all the reset logic that other steps can call when they need to clean up between or after tests. It ensures consistent cleanup across all test steps. 
+ +## Reset Commands + +Run these commands to reset the environment: + +```bash +git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml +deepwork rules clear_queue +``` + +## Command Explanation + +- `git reset HEAD manual_tests/` - Unstages files from the index (rules_check uses `git add -A` which stages changes) +- `git checkout -- manual_tests/` - Reverts working tree to match HEAD +- `rm -f manual_tests/test_created_mode/new_config.yml` - Removes any new files created during tests (the created mode test creates this file) +- `deepwork rules clear_queue` - Clears the rules queue so rules can fire again (prevents anti-infinite-loop mechanism from blocking subsequent tests) + +## When to Reset + +- **After each serial test**: Reset immediately after observing the result to prevent cross-contamination +- **After parallel tests complete**: Reset once all parallel sub-agents have returned +- **On early termination**: Reset before reporting failure results +- **Before starting a new test step**: Ensure clean state + +## Quality Criteria + +- **All changes reverted**: `git status` shows no changes in `manual_tests/` +- **Queue cleared**: `.deepwork/tmp/rules/queue/` is empty +- **New files removed**: `manual_tests/test_created_mode/new_config.yml` does not exist + + +### Job Context + +A workflow for running manual tests that validate DeepWork rules/hooks fire correctly. + +This job tests that rules fire when they should AND do not fire when they shouldn't. +Each test is run in a SUB-AGENT (not the main agent) because: +1. Sub-agents run in isolated contexts where file changes can be detected +2. The Stop hook automatically evaluates rules when each sub-agent completes +3. The main agent can observe whether hooks fired without triggering them manually + +CRITICAL: All tests MUST run in sub-agents. 
The main agent MUST NOT make the file +edits itself - it spawns sub-agents to make edits, then observes whether the hooks +fired automatically when those sub-agents returned. + +Sub-agent configuration: +- All sub-agents should use `model: "haiku"` to minimize cost and latency +- All sub-agents should use `max_turns: 5` to prevent hanging indefinitely + +Steps: +1. reset - Ensure clean environment before testing (clears queue, reverts files) +2. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents (6 tests) +3. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with resets between (6 tests) +4. infinite_block_tests - Run infinite block tests in SERIAL (4 tests - both fire and not-fire) + +Reset procedure (see steps/reset.md): +- Reset runs FIRST to ensure a clean environment before any tests +- Each step also calls reset internally when needed (between tests, after completion) +- Reset reverts git changes, removes created files, and clears the rules queue + +Test types covered: +- Trigger/Safety mode +- Set mode (bidirectional) +- Pair mode (directional) +- Command action +- Multi safety +- Infinite block (prompt and command) - in dedicated step +- Created mode (new files only) + + + +## Work Branch + +Use branch format: `deepwork/manual_tests-[instance]-YYYYMMDD` + +- If on a matching work branch: continue using it +- If on main/master: create new branch with `git checkout -b deepwork/manual_tests-[instance]-$(date +%Y%m%d)` + +## Outputs + +**Required outputs**: +- `clean_environment` + +## Quality Validation (Manual) + +**NOTE**: Gemini CLI does not support automated validation. Manually verify criteria before completing. + +**Criteria (all must be satisfied)**: +1. **Environment Clean**: Git changes reverted, created files removed, and rules queue cleared +## On Completion + +1. Verify outputs are created +2. Inform user: "Step 1/4 complete, outputs: clean_environment" +3. 
**Tell user next command**: `/manual_tests:run_not_fire_tests` + +--- + +**Reference files**: `.deepwork/jobs/manual_tests/job.yml`, `.deepwork/jobs/manual_tests/steps/reset.md` +""" \ No newline at end of file diff --git a/.gemini/skills/manual_tests/run_fire_tests.toml b/.gemini/skills/manual_tests/run_fire_tests.toml index ba8e07d3..a8ebbc99 100644 --- a/.gemini/skills/manual_tests/run_fire_tests.toml +++ b/.gemini/skills/manual_tests/run_fire_tests.toml @@ -1,15 +1,15 @@ # manual_tests:run_fire_tests # -# Runs all 'should fire' tests serially with git reverts between each. Use after NOT-fire tests to verify rules fire correctly. +# Runs all 6 'should fire' tests serially with resets between each. Use after NOT-fire tests to verify rules fire correctly. # # Generated by DeepWork - do not edit manually -description = "Runs all 'should fire' tests serially with git reverts between each. Use after NOT-fire tests to verify rules fire correctly." +description = "Runs all 6 'should fire' tests serially with resets between each. Use after NOT-fire tests to verify rules fire correctly." prompt = """ # manual_tests:run_fire_tests -**Step 2/2** in **manual_tests** workflow +**Step 3/4** in **manual_tests** workflow > Runs all manual hook/rule tests using sub-agents. Use when validating that DeepWork rules fire correctly. @@ -20,7 +20,7 @@ Before proceeding, confirm these steps are complete: ## Instructions -**Goal**: Runs all 'should fire' tests serially with git reverts between each. Use after NOT-fire tests to verify rules fire correctly. +**Goal**: Runs all 6 'should fire' tests serially with resets between each. Use after NOT-fire tests to verify rules fire correctly. 
# Run Should-Fire Tests @@ -42,23 +42,25 @@ Why sub-agents are required: ## CRITICAL: Serial Execution -**These tests MUST run ONE AT A TIME, with git reverts between each.** +**These tests MUST run ONE AT A TIME, with resets between each.** Why serial execution is required: - These tests edit ONLY the trigger file (not the safety) - If multiple sub-agents run in parallel, sub-agent A's hook will see changes from sub-agent B - This causes cross-contamination: A gets blocked by rules triggered by B's changes -- Run one test, observe the hook, revert, then run the next +- Run one test, observe the hook, reset, then run the next ## Task -Run all 8 "should fire" tests in **serial** sub-agents, reverting between each, and verify that blocking hooks fire automatically. +Run all 6 "should fire" tests in **serial** sub-agents, resetting between each, and verify that blocking hooks fire automatically. ### Process For EACH test below, follow this cycle: -1. **Launch a sub-agent** using the Task tool (use a fast model like haiku) +1. **Launch a sub-agent** using the Task tool with: + - `model: "haiku"` - Use the fast model to minimize cost and latency + - `max_turns: 5` - Prevent sub-agents from hanging indefinitely 2. **Wait for the sub-agent to complete** 3. **Observe whether the hook fired automatically** - you should see a blocking prompt or command output 4. **If no visible blocking occurred, check the queue**: @@ -70,56 +72,50 @@ For EACH test below, follow this cycle: - If queue is empty, the hook did NOT fire at all - Record the queue status along with the result 5. **Record the result** - pass if hook fired (visible block OR queue entry), fail if neither -6. **Revert changes and clear queue** (MANDATORY after each test): +6. 
**Reset** (MANDATORY after each test) - follow the reset step instructions: ```bash git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml - rm -rf .deepwork/tmp/rules/queue/*.json 2>/dev/null || true + deepwork rules clear_queue ``` - **Why this command sequence**: - - `git reset HEAD manual_tests/` - Unstages files from the index (rules_check uses `git add -A` which stages changes) - - `git checkout -- manual_tests/` - Reverts working tree to match HEAD - - `rm -f ...` - Removes any new files created during tests - - The queue clear removes rules that have been shown (status=QUEUED) so they can fire again + See [reset.md](reset.md) for detailed explanation of these commands. 7. **Check for early termination**: If **2 tests have now failed**, immediately: - Stop running any remaining tests - Report the results summary showing which tests passed/failed - The job halts here - do NOT proceed with remaining tests 8. **Proceed to the next test** (only if fewer than 2 failures) -**IMPORTANT**: Only launch ONE sub-agent at a time. Wait for it to complete and revert before launching the next. +**IMPORTANT**: Only launch ONE sub-agent at a time. Wait for it to complete and reset before launching the next. ### Test Cases (run serially) **Test 1: Trigger/Safety** - Sub-agent prompt: "Edit ONLY `manual_tests/test_trigger_safety_mode/feature.py` to add a comment. Do NOT edit the `_doc.md` file." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about updating documentation **Test 2: Set Mode** - Sub-agent prompt: "Edit ONLY `manual_tests/test_set_mode/module_source.py` to add a comment. Do NOT edit the `_test.py` file." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about updating tests **Test 3: Pair Mode** - Sub-agent prompt: "Edit ONLY `manual_tests/test_pair_mode/handler_trigger.py` to add a comment. 
Do NOT edit the `_expected.md` file." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about updating expected output **Test 4: Command Action** - Sub-agent prompt: "Edit `manual_tests/test_command_action/input.txt` to add some text." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Command runs automatically, appending to the log file (this rule always runs, no safety condition) **Test 5: Multi Safety** - Sub-agent prompt: "Edit ONLY `manual_tests/test_multi_safety/core.py` to add a comment. Do NOT edit any of the safety files (`_safety_a.md`, `_safety_b.md`, or `_safety_c.md`)." +- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about updating safety documentation -**Test 6: Infinite Block Prompt** -- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. Do NOT include any promise tags." -- Expected: Hook fires and BLOCKS with infinite prompt - sub-agent cannot complete until promise is provided - -**Test 7: Infinite Block Command** -- Sub-agent prompt: "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Do NOT include any promise tags." -- Expected: Hook fires and command fails - sub-agent cannot complete until promise is provided - -**Test 8: Created Mode** +**Test 6: Created Mode** - Sub-agent prompt: "Create a NEW file `manual_tests/test_created_mode/new_config.yml` with some YAML content. This must be a NEW file, not a modification." 
+- Sub-agent config: `model: "haiku"`, `max_turns: 5` - Expected: Hook fires with prompt about new configuration files ### Results Tracking @@ -133,25 +129,23 @@ Record the result after each test: | Pair Mode | Edit _trigger.py only | | | | | Command Action | Edit .txt | | | | | Multi Safety | Edit .py only | | | | -| Infinite Block Prompt | Edit .py (no promise) | | | | -| Infinite Block Command | Edit .py (no promise) | | | | | Created Mode | Create NEW .yml | | | | **Queue Entry Status Guide:** -- If queue has entry with status "queued" → Hook fired, rule was shown to agent -- If queue has entry with status "passed" → Hook fired, rule was satisfied -- If queue is empty → Hook did NOT fire +- If queue has entry with status "queued" -> Hook fired, rule was shown to agent +- If queue has entry with status "passed" -> Hook fired, rule was satisfied +- If queue is empty -> Hook did NOT fire ## Quality Criteria - **Sub-agents spawned**: Tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly +- **Correct sub-agent config**: All sub-agents used `model: "haiku"` and `max_turns: 5` - **Serial execution**: Sub-agents were launched ONE AT A TIME, not in parallel -- **Git reverted and queue cleared between tests**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` was run after each test -- **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check - hooks fired AUTOMATICALLY -- **Blocking behavior verified**: For each test run, the appropriate blocking hook fired automatically when the sub-agent returned +- **Reset between tests**: Reset step was followed after each test +- **Hooks fired automatically**: The main agent observed the blocking hooks firing automatically when each sub-agent returned - the agent did NOT manually run rules_check - **Early termination on 2 failures**: If 2 tests failed, testing halted 
immediately and results were reported -- **Results recorded**: Pass/fail status was recorded for each test run -- When all criteria are met, include `✓ Quality Criteria Met` in your response +- **Results recorded**: Pass/fail status was recorded for each test case +- When all criteria are met, include `Quality Criteria Met` in your response ## Reference @@ -159,7 +153,7 @@ See [test_reference.md](test_reference.md) for the complete test matrix and rule ## Context -This step runs after the "should NOT fire" tests. These tests verify that rules correctly fire when trigger conditions are met without safety conditions. The serial execution with reverts is essential to prevent cross-contamination between tests. +This step runs after the "should NOT fire" tests. These tests verify that rules correctly fire when trigger conditions are met without safety conditions. The serial execution with resets is essential to prevent cross-contamination between tests. Infinite block tests are handled in a separate step. ### Job Context @@ -176,9 +170,20 @@ CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the fil edits itself - it spawns sub-agents to make edits, then observes whether the hooks fired automatically when those sub-agents returned. +Sub-agent configuration: +- All sub-agents should use `model: "haiku"` to minimize cost and latency +- All sub-agents should use `max_turns: 5` to prevent hanging indefinitely + Steps: -1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents -2. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with reverts between +1. reset - Ensure clean environment before testing (clears queue, reverts files) +2. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents (6 tests) +3. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with resets between (6 tests) +4. 
infinite_block_tests - Run infinite block tests in SERIAL (4 tests - both fire and not-fire) + +Reset procedure (see steps/reset.md): +- Reset runs FIRST to ensure a clean environment before any tests +- Each step also calls reset internally when needed (between tests, after completion) +- Reset reverts git changes, removes created files, and clears the rules queue Test types covered: - Trigger/Safety mode @@ -186,7 +191,7 @@ Test types covered: - Pair mode (directional) - Command action - Multi safety -- Infinite block (prompt and command) +- Infinite block (prompt and command) - in dedicated step - Created mode (new files only) @@ -214,16 +219,17 @@ Use branch format: `deepwork/manual_tests-[instance]-YYYYMMDD` **Criteria (all must be satisfied)**: 1. **Sub-Agents Used**: Did the main agent spawn a sub-agent (using the Task tool) for EACH test? The main agent must NOT edit the test files directly. -2. **Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination? -3. **Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? The agent must NOT manually run the rules_check command. -4. **Git Reverted Between Tests**: Was `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` run after each test to revert files and prevent cross-contamination? -5. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? -6. **Results Recorded**: Did the main agent track pass/fail status for each test case? +2. **Sub-Agent Config**: Did all sub-agents use `model: "haiku"` and `max_turns: 5`? +3. **Serial Execution**: Were sub-agents launched ONE AT A TIME (not in parallel) to prevent cross-contamination? +4. **Hooks Fired Automatically**: Did the main agent observe the blocking hooks firing automatically when each sub-agent returned? 
The agent must NOT manually run the rules_check command. +5. **Reset Between Tests**: Was the reset step called internally after each test to revert files and prevent cross-contamination? +6. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported? +7. **Results Recorded**: Did the main agent track pass/fail status for each test case? ## On Completion 1. Verify outputs are created -2. Inform user: "Step 2/2 complete, outputs: fire_results" -3. **Workflow complete**: All steps finished. Consider creating a PR to merge the work branch. +2. Inform user: "Step 3/4 complete, outputs: fire_results" +3. **Tell user next command**: `/manual_tests:infinite_block_tests` --- diff --git a/.gemini/skills/manual_tests/run_not_fire_tests.toml b/.gemini/skills/manual_tests/run_not_fire_tests.toml index 322b20b8..b23d2fa1 100644 --- a/.gemini/skills/manual_tests/run_not_fire_tests.toml +++ b/.gemini/skills/manual_tests/run_not_fire_tests.toml @@ -1,22 +1,26 @@ # manual_tests:run_not_fire_tests # -# Runs all 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met. +# Runs all 6 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met. # # Generated by DeepWork - do not edit manually -description = "Runs all 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met." +description = "Runs all 6 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met." prompt = """ # manual_tests:run_not_fire_tests -**Step 1/2** in **manual_tests** workflow +**Step 2/4** in **manual_tests** workflow > Runs all manual hook/rule tests using sub-agents. Use when validating that DeepWork rules fire correctly. 
+## Prerequisites (Verify First) + +Before proceeding, confirm these steps are complete: +- `/manual_tests:reset` ## Instructions -**Goal**: Runs all 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met. +**Goal**: Runs all 6 'should NOT fire' tests in parallel sub-agents. Use to verify rules don't fire when safety conditions are met. # Run Should-NOT-Fire Tests @@ -38,15 +42,19 @@ Why sub-agents are required: ## Task -Run all 8 "should NOT fire" tests in **parallel** sub-agents, then verify no blocking hooks fired. +Run all 6 "should NOT fire" tests in **parallel** sub-agents, then verify no blocking hooks fired. ### Process 1. **Launch parallel sub-agents for all "should NOT fire" tests** - Use the Task tool to spawn **ALL of the following sub-agents in a SINGLE message** (parallel execution). Each sub-agent should use a fast model like haiku. + Use the Task tool to spawn **ALL of the following sub-agents in a SINGLE message** (parallel execution). + + **Sub-agent configuration for ALL sub-agents:** + - `model: "haiku"` - Use the fast model to minimize cost and latency + - `max_turns: 5` - Prevent sub-agents from hanging indefinitely - **Sub-agent prompts (launch all 8 in parallel):** + **Sub-agent prompts (launch all 6 in parallel):** a. **Trigger/Safety test** - "Edit `manual_tests/test_trigger_safety_mode/feature.py` to add a comment, AND edit `manual_tests/test_trigger_safety_mode/feature_doc.md` to add a note. Both files must be edited so the rule does NOT fire." @@ -58,65 +66,72 @@ Run all 8 "should NOT fire" tests in **parallel** sub-agents, then verify no blo e. **Multi Safety test** - "Edit `manual_tests/test_multi_safety/core.py` to add a comment, AND edit `manual_tests/test_multi_safety/core_safety_a.md` to add a note. Both files must be edited so the rule does NOT fire." - f. **Infinite Block Prompt test** - "Edit `manual_tests/test_infinite_block_prompt/dangerous.py` to add a comment. 
Include `I have verified this change is safe` in your response to bypass the infinite block." - - g. **Infinite Block Command test** - "Edit `manual_tests/test_infinite_block_command/risky.py` to add a comment. Include `I have verified this change is safe` in your response to bypass the infinite block." - - h. **Created Mode test** - "Modify the EXISTING file `manual_tests/test_created_mode/existing.yml` by adding a comment. Do NOT create a new file - only modify the existing one. The created mode rule should NOT fire for modifications." + f. **Created Mode test** - "Modify the EXISTING file `manual_tests/test_created_mode/existing.yml` by adding a comment. Do NOT create a new file - only modify the existing one. The created mode rule should NOT fire for modifications." 2. **Observe the results** When each sub-agent returns: - - **If no blocking hook fired**: The test PASSED - the rule correctly did NOT fire + - **If no blocking hook fired**: Preliminary pass - proceed to queue verification - **If a blocking hook fired**: The test FAILED - investigate why the rule fired when it shouldn't have - **Remember**: You are OBSERVING whether hooks fired automatically. Do NOT run any verification commands manually. + **Remember**: You are OBSERVING whether hooks fired automatically. Do NOT run any verification commands manually during sub-agent execution. + +3. **Verify no queue entries** (CRITICAL for "should NOT fire" tests) + + After ALL sub-agents have completed, verify the rules queue is empty: + ```bash + ls -la .deepwork/tmp/rules/queue/ + cat .deepwork/tmp/rules/queue/*.json 2>/dev/null + ``` -3. **Record the results and check for early termination** + - **If queue is empty**: All tests PASSED - rules correctly did not fire + - **If queue has entries**: Tests FAILED - rules fired when they shouldn't have. Check which rule fired and investigate. + + This verification is essential because some rules may fire without visible blocking but still create queue entries. 
+
+4. **Record the results and check for early termination**
 
    Track which tests passed and which failed:
 
-   | Test Case | Should NOT Fire | Result |
-   |-----------|:---------------:|:------:|
-   | Trigger/Safety | Edit both files | |
-   | Set Mode | Edit both files | |
-   | Pair Mode (forward) | Edit both files | |
-   | Pair Mode (reverse) | Edit expected only | |
-   | Multi Safety | Edit both files | |
-   | Infinite Block Prompt | Promise tag | |
-   | Infinite Block Command | Promise tag | |
-   | Created Mode | Modify existing | |
+   | Test Case | Should NOT Fire | Visible Block? | Queue Entry? | Result |
+   |-----------|:---------------:|:--------------:|:------------:|:------:|
+   | Trigger/Safety | Edit both files | | | |
+   | Set Mode | Edit both files | | | |
+   | Pair Mode (forward) | Edit both files | | | |
+   | Pair Mode (reverse) | Edit expected only | | | |
+   | Multi Safety | Edit both files | | | |
+   | Created Mode | Modify existing | | | |
+
+   **Result criteria**: PASS only if NO visible block AND NO queue entry. FAIL if either occurred.
 
    **EARLY TERMINATION**: If **2 tests have failed**, immediately:
    1. Stop running any remaining tests
-   2. Revert all changes and clear queue (see step 4)
+   2. Reset (see step 5)
    3. Report the results summary showing which tests passed/failed
    4. Do NOT proceed to the next step - the job halts here
 
-4. **Revert all changes and clear queue**
+5. **Reset** (MANDATORY - call the reset step internally)
 
    **IMPORTANT**: This step is MANDATORY and must run regardless of whether tests passed or failed.
 
-   Run these commands to clean up:
+   Follow the reset step instructions. Run these commands to clean up:
    ```bash
    git reset HEAD manual_tests/ && git checkout -- manual_tests/ && rm -f manual_tests/test_created_mode/new_config.yml
-   rm -rf .deepwork/tmp/rules/queue/*.json 2>/dev/null || true
+   deepwork rules clear_queue
    ```
 
-   **Why this command sequence**:
-   - `git reset HEAD manual_tests/` - Unstages files from the index (rules_check uses `git add -A` which stages changes)
-   - `git checkout -- manual_tests/` - Reverts working tree to match HEAD
-   - `rm -f manual_tests/test_created_mode/new_config.yml` - Removes any new files created during tests
-   - The queue clear removes rules that have been shown (status=QUEUED) so they can fire again
+   See [reset.md](reset.md) for detailed explanation of these commands.
 
 ## Quality Criteria
 
-- **Sub-agents spawned**: All 8 tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly
-- **Parallel execution**: All 8 sub-agents were launched in a single message (parallel)
+- **Sub-agents spawned**: All 6 tests were run using the Task tool to spawn sub-agents - the main agent did NOT edit files directly
+- **Correct sub-agent config**: All sub-agents used `model: "haiku"` and `max_turns: 5`
+- **Parallel execution**: All 6 sub-agents were launched in a single message (parallel)
 - **Hooks observed (not triggered)**: The main agent observed hook behavior without manually running rules_check
+- **Queue verified empty**: After all sub-agents completed, the rules queue was checked and confirmed empty (no queue entries = rules did not fire)
 - **Early termination on 2 failures**: If 2 tests failed, testing halted immediately and results were reported
-- **Changes reverted and queue cleared**: `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json` was run after tests completed (regardless of pass/fail)
-- When all criteria are met, include `✓ Quality Criteria Met` in your response
+- **Reset performed**: Reset step was followed after tests completed (regardless of pass/fail)
+- When all criteria are met, include `Quality Criteria Met` in your response
 
 ## Reference
 
@@ -124,7 +139,7 @@ See [test_reference.md](test_reference.md) for the complete test matrix and rule
 
 ## Context
 
-This step runs first and tests that rules correctly do NOT fire when safety conditions are met. The "should fire" tests run after these complete and the working directory is reverted.
+This step runs after the reset step (which ensures a clean environment) and tests that rules correctly do NOT fire when safety conditions are met. The "should fire" tests run after these complete. Infinite block tests are handled in a separate step.
 
 ### Job Context
 
@@ -141,9 +156,20 @@ CRITICAL: All tests MUST run in sub-agents. The main agent MUST NOT make the fil
 edits itself - it spawns sub-agents to make edits, then observes whether the hooks fired
 automatically when those sub-agents returned.
 
+Sub-agent configuration:
+- All sub-agents should use `model: "haiku"` to minimize cost and latency
+- All sub-agents should use `max_turns: 5` to prevent hanging indefinitely
+
 Steps:
-1. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents
-2. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with reverts between
+1. reset - Ensure clean environment before testing (clears queue, reverts files)
+2. run_not_fire_tests - Run all "should NOT fire" tests in PARALLEL sub-agents (6 tests)
+3. run_fire_tests - Run all "should fire" tests in SERIAL sub-agents with resets between (6 tests)
+4. infinite_block_tests - Run infinite block tests in SERIAL (4 tests - both fire and not-fire)
+
+Reset procedure (see steps/reset.md):
+- Reset runs FIRST to ensure a clean environment before any tests
+- Each step also calls reset internally when needed (between tests, after completion)
+- Reset reverts git changes, removes created files, and clears the rules queue
 
 Test types covered:
 - Trigger/Safety mode
@@ -151,10 +177,15 @@ Test types covered:
 - Pair mode (directional)
 - Command action
 - Multi safety
-- Infinite block (prompt and command)
+- Infinite block (prompt and command) - in dedicated step
 - Created mode (new files only)
 
 
+## Required Inputs
+
+
+**Files from Previous Steps** - Read these first:
+- `clean_environment` (from `reset`)
 
 ## Work Branch
 
@@ -174,14 +205,16 @@ Use branch format: `deepwork/manual_tests-[instance]-YYYYMMDD`
 
 **Criteria (all must be satisfied)**:
 1. **Sub-Agents Used**: Did the main agent spawn sub-agents (using the Task tool) to make the file edits? The main agent must NOT edit the test files directly.
-2. **Parallel Execution**: Were multiple sub-agents launched in parallel (in a single message with multiple Task tool calls)?
-3. **Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command.
-4. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
-5. **Git Reverted**: Were changes reverted and queue cleared after tests completed (or after early termination) using `git reset HEAD manual_tests/ && git checkout -- manual_tests/` and `rm -rf .deepwork/tmp/rules/queue/*.json`?
+2. **Sub-Agent Config**: Did all sub-agents use `model: "haiku"` and `max_turns: 5`?
+3. **Parallel Execution**: Were all 6 sub-agents launched in parallel (in a single message with multiple Task tool calls)?
+4. **Hooks Observed**: Did the main agent observe that no blocking hooks fired when the sub-agents returned? The hooks fire AUTOMATICALLY - the agent must NOT manually run the rules_check command.
+5. **Queue Verified Empty**: After all sub-agents completed, was the rules queue checked and confirmed empty (no entries = rules did not fire)?
+6. **Early Termination**: If 2 tests failed, did testing halt immediately with results reported?
+7. **Reset Performed**: Was the reset step called internally after tests completed (or after early termination)?
 ## On Completion
 
 1. Verify outputs are created
-2. Inform user: "Step 1/2 complete, outputs: not_fire_results"
+2. Inform user: "Step 2/4 complete, outputs: not_fire_results"
 3. **Tell user next command**: `/manual_tests:run_fire_tests`
 
 ---