Skip to content

refactor: domain-driven prompt-response correlation via ExecutionEvent#794

Open
BlueHotDog wants to merge 8 commits intomainfrom
issue-767-test-add-integration-test
Open

refactor: domain-driven prompt-response correlation via ExecutionEvent#794
BlueHotDog wants to merge 8 commits intomainfrom
issue-767-test-add-integration-test

Conversation

@BlueHotDog
Copy link
Copy Markdown
Collaborator

@BlueHotDog BlueHotDog commented Apr 6, 2026

Summary

  • Introduce ExecutionEvent domain struct as the ACL boundary between SwarmAi infrastructure events and Frontman domain events
  • Thread interaction_id (causation) through execution metadata so completion events carry context back to the channel
  • Replace fragile pending_prompt_id scalar with a correlated pending_prompt map (interaction_id + jsonrpc_id)
  • Reject concurrent prompts entirely — no message persisted when execution is already running
  • Extract add_user_message as a domain primitive (persist without execution), compose submit_user_message from it
  • Rename Execution.handle_swarm_event/3Execution.classify_event/1 accepting %ExecutionEvent{} directly

Motivation

Devin review flagged a bug: when a second session/prompt arrives while an agent is running, pending_prompt_id gets overwritten. The first prompt's JSON-RPC response is lost. Root cause: the transport layer was manually mirroring domain execution state via a bare scalar.

Rather than patching the symptom, this PR fixes the architecture: the domain now carries its own correlation context through the execution lifecycle, and the transport maps domain IDs to protocol IDs.

Test plan

  • 825 tests pass, 0 failures
  • Updated execution_test.exs: {:ok, :already_running}{:error, :already_running}, verify rejected message is not persisted
  • Updated task_channel_test.exs: all event builders use ExecutionEvent, pending_prompt_idpending_prompt
  • Updated error_propagation_test.exs, execution_sentry_test.exs: {:swarm_event, ...}{:execution_event, %ExecutionEvent{...}}
  • Updated tasks_channel_test.exs: use add_user_message for history-only tests
  • Removed "cancel does not interfere with subsequent prompts" test (premise invalid with reject-entirely semantics)

🤖 Generated with Claude Code

devin-ai-integration[bot]

This comment was marked as resolved.

@BlueHotDog BlueHotDog force-pushed the issue-767-test-add-integration-test branch from 2c89b4d to c031117 Compare April 6, 2026 20:07
devin-ai-integration[bot]

This comment was marked as resolved.

@BlueHotDog BlueHotDog changed the title test: add integration test for concurrent execution prevention refactor: domain-driven prompt-response correlation via ExecutionEvent Apr 8, 2026
BlueHotDog and others added 5 commits April 8, 2026 15:00
Surface :already_running from submit_user_message when an agent is
already executing for the task, and add an integration test that
exercises the guard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Skip title generation and log accurately when submit_user_message
returns {:ok, :already_running} instead of falling through to the
catch-all {:ok, _interaction} clause.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- execution_test: use chained setup matching main's pattern
- tasks_test: use add_agent_response for sequence test to avoid
  already_running from blocking agent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…correlation

Replace the fragile `pending_prompt_id` scalar in TaskChannel with a
correlated `pending_prompt` map that ties JSON-RPC request IDs to
domain interaction IDs.

**Domain changes:**
- Add `ExecutionEvent` struct as the ACL boundary between SwarmAi
  infrastructure events and Frontman domain events. Carries `caused_by`
  (the interaction_id of the triggering UserMessage).
- Thread `interaction_id` through execution metadata so completion
  events carry causation context back to the channel.
- `submit_user_message` now rejects entirely when an execution is
  already running — no message persisted, immediate error response.
- Extract `add_user_message` as a domain primitive for persisting
  messages without starting execution.
- SwarmDispatcher wraps raw swarm events into `ExecutionEvent` before
  PubSub broadcast.
- Rename `Execution.handle_swarm_event/3` to `classify_event/1`,
  accepting `%ExecutionEvent{}` directly.

**Transport changes:**
- Replace `pending_prompt_id` with `pending_prompt` map containing
  both `interaction_id` and `jsonrpc_id`.
- Channel receives `{:execution_event, %ExecutionEvent{}}` instead
  of `{:swarm_event, {type, payload}}`.
- `resolve_pending_prompt` verifies causation match via `caused_by`.

Fixes the bug where a second prompt during execution overwrote the
pending prompt ID, causing the first prompt's response to be lost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 'crashed agent' and 'failed agent' e2e tests (added in #798) still
asserted on the raw {:swarm_event, ...} tuple. After the ExecutionEvent
refactor, SwarmDispatcher now broadcasts {:execution_event, %ExecutionEvent{}}
instead. Update both assert_receive calls to match the new domain event format.
@BlueHotDog BlueHotDog force-pushed the issue-767-test-add-integration-test branch from 3022668 to 0c630f9 Compare April 8, 2026 15:41
devin-ai-integration[bot]

This comment was marked as resolved.

@BlueHotDog
Copy link
Copy Markdown
Collaborator Author

Fixed caused_by propagation in handle_transient_error — passing event.caused_by through as a required 3rd arg so the exhausted-retry path calls finalize_turn/3 with the correct causation context, enabling the mismatch check in resolve_pending_prompt.

BlueHotDog and others added 3 commits April 8, 2026 18:19
When retries exhaust, finalize_turn was called without caused_by,
silently skipping the causation mismatch check in resolve_pending_prompt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng pending prompt

Add test verifying that when all retries are exhausted the pending
JSON-RPC prompt is resolved with an error response, covering the
caused_by propagation path through handle_transient_error.

Also remove the caused_by \\ nil default from finalize_turn/3 — the
two callsites without an execution event now pass nil explicitly,
making the arity mandatory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
These functions operate solely on ExecutionEvent data — moving them there
eliminates Feature Envy and keeps domain semantics co-located with the type.

Callers (TaskChannel, SwarmDispatcher) now call ExecutionEvent.classify/1
and ExecutionEvent.classify_error/1 directly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant