[FEATURE] Durable Steps/event loop execution (Epic) #757

@JackYPCOnline

Durable Execution for Strands Agents — Tracking Ticket

Problem

The agent event loop runs entirely in-process. All state lives in memory — if the process crashes mid-loop, all progress is lost. Durability providers (Temporal, Dapr, AWS Lambda Durable) can only wrap the entire agent("prompt") as one opaque unit, meaning a crash replays everything from scratch, including already-completed model calls and non-idempotent tool calls.

Goal: Enable crash-resilient agent execution where completed model calls and tool calls are cached and never re-executed on recovery.

Design doc: strands-agents/docs#584


Desired User Experience

# Python
from strands import Agent
from strands.models import BedrockModel

# TemporalDurabilityPlugin comes from the proposed strands-temporal package (P5, P6).
# search_flights and book_hotel are user-defined tools.
agent = Agent(
    model=BedrockModel(),
    tools=[search_flights, book_hotel],
    plugins=[TemporalDurabilityPlugin()],
)

// TypeScript: Worker (no sandbox)
const result = await agent.invokeWithCheckpoint(input.prompt, { checkpoint: input.checkpoint })

// TypeScript: Workflow (pure deterministic loop)
let result = await runAgentStep({ prompt })
while (!result.done) {
  result = await runAgentStep({ checkpoint: result.checkpoint })
}

On crash and replay: completed steps return cached results. Zero re-execution. The loop resumes from the last incomplete step.
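
The replay behavior described above can be illustrated with a small in-memory sketch. It is not the Strands or Temporal implementation: `StepJournal` and `run_loop` are hypothetical names, and a plain dict stands in for the durability provider's persisted event history.

```python
from typing import Any, Callable, Optional


class StepJournal:
    """Hypothetical in-memory stand-in for a durability provider's event history."""

    def __init__(self) -> None:
        self._results: dict = {}
        self.executions = 0  # counts real (non-cached) executions

    def run(self, step_index: int, fn: Callable[[], Any]) -> Any:
        # A completed step returns its recorded result without re-executing.
        if step_index in self._results:
            return self._results[step_index]
        result = fn()
        self.executions += 1
        self._results[step_index] = result
        return result


def run_loop(journal: StepJournal, crash_after: Optional[int] = None) -> list:
    """Run three steps in order; raise once `crash_after` steps have completed."""
    steps = [
        lambda: "model: plan trip",
        lambda: "tool: search_flights",
        lambda: "tool: book_hotel",
    ]
    outputs = []
    for i, fn in enumerate(steps):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("simulated crash")
        outputs.append(journal.run(i, fn))
    return outputs


journal = StepJournal()
try:
    run_loop(journal, crash_after=2)  # steps 0 and 1 complete, then the "process" dies
except RuntimeError:
    pass

replayed = run_loop(journal)  # replay: steps 0 and 1 are served from the journal
```

After the replay, `journal.executions` is 3 rather than 5: the two completed steps are never executed a second time, and only the incomplete third step runs.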


Workstreams

WS1: Python SDK — Step Abstraction + Plugin Approach

Durability as a standard Plugin intercepting hook events. Agent code runs via Step.execute().

| # | Task | Depends On | Description |
|---|---|---|---|
| P1 | Step dataclass | | New file strands/event_loop/step.py. Unit of I/O work (callable + args). |
| P2 | Async hooks at model/tool call sites | | Switch invoke_callbacks → await invoke_callbacks_async at BeforeModelCallEvent and BeforeToolCallEvent. |
| P3 | Writable step field on hook events | P1 | Add step: Step \| None to BeforeModelCallEvent and BeforeToolCallEvent. |
| P4 | Event loop wiring | P1–P3 | Refactor _handle_model_execution and _handle_tool_execution to create Step → fire hook → call event.step.execute(). Largest core change. All existing tests must pass. |
| P5 | strands-temporal Python package scaffold | | New package with pyproject.toml, dependency on temporalio + strands-agents. |
| P6 | TemporalDurabilityPlugin | P4, P5 | Plugin with @hook on BeforeModelCallEvent/BeforeToolCallEvent. Replaces event.step with wrapper dispatching to workflow.execute_activity(). |
| P7 | Integration test | P6 | Real Bedrock + real Temporal dev server. Crash simulation (CRASH_AFTER_ACTIVITY=N), verify completed steps not re-executed. |
| P8 | Runnable sample | P6 | Travel planner agent in strands-agents/samples. Includes docker-compose.yml. |
| P9 | User guide | P8 | "Durable Agents with Temporal" page on strandsagents.com. |
| P10 | Update design doc PR #584 | None (Week 1) | Revise to reflect Step/Plugin approach. Remove old Durability ABC proposal. |
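
A rough sketch of how P1, P3, P4, and P6 could fit together is shown here. Everything in it is an assumption for illustration: the `Step` fields, the simplified `BeforeToolCallEvent` stand-in, and `CachingPlugin`, which uses a local dict where the real plugin would dispatch to `workflow.execute_activity()`.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, Optional


@dataclass
class Step:
    """Unit of I/O work: a callable plus its arguments (P1). Field names are assumed."""
    fn: Callable[..., Awaitable[Any]]
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)

    async def execute(self) -> Any:
        return await self.fn(*self.args, **self.kwargs)


@dataclass
class BeforeToolCallEvent:
    """Simplified stand-in for the real hook event; only the writable step field (P3) is modeled."""
    tool_name: str
    step: Optional[Step] = None


class CachingPlugin:
    """Hypothetical plugin: replaces event.step with a wrapper that caches the result.

    The cache is keyed by tool name only, which is enough for this sketch but
    too coarse for real use.
    """

    def __init__(self) -> None:
        self.cache: dict = {}

    async def on_before_tool_call(self, event: BeforeToolCallEvent) -> None:
        original = event.step

        async def durable() -> Any:
            if event.tool_name not in self.cache:
                self.cache[event.tool_name] = await original.execute()
            return self.cache[event.tool_name]

        event.step = Step(fn=durable)


calls = []


async def book_hotel(city: str) -> str:
    calls.append(city)  # record each real execution
    return f"booked hotel in {city}"


async def handle_tool_execution(plugin: CachingPlugin) -> str:
    # Create Step -> fire hook -> call event.step.execute() (P4).
    event = BeforeToolCallEvent("book_hotel", Step(book_hotel, ("Lisbon",)))
    await plugin.on_before_tool_call(event)
    return await event.step.execute()


plugin = CachingPlugin()
first = asyncio.run(handle_tool_execution(plugin))
second = asyncio.run(handle_tool_execution(plugin))  # served from the plugin's cache
```

Because the event loop only ever calls `event.step.execute()`, the plugin can change how the work is executed without the loop knowing. Here `book_hotel` runs once even though the tool path is driven twice.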

WS2: TypeScript SDK — invokeWithCheckpoint() + Temporal Package

Rejects plugin-hook approach for Temporal (sandbox constraints). Checkpoint tokens returned from a new additive Agent method. Agent code runs in Temporal activities only.

| # | Task | Depends On | Description |
|---|---|---|---|
| T1 | Checkpoint / CheckpointResult types | | New file src/agent/checkpoint.ts. Per-tool granularity via nextToolIndex. |
| T2 | invokeWithCheckpoint() on Agent | T1 | New method. One unit of I/O work per call. Deferred message append. |
| T3 | Hook integration | T2 | Existing hooks (BeforeModelCallEvent, etc.) still fire in checkpoint mode. |
| T4 | strands-temporal TS package scaffold | | Separate package, no temporalio dep in core SDK. |
| T5 | runAgentStep activity + per-run-ID registry | T2, T4 | Fix prototype's singleton agent. Key agents by workflow run ID. |
| T6 | durableAgentWorkflow | T5 | Pure deterministic loop passing checkpoint tokens. |
| T7 | StrandsWorker helper | T5 | Registers activity + workflow. Users bring tools; worker resolves by name. |
| T8 | Integration test + crash simulation | T6 | Per-tool granularity: 3 tools = 3 ActivityTaskCompleted entries. Crash after tool 2 → only tool 3 re-runs. Concurrent workflow isolation. |
| T9 | Migrate prototype to examples/temporal/ | T8 | Clean up, update for nextToolIndex, add README + docker-compose.yml. |
| T10 | User guide | T9 | "Durable Agents with Temporal" page. Architecture diagram, quick start, limitations. |
| T11 | MCP support in durable context | T5 | See details below. |
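
The checkpoint cursor in T1, T2, and T8 can be sketched in a language-neutral way. The example below is written in Python for consistency with the other examples in this ticket, not TypeScript. The field names, the fake "model call" that requests every tool, and the `send_itinerary` tool are assumptions, and `next_tool_index` mirrors `nextToolIndex`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    pending_tools: tuple = ()  # tool names requested by the last model call
    next_tool_index: int = 0   # cursor: first tool that has not run yet
    results: tuple = ()        # results of the tools completed so far


@dataclass(frozen=True)
class CheckpointResult:
    done: bool
    checkpoint: Checkpoint


def invoke_with_checkpoint(tools: dict, checkpoint: Optional[Checkpoint] = None) -> CheckpointResult:
    """Perform exactly one unit of I/O work and return the updated checkpoint."""
    if checkpoint is None:
        # First call: a fake "model call" that simply requests every registered tool.
        return CheckpointResult(done=not tools, checkpoint=Checkpoint(pending_tools=tuple(tools)))
    i = checkpoint.next_tool_index
    result = tools[checkpoint.pending_tools[i]]()  # run one tool only
    cp = replace(checkpoint, next_tool_index=i + 1, results=checkpoint.results + (result,))
    return CheckpointResult(done=cp.next_tool_index == len(cp.pending_tools), checkpoint=cp)


executed = []


def make_tool(name: str):
    def tool() -> str:
        executed.append(name)  # record each real execution
        return f"{name}: ok"
    return tool


tools = {n: make_tool(n) for n in ("search_flights", "book_hotel", "send_itinerary")}

result = invoke_with_checkpoint(tools)                     # model call
result = invoke_with_checkpoint(tools, result.checkpoint)  # tool 1
result = invoke_with_checkpoint(tools, result.checkpoint)  # tool 2
saved = result.checkpoint                                  # last token before the "crash"

executed.clear()                                           # simulate a fresh worker
result = CheckpointResult(done=False, checkpoint=saved)
while not result.done:                                     # same loop shape as the workflow
    result = invoke_with_checkpoint(tools, result.checkpoint)
```

Resuming from the checkpoint saved after tool 2 runs only the third tool, which is the behavior the T8 crash simulation is meant to verify.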

WS3: External Outreach (post-implementation)

  • AWS Lambda Durable team — File GitHub issue on aws/aws-durable-execution-sdk-python requesting async handler/step support. Blocked on sync-only Python SDK.
  • Temporal team — Share working integration + sample after strands-temporal is published.

Task 11: MCP Support in Durable Context

The previous "MCP limitation — cannot cross activity boundary" was wrong. MCP server config crosses the boundary. MCP clients are reconstructed inside activities. Remote MCP servers work.

Corrected MCP Gap Analysis

| Gap | Reality |
|---|---|
| MCP cannot cross activity boundary | MCP server config crosses the boundary. MCP clients are reconstructed inside activities. Remote MCP servers work. |
| Stdio MCP servers | Not durable: the subprocess dies with the worker process. Documented limitation. |
| Long-running workflow connection timeouts | Needs reconnect-on-failure in McpClient. Separate issue. |

Subtasks

  • Add mcpServers?: McpServerConfig[] to run_config
  • Update getOrCreate() to construct McpClient instances from config
  • Verify agent.initialize() correctly connects and lists tools inside activity
  • Integration test: MCP tool call cached across crash, remote MCP server not re-called
  • Document stdio MCP limitation explicitly
  • Document connection timeout handling requirement for long-running workflows

Bottom Line

Remote MCP works. Stdio MCP doesn't. The architecture doesn't need to change — just the run_config shape and getOrCreate() implementation.


Key Design Decisions

| Decision | Python | TypeScript |
|---|---|---|
| Durability mechanism | Plugin intercepting hook events | Checkpoint tokens from invokeWithCheckpoint() |
| Agent code location | Runs via Step.execute() | Runs in Temporal activities only (never in sandbox) |
| Checkpoint granularity | Per I/O call (one model call OR one tool call) | Per I/O call via nextToolIndex cursor |
| Breaking changes | None (additive step field on events) | None (invoke() / stream() unchanged) |
| MCP support | Follow-up | Task 11: remote MCP works, stdio doesn't |

Known Gaps

| Gap | Impact | Status |
|---|---|---|
| Model objects hold clients that can't serialize | Activity must reconstruct model from config | By design |
| AgentState replay may not be deterministic | Tools writing to agent.state may diverge | Snapshot session manager addresses this |
| Human-in-the-loop | Strands Interrupt ≠ Temporal Signal | Documented limitation |
| Streaming during replay | No token stream on cached steps (UX gap, not correctness) | Document; use AfterInvocationEvent for final-result callbacks |
| Stdio MCP servers | Subprocess dies with worker process | Documented limitation; remote MCP works |
| Long-running MCP connection timeouts | Needs reconnect-on-failure in McpClient | Separate issue |

Out of Scope (Follow-up)

  • Full state machine refactor with explicit CycleState enum / transition table (Step is the first building block)
  • strands-dapr package (same plugin pattern, lower priority)
  • strands-aws Lambda Durable package (blocked on async SDK support)
  • AgentCore runtime validation
  • Streaming callback behavior during replay (UX documentation only)
  • McpClient reconnect-on-failure for long-running workflows

Timeline

5 weeks per workstream (can run in parallel):

| Week | Python SDK | TypeScript SDK |
|---|---|---|
| Week 1 | P10: Design doc PR #584 merged | T1–T2: Checkpoint types + invokeWithCheckpoint() |
| Week 2 | P1–P4: Step + hooks + event loop wiring | T3: Hook integration verified |
| Week 3 | P5–P7: strands-temporal + integration test | T4–T6: strands-temporal package + activity + workflow |
| Week 4 | P8–P9: Sample + user guide | T7–T8: Worker helper + crash recovery test |
| Week 5 | Buffer: review, polish, outreach | T9–T11: Example + user guide + MCP support |
