Durable Execution for Strands Agents — Tracking Ticket
Problem
The agent event loop runs entirely in-process. All state lives in memory — if the process crashes mid-loop, all progress is lost. Durability providers (Temporal, Dapr, AWS Lambda Durable) can only wrap the entire agent("prompt") as one opaque unit, meaning a crash replays everything from scratch, including already-completed model calls and non-idempotent tool calls.
Goal: Enable crash-resilient agent execution where completed model calls and tool calls are cached and never re-executed on recovery.
Design doc: strands-agents/docs#584
Desired User Experience
    # Python
    agent = Agent(
        model=BedrockModel(),
        tools=[search_flights, book_hotel],
        plugins=[TemporalDurabilityPlugin()],
    )

    // TypeScript: Worker (no sandbox)
    const result = await agent.invokeWithCheckpoint(input.prompt, { checkpoint: input.checkpoint })

    // TypeScript: Workflow (pure deterministic loop)
    let result = await runAgentStep({ prompt })
    while (!result.done) {
      result = await runAgentStep({ checkpoint: result.checkpoint })
    }
On crash and replay: completed steps return cached results. Zero re-execution. The loop resumes from the last incomplete step.
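The replay behavior described here can be illustrated with a small in-memory sketch. It does not use Temporal or any provider API: `Journal`, `run_step`, `agent_loop`, and the step keys are hypothetical names, and a plain dict stands in for the provider's durable history.

```python
from typing import Any, Callable, Dict, List, Optional


class Journal:
    """Stand-in for a provider's durable history: maps step keys to results."""

    def __init__(self) -> None:
        self.results: Dict[str, Any] = {}

    def run_step(self, key: str, fn: Callable[[], Any]) -> Any:
        # A completed step returns its recorded result without running again.
        if key in self.results:
            return self.results[key]
        result = fn()
        self.results[key] = result  # recorded only after the step completes
        return result


def agent_loop(
    journal: Journal, calls: List[str], crash_before: Optional[str] = None
) -> List[str]:
    """Run three steps in order; optionally simulate a crash before one of them."""
    outputs = []
    for key in ["model-1", "tool-1", "model-2"]:
        if key == crash_before:
            raise RuntimeError(f"simulated crash before {key}")

        def work(k: str = key) -> str:
            calls.append(k)  # tracks real executions, not cached returns
            return f"result of {k}"

        outputs.append(journal.run_step(key, work))
    return outputs


journal = Journal()
calls: List[str] = []
try:
    agent_loop(journal, calls, crash_before="model-2")
except RuntimeError:
    pass  # the process "crashed"; the journal survives

outputs = agent_loop(journal, calls)  # replay after recovery
```

After the replay, `calls` contains each step key exactly once: the two steps that completed before the crash are served from the journal, and only `model-2` actually runs during recovery.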
Workstreams
WS1: Python SDK — Step Abstraction + Plugin Approach
Durability as a standard Plugin intercepting hook events. Agent code runs via Step.execute().
| # | Task | Depends On | Description |
|---|---|---|---|
| P1 | Step dataclass | None | New file strands/event_loop/step.py. Unit of I/O work (callable + args). |
| P2 | Async hooks at model/tool call sites | None | Switch invoke_callbacks → await invoke_callbacks_async at BeforeModelCallEvent and BeforeToolCallEvent. |
| P3 | Writable step field on hook events | P1 | Add step: Step \| None to BeforeModelCallEvent and BeforeToolCallEvent. |
| P4 | Event loop wiring | P1–P3 | Refactor _handle_model_execution and _handle_tool_execution to create Step → fire hook → call event.step.execute(). Largest core change. All existing tests must pass. |
| P5 | strands-temporal Python package scaffold | None | New package with pyproject.toml, dependency on temporalio + strands-agents. |
| P6 | TemporalDurabilityPlugin | P4, P5 | Plugin with @hook on BeforeModelCallEvent/BeforeToolCallEvent. Replaces event.step with wrapper dispatching to workflow.execute_activity(). |
| P7 | Integration test | P6 | Real Bedrock + real Temporal dev server. Crash simulation (CRASH_AFTER_ACTIVITY=N), verify completed steps not re-executed. |
| P8 | Runnable sample | P6 | Travel planner agent in strands-agents/samples. Includes docker-compose.yml. |
| P9 | User guide | P8 | "Durable Agents with Temporal" page on strandsagents.com. |
| P10 | Update design doc PR #584 | None (Week 1) | Revise to reflect Step/Plugin approach. Remove old Durability ABC proposal. |
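The Step → hook → execute flow from P1 through P4 might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Step` fields, the simplified `BeforeToolCallEvent` stand-in, `CachingPlugin`, and `handle_tool_execution` are hypothetical and are not the SDK's actual classes. The caching wrapper takes the place of a real plugin's dispatch to a durability provider.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """Unit of I/O work: an async callable plus its arguments (hypothetical shape)."""

    fn: Callable[..., Awaitable[Any]]
    args: Tuple[Any, ...] = ()
    kwargs: Dict[str, Any] = field(default_factory=dict)

    async def execute(self) -> Any:
        return await self.fn(*self.args, **self.kwargs)


@dataclass
class BeforeToolCallEvent:
    """Simplified stand-in for the hook event, with a writable step field."""

    tool_name: str
    step: Optional[Step] = None


class CachingPlugin:
    """Stand-in for a durability plugin: replaces the step with a wrapper."""

    def __init__(self) -> None:
        self.cache: Dict[str, Any] = {}

    async def on_before_tool_call(self, event: BeforeToolCallEvent) -> None:
        original = event.step
        assert original is not None

        async def durable() -> Any:
            # A real plugin would dispatch to the provider here.
            # Keying by tool name alone is a simplification for this sketch.
            if event.tool_name not in self.cache:
                self.cache[event.tool_name] = await original.execute()
            return self.cache[event.tool_name]

        event.step = Step(fn=durable)


async def handle_tool_execution(
    plugin: CachingPlugin, tool_name: str, tool: Callable[..., Awaitable[Any]], *args: Any
) -> Any:
    # Create Step -> fire hook -> call event.step.execute().
    event = BeforeToolCallEvent(tool_name=tool_name, step=Step(fn=tool, args=args))
    await plugin.on_before_tool_call(event)
    assert event.step is not None
    return await event.step.execute()


executions: List[str] = []


async def book_hotel(city: str) -> str:
    executions.append(city)  # side effect that must not repeat
    return f"booked hotel in {city}"


async def main() -> Tuple[str, str]:
    plugin = CachingPlugin()
    first = await handle_tool_execution(plugin, "book_hotel", book_hotel, "Lisbon")
    second = await handle_tool_execution(plugin, "book_hotel", book_hotel, "Lisbon")
    return first, second


first, second = asyncio.run(main())
```

The event loop only ever calls `event.step.execute()`, so it does not need to know whether a plugin replaced the step. With no plugin registered, the original step runs unchanged, which is consistent with the additive, non-breaking intent of P3.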
WS2: TypeScript SDK — invokeWithCheckpoint() + Temporal Package
Rejects plugin-hook approach for Temporal (sandbox constraints). Checkpoint tokens returned from a new additive Agent method. Agent code runs in Temporal activities only.
| # | Task | Depends On | Description |
|---|---|---|---|
| T1 | Checkpoint / CheckpointResult types | None | New file src/agent/checkpoint.ts. Per-tool granularity via nextToolIndex. |
| T2 | invokeWithCheckpoint() on Agent | T1 | New method. One unit of I/O work per call. Deferred message append. |
| T3 | Hook integration | T2 | Existing hooks (BeforeModelCallEvent, etc.) still fire in checkpoint mode. |
| T4 | strands-temporal TS package scaffold | None | Separate package, no temporalio dep in core SDK. |
| T5 | runAgentStep activity + per-run-ID registry | T2, T4 | Fix prototype's singleton agent. Key agents by workflow run ID. |
| T6 | durableAgentWorkflow | T5 | Pure deterministic loop passing checkpoint tokens. |
| T7 | StrandsWorker helper | T5 | Registers activity + workflow. Users bring tools; worker resolves by name. |
| T8 | Integration test + crash simulation | T6 | Per-tool granularity: 3 tools = 3 ActivityTaskCompleted entries. Crash after tool 2 → only tool 3 re-runs. Concurrent workflow isolation. |
| T9 | Migrate prototype to examples/temporal/ | T8 | Clean up, update for nextToolIndex, add README + docker-compose.yml. |
| T10 | User guide | T9 | "Durable Agents with Temporal" page. Architecture diagram, quick start, limitations. |
| T11 | MCP support in durable context | T5 | See details below. |
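The per-tool cursor behind T1, T2, and T8 can be modeled in a few lines. The sketch is written in Python to match the other examples in this ticket, not in TypeScript, and it is not the planned implementation: the `Checkpoint` fields, `invoke_with_checkpoint`, and the three tool names are hypothetical. It also simplifies by treating the run as done after the last tool, where a real loop would make another model call.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical token: pending tool calls plus a cursor into them."""

    pending_tools: Tuple[str, ...] = ()
    next_tool_index: int = 0
    tool_results: Tuple[str, ...] = ()


@dataclass(frozen=True)
class CheckpointResult:
    done: bool
    checkpoint: Checkpoint


def invoke_with_checkpoint(
    checkpoint: Optional[Checkpoint],
    tools: Dict[str, Callable[[], str]],
    executed: List[str],
) -> CheckpointResult:
    """Perform exactly one unit of I/O work and return an updated token."""
    if checkpoint is None:
        # Faked model call: the "model" requests every registered tool.
        executed.append("model")
        return CheckpointResult(False, Checkpoint(pending_tools=tuple(tools)))

    index = checkpoint.next_tool_index
    name = checkpoint.pending_tools[index]
    executed.append(name)
    result = tools[name]()
    updated = replace(
        checkpoint,
        next_tool_index=index + 1,
        tool_results=checkpoint.tool_results + (result,),
    )
    # Simplification: finished once every pending tool has run.
    done = updated.next_tool_index == len(updated.pending_tools)
    return CheckpointResult(done, updated)


tools = {
    "search_flights": lambda: "flights found",
    "book_hotel": lambda: "hotel booked",
    "send_itinerary": lambda: "itinerary sent",
}

executed: List[str] = []
result = invoke_with_checkpoint(None, tools, executed)               # model call
result = invoke_with_checkpoint(result.checkpoint, tools, executed)  # tool 1
result = invoke_with_checkpoint(result.checkpoint, tools, executed)  # tool 2
saved = result.checkpoint  # last token recorded before the simulated crash

executed_after_recovery: List[str] = []
result = invoke_with_checkpoint(saved, tools, executed_after_recovery)  # tool 3 only
```

Because the token carries the cursor, resuming from `saved` runs only the third tool, which is the behavior the T8 crash test is meant to verify.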
WS3: External Outreach (post-implementation)
- AWS Lambda Durable team: File GitHub issue on aws/aws-durable-execution-sdk-python requesting async handler/step support. Blocked on sync-only Python SDK.
- Temporal team: Share working integration + sample after strands-temporal is published.
Task 11: MCP Support in Durable Context
The previous "MCP limitation — cannot cross activity boundary" was wrong. MCP server config crosses the boundary. MCP clients are reconstructed inside activities. Remote MCP servers work.
Corrected MCP Gap Analysis
| Gap | Reality |
|---|---|
| MCP cannot cross activity boundary | MCP server config crosses the boundary. MCP clients are reconstructed inside activities. Remote MCP servers work. |
| Stdio MCP servers | Not durable: subprocess dies with worker process. Documented limitation. |
| Long-running workflow connection timeouts | Needs reconnect-on-failure in McpClient. Separate issue. |
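Subtasks
- Add mcpServers?: McpServerConfig[] to run_config
- Update getOrCreate() to construct McpClient instances from config
- Verify agent.initialize() correctly connects and lists tools inside activity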
Bottom Line
Remote MCP works. Stdio MCP doesn't. The architecture doesn't need to change — just the run_config shape and getOrCreate() implementation.
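As a rough illustration of why only the config shape and the registry need to change, the sketch below keys agent entries by workflow run ID and rebuilds MCP clients from serializable config inside the activity. It is written in Python for consistency with the other examples here, and every name is hypothetical: `McpServerConfig` fields, `RunConfig`, `FakeMcpClient`, `AgentRegistry`, the model ID string, and the example URL are assumptions, not the package's actual types.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class McpServerConfig:
    """Serializable description of a remote MCP server (hypothetical fields)."""

    name: str
    url: str


@dataclass(frozen=True)
class RunConfig:
    """Plain data that can cross the activity boundary."""

    model_id: str
    mcp_servers: Tuple[McpServerConfig, ...] = ()


class FakeMcpClient:
    """Stand-in for an MCP client; a real one would open a network connection."""

    def __init__(self, config: McpServerConfig) -> None:
        self.config = config


@dataclass
class AgentEntry:
    run_config: RunConfig
    mcp_clients: List[FakeMcpClient]


class AgentRegistry:
    """Keeps one entry per workflow run ID instead of a process-wide singleton."""

    def __init__(self) -> None:
        self._entries: Dict[str, AgentEntry] = {}

    def get_or_create(self, run_id: str, run_config: RunConfig) -> AgentEntry:
        entry = self._entries.get(run_id)
        if entry is None:
            # Clients are rebuilt from config; live connections are never serialized.
            clients = [FakeMcpClient(cfg) for cfg in run_config.mcp_servers]
            entry = AgentEntry(run_config=run_config, mcp_clients=clients)
            self._entries[run_id] = entry
        return entry


registry = AgentRegistry()
config = RunConfig(
    model_id="example-model",
    mcp_servers=(McpServerConfig(name="docs", url="https://mcp.example.com"),),
)

entry_a = registry.get_or_create("run-a", config)
entry_a_again = registry.get_or_create("run-a", config)
entry_b = registry.get_or_create("run-b", config)
```

Repeated activity invocations for the same run reuse one entry, while concurrent workflows get separate entries and separate clients. A stdio server would not fit this pattern, since its subprocess cannot be rebuilt from config on a different worker with its prior state intact.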
Key Design Decisions
| Decision | Python | TypeScript |
|---|---|---|
| Durability mechanism | Plugin intercepting hook events | Checkpoint tokens from invokeWithCheckpoint() |
| Agent code location | Runs via Step.execute() | Runs in Temporal activities only (never in sandbox) |
| Checkpoint granularity | Per I/O call (one model call OR one tool call) | Per I/O call via nextToolIndex cursor |
| Breaking changes | None: additive step field on events | None: invoke() / stream() unchanged |
| MCP support | Follow-up | Task 11: remote MCP works, stdio doesn't |
Known Gaps
| Gap | Impact | Status |
|---|---|---|
| Model objects hold clients that can't serialize | Activity must reconstruct model from config | By design |
| AgentState replay may not be deterministic | Tools writing to agent.state may diverge | Snapshot session manager addresses this |
| Human-in-the-loop | Strands Interrupt ≠ Temporal Signal | Documented limitation |
| Streaming during replay | No token stream on cached steps (UX gap, not correctness) | Document; use AfterInvocationEvent for final-result callbacks |
| Stdio MCP servers | Subprocess dies with worker process | Documented limitation; remote MCP works |
| Long-running MCP connection timeouts | Needs reconnect-on-failure in McpClient | Separate issue |
Out of Scope (Follow-up)
- Full state machine refactor with explicit CycleState enum / transition table (Step is the first building block)
- strands-dapr package (same plugin pattern, lower priority)
- strands-aws Lambda Durable package (blocked on async SDK support)
- AgentCore runtime validation
- Streaming callback behavior during replay (UX documentation only)
- McpClient reconnect-on-failure for long-running workflows
Timeline
5 weeks per workstream (can run in parallel):
| Week | Python SDK | TypeScript SDK |
|---|---|---|
| Week 1 | P10: Design doc PR #584 merged | T1–T2: Checkpoint types + invokeWithCheckpoint() |
| Week 2 | P1–P4: Step + hooks + event loop wiring | T3: Hook integration verified |
| Week 3 | P5–P7: strands-temporal + integration test | T4–T6: strands-temporal package + activity + workflow |
| Week 4 | P8–P9: Sample + user guide | T7–T8: Worker helper + crash recovery test |
| Week 5 | Buffer: review, polish, outreach | T9–T11: Example + user guide + MCP support |