Plan: Resilient Hot-Reload Recovery for Decopilot Runs
Context
When the dev server hot-reloads (bun --hot) mid-stream, the Claude Code agent child process is killed, the in-memory RunRegistry is wiped, and the NATS JetStream buffer (memory-only) is lost. But the thread stays in_progress in the DB because stopAll() fires FORCE_FAIL as fire-and-forget (async reactor may not complete before the process is replaced).
The frontend detects isRunInProgress, tries /attach which returns 204 (no run in registry), retries 3 times, then gives up. The user sees "No response was generated" + "Run in progress" stuck forever — the only escape is manually clicking cancel.
Answer to the question: No, the agent is NOT still running after hot reload. The child process is killed. But the Claude Code SDK stores conversation history on disk (~/.claude/projects/), so thread context survives restarts. We can leverage this by sending a "continue" message with context.
Approach: Detect Ghost + Auto-Continue
- Server startup: Sweep the DB for ghost threads (`in_progress` with no run in the registry) and mark them as interrupted (status `failed`)
- Frontend: When a ghost is detected, replace "No response was generated" with a "Continue" button that sends a contextual resume message
Changes
1. Add listByStatus() to thread storage
File: apps/mesh/src/storage/threads.ts + apps/mesh/src/storage/ports.ts
Add method to find all ghost threads on startup:
listByStatus(status: string): Promise<Array<{ id: string; organization_id: string }>>
// SELECT id, organization_id FROM thread WHERE status = $1
2. Server startup ghost-run sweep
File: apps/mesh/src/api/app.ts (~line 318, after RunRegistry creation)
After creating the RunRegistry, run an async sweep:
// Fire-and-forget: clean up any threads left in_progress from the previous process
threadStorage.listByStatus("in_progress").then(async (ghosts) => {
  for (const ghost of ghosts) {
    await threadStorage.update(ghost.id, ghost.organization_id, { status: "failed" });
    sseHub.emit(ghost.organization_id, createDecopilotThreadStatusEvent(ghost.id, "failed"));
    sseHub.emit(ghost.organization_id, createDecopilotFinishEvent(ghost.id, "failed"));
    console.warn("[decopilot] Cleaned up ghost run", { threadId: ghost.id });
  }
}).catch((err) => console.error("[decopilot] Ghost sweep failed", err));

This runs once on startup, non-blocking. Any thread stuck as in_progress without a corresponding run is a ghost.
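Note that in the snippet above a single rejected `update` ends the loop, because the rejection propagates to the outer `.catch` and the remaining ghosts are never processed. A minimal sketch of a variant that isolates failures per thread follows. `GhostThread`, `SweepDeps`, `sweepGhostThreads`, and `createMemoryDeps` are hypothetical names, and the in-memory store only stands in for the real thread storage and SSE hub, whose actual signatures may differ.

```typescript
// Hypothetical shapes; the real storage and SSE hub APIs may differ.
interface GhostThread {
  id: string;
  organization_id: string;
}

interface SweepDeps {
  listByStatus(status: string): Promise<GhostThread[]>;
  markFailed(thread: GhostThread): Promise<void>;
  notify(thread: GhostThread): void;
}

// Marks every in_progress thread as failed. A failure on one thread is
// recorded and the loop continues, so one bad row cannot block the rest.
async function sweepGhostThreads(
  deps: SweepDeps,
): Promise<{ cleaned: string[]; errors: string[] }> {
  const cleaned: string[] = [];
  const errors: string[] = [];
  const ghosts = await deps.listByStatus("in_progress");
  for (const ghost of ghosts) {
    try {
      await deps.markFailed(ghost);
      deps.notify(ghost);
      cleaned.push(ghost.id);
    } catch {
      errors.push(ghost.id);
    }
  }
  return { cleaned, errors };
}

// In-memory stand-in used only to exercise the sketch.
function createMemoryDeps(rows: Map<string, string>, failOn: Set<string>) {
  const notified: string[] = [];
  const deps: SweepDeps = {
    async listByStatus(status) {
      return Array.from(rows.entries())
        .filter(([, rowStatus]) => rowStatus === status)
        .map(([id]) => ({ id, organization_id: "org_1" }));
    },
    async markFailed(thread) {
      if (failOn.has(thread.id)) throw new Error("update failed");
      rows.set(thread.id, "failed");
    },
    notify(thread) {
      notified.push(thread.id);
    },
  };
  return { deps, notified };
}
```

Returning the cleaned and errored IDs, rather than only logging, makes it possible to assert on the sweep in a test and to retry the failed threads later if that turns out to be useful.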
3. Frontend: auto-cancel on resume failure (fast ghost resolution)
File: apps/mesh/src/web/components/chat/chat-provider.tsx (TaskStreamManager, line ~129)
When tryResumeStream fails (which means /attach returned 204), instead of retrying 3 times with 30s polling, immediately call the cancel endpoint on the first failure:
// In the .catch handler after resume fails:
chatStore.cancelRun(); // triggers ghost detection server-side (routes.ts:391-413)

The cancel endpoint already has ghost detection that force-fails the thread and emits SSE events.
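The change in control flow can be sketched as a small helper that attempts to resume once and cancels immediately on failure. `resumeOrCancel`, `tryResume`, and `cancelRun` are hypothetical names standing in for `tryResumeStream` and `chatStore.cancelRun()`; the real `TaskStreamManager` wiring is not shown in this plan and may look different.

```typescript
type ResumeOutcome = "resumed" | "cancelled";

// Attempts to attach to a live run exactly once. If the attempt rejects
// (for example because /attach returned 204), the run is cancelled right
// away instead of being retried.
async function resumeOrCancel(
  tryResume: () => Promise<void>,
  cancelRun: () => Promise<void> | void,
): Promise<ResumeOutcome> {
  try {
    await tryResume();
    return "resumed";
  } catch {
    await cancelRun();
    return "cancelled";
  }
}
```

Taking both operations as parameters keeps the helper independent of the store, so it can be exercised with plain stubs.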
4. "Continue" button in EmptyAssistantState
File: apps/mesh/src/web/components/chat/message/assistant.tsx (line 370)
Replace the static EmptyAssistantState with a component that shows a "Continue" button when the thread was interrupted. The button sends a contextual message like:
"The previous run was interrupted by a server restart. Please continue where you left off. Here's a brief summary of what was being done: [last user message content]"
Implementation:
- `EmptyAssistantState` needs access to: whether this is the last pair, the thread status (failed), and the user's last message
- Pass `isLast` and the user message from `MessagePair` props down to `MessageAssistant`
- When `isLast && message === null && !isLoading && thread.status === "failed"`:
  - Show "Run was interrupted" text
  - Render a "Continue" button that calls `chatStore.sendMessage()` with a pre-built continuation prompt
  - The prompt includes the last user message text for context
function EmptyAssistantState({ isLast, userMessage }: { isLast: boolean; userMessage?: ChatMessage }) {
  const threadStatus = useChatStore(s => {
    const thread = s.threads.find(t => t.id === s.activeThreadId);
    return thread?.status;
  });

  // Ghost/interrupted run: show continue button
  if (isLast && threadStatus === "failed" && userMessage) {
    // Fall back to an empty list so a message without parts yields ""
    // instead of the string "undefined" in the prompt below.
    const userText = (userMessage.parts ?? [])
      .filter(p => p.type === "text")
      .map(p => p.text)
      .join(" ")
      .slice(0, 200);

    return (
      <div className="flex flex-col gap-2 py-2">
        <div className="text-[14px] text-muted-foreground/60">
          Run was interrupted by a server restart
        </div>
        <button
          className="text-[13px] text-primary hover:underline self-start"
          onClick={() => {
            chatStore.sendMessage({
              parts: [{ type: "text", text: `The previous run was interrupted. Please continue where you left off. The original request was: "${userText}"` }],
            });
          }}
        >
          Continue conversation
        </button>
      </div>
    );
  }

  return (
    <div className="text-[14px] text-muted-foreground/60 py-2">
      No response was generated
    </div>
  );
}

Prop threading:
- `MessagePair` component (pair.tsx:59) already has `pair.user`; pass it to `MessageAssistant`
- `MessageAssistant` passes it to `EmptyAssistantState` when rendering the empty state
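The display condition and the prompt construction are easier to test when pulled out of the component. The sketch below does that under assumed shapes: `MessagePart`, `shouldOfferContinue`, `buildContinuationPrompt`, and `MAX_CONTEXT_CHARS` are hypothetical names, and the real `ChatMessage` part types may carry more variants than the two shown here.

```typescript
// Hypothetical, simplified part shape; real message parts may differ.
type TextPart = { type: "text"; text: string };
type MessagePart = TextPart | { type: "tool"; name: string };

const MAX_CONTEXT_CHARS = 200;

// True only for the empty last assistant slot of a failed thread.
function shouldOfferContinue(state: {
  isLast: boolean;
  hasAssistantMessage: boolean;
  isLoading: boolean;
  threadStatus: string | undefined;
}): boolean {
  return (
    state.isLast &&
    !state.hasAssistantMessage &&
    !state.isLoading &&
    state.threadStatus === "failed"
  );
}

// Joins the text parts of the last user message, truncates the result,
// and omits the quoted request entirely when there is no text to quote.
function buildContinuationPrompt(parts: MessagePart[] | undefined): string {
  const userText = (parts ?? [])
    .filter((p): p is TextPart => p.type === "text")
    .map((p) => p.text)
    .join(" ")
    .slice(0, MAX_CONTEXT_CHARS);
  const base =
    "The previous run was interrupted. Please continue where you left off.";
  return userText ? `${base} The original request was: "${userText}"` : base;
}
```

With these helpers the button's `onClick` would only need to pass the result of `buildContinuationPrompt(userMessage.parts)` as the text part of the outgoing message.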
5. Pass user message through component tree
File: apps/mesh/src/web/components/chat/message/pair.tsx (line 89)
Add userMessage prop to MessageAssistant:
<MessageAssistant
  message={pair.assistant}
  userMessage={pair.user} // NEW
  status={status}
  isLast={isLastPair}
  isPlanMode={isPlanMode}
/>

File: apps/mesh/src/web/components/chat/message/assistant.tsx
Add userMessage to MessageAssistant props and pass it to EmptyAssistantState.
Files to modify
| File | Change |
|---|---|
| `apps/mesh/src/storage/ports.ts` | Add `listByStatus()` to `ThreadStoragePort` |
| `apps/mesh/src/storage/threads.ts` | Implement `listByStatus()` query |
| `apps/mesh/src/api/app.ts` | Add startup ghost sweep (~line 318) |
| `apps/mesh/src/web/components/chat/chat-provider.tsx` | Auto-cancel on first resume failure |
| `apps/mesh/src/web/components/chat/message/assistant.tsx` | "Continue" button in `EmptyAssistantState` |
| `apps/mesh/src/web/components/chat/message/pair.tsx` | Pass `userMessage` to `MessageAssistant` |
Edge cases
- Multiple ghosts: Startup sweep handles all in one pass
- Concurrent hot reloads: Force-fail is idempotent (`in_progress` -> `failed` transition only)
- SSE reconnect: EventSource auto-reconnects after restart; ghost sweep SSE events emit after hub is ready
- Partial messages: Any messages saved at 5-step checkpoints survive; the gap between last checkpoint and crash is lost (acceptable for dev)
- Non-interrupted failures: The "Continue" button only shows when `isLast && message === null && threadStatus === "failed"`, so regular failures with partial responses won't trigger it (they have content)
- Claude Code memory: The SDK stores session history at `~/.claude/projects/`, so when the user sends the continue message, the new agent instance can load thread history from both our DB and the SDK's session files
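The idempotency claim for concurrent hot reloads can be illustrated with a guarded transition. This is a sketch under assumptions: `ThreadRow`, `ThreadStatus`, and `forceFail` are hypothetical, and in the real storage layer the guard would more likely live in the `WHERE` clause of the update than in application code.

```typescript
type ThreadStatus = "in_progress" | "completed" | "failed";

interface ThreadRow {
  id: string;
  status: ThreadStatus;
}

// Applies the in_progress -> failed transition and reports whether it
// changed anything. Any other starting status is left untouched, so
// calling this twice (or from two sweeps) has the same effect as once.
function forceFail(thread: ThreadRow): boolean {
  if (thread.status !== "in_progress") return false;
  thread.status = "failed";
  return true;
}
```

The boolean return value gives the caller a natural place to decide whether SSE events should be emitted, which would avoid duplicate notifications when two sweeps overlap.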
Verification
- Start a Claude Code run that takes time (e.g., "search the codebase for all TODO comments")
- While streaming, save a file to trigger hot reload
- Expected: within 1-2s, the thread transitions to "failed"
- UI shows "Run was interrupted by a server restart" + "Continue conversation" button
- Click "Continue" — sends a message with context, agent picks up where it left off
Future: True Resume (out of scope for now)
The Claude Agent SDK supports resume: sessionId + resumeSessionAt: messageUuid. A future enhancement could:
- Store a unique session UUID per thread (instead of `session_id: "chat"`)
- On restart, re-spawn the agent with `resume` to continue from where it left off
- Re-stream the resumed output to the client
This is complex (duplicate content detection, partial tool state, session file integrity) and better suited as a production feature with proper testing.