feat(chat): resilient hot-reload recovery for decopilot runs #2711

@vibegui

Plan: Resilient Hot-Reload Recovery for Decopilot Runs

Context

When the dev server hot-reloads (bun --hot) mid-stream, the Claude Code agent child process is killed, the in-memory RunRegistry is wiped, and the NATS JetStream buffer (memory-only) is lost. The thread nevertheless stays in_progress in the DB, because stopAll() dispatches FORCE_FAIL as fire-and-forget and the async reactor may not complete before the process is replaced.

The frontend detects isRunInProgress, tries /attach which returns 204 (no run in registry), retries 3 times, then gives up. The user sees "No response was generated" + "Run in progress" stuck forever — the only escape is manually clicking cancel.

Answer to the question: No, the agent is NOT still running after hot reload. The child process is killed. But the Claude Code SDK stores conversation history on disk (~/.claude/projects/), so thread context survives restarts. We can leverage this by sending a "continue" message with context.

Approach: Detect Ghost + Auto-Continue

  1. Server startup: Sweep the DB for ghost threads (in_progress with no run in the registry) and mark them as failed
  2. Frontend: When ghost detected, replace "No response was generated" with a "Continue" button that sends a contextual resume message

Changes

1. Add listByStatus() to thread storage

File: apps/mesh/src/storage/threads.ts + apps/mesh/src/storage/ports.ts

Add method to find all ghost threads on startup:

listByStatus(status: string): Promise<Array<{ id: string; organization_id: string }>>
// SELECT id, organization_id FROM thread WHERE status = $1
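
To make the contract concrete, here is a minimal in-memory sketch of the method, of the kind one might use as a test double. The names ThreadRow, ThreadStatusLookup and InMemoryThreadStorage are hypothetical and not part of the codebase; only the listByStatus() signature comes from the plan above.

```typescript
// Hypothetical row shape; the real thread table has more columns.
interface ThreadRow {
  id: string;
  organization_id: string;
  status: string;
}

// Only the slice of the storage port that the startup sweep needs.
interface ThreadStatusLookup {
  listByStatus(
    status: string,
  ): Promise<Array<{ id: string; organization_id: string }>>;
}

// In-memory stand-in for the SQL query
//   SELECT id, organization_id FROM thread WHERE status = $1
class InMemoryThreadStorage implements ThreadStatusLookup {
  private readonly rows: ThreadRow[];

  constructor(rows: ThreadRow[]) {
    this.rows = rows;
  }

  async listByStatus(
    status: string,
  ): Promise<Array<{ id: string; organization_id: string }>> {
    // Project to the two columns the sweep uses, mirroring the SELECT list.
    return this.rows
      .filter((row) => row.status === status)
      .map(({ id, organization_id }) => ({ id, organization_id }));
  }
}
```

Returning only id and organization_id keeps the result small and matches what the sweep needs to call update() and emit per-organization SSE events.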

2. Server startup ghost-run sweep

File: apps/mesh/src/api/app.ts (~line 318, after RunRegistry creation)

After creating the RunRegistry, run an async sweep:

// Fire-and-forget: clean up any threads left in_progress from previous process
threadStorage.listByStatus("in_progress").then(async (ghosts) => {
  for (const ghost of ghosts) {
    await threadStorage.update(ghost.id, ghost.organization_id, { status: "failed" });
    sseHub.emit(ghost.organization_id, createDecopilotThreadStatusEvent(ghost.id, "failed"));
    sseHub.emit(ghost.organization_id, createDecopilotFinishEvent(ghost.id, "failed"));
    console.warn("[decopilot] Cleaned up ghost run", { threadId: ghost.id });
  }
}).catch(err => console.error("[decopilot] Ghost sweep failed", err));

This runs once on startup and does not block it. Because the RunRegistry has just been created and is still empty, any thread still marked in_progress has no corresponding run and is therefore a ghost.
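
One thing the loop above leaves open is what happens when a single update throws: the remaining ghosts would be skipped. The sketch below shows one possible variant that isolates failures per thread and takes its dependencies as parameters so it can be exercised without a database. SweepDeps, sweepGhostRuns and the three callback names are hypothetical; in the real code they would wrap threadStorage and sseHub.

```typescript
interface GhostThread {
  id: string;
  organization_id: string;
}

// Hypothetical seams around threadStorage and sseHub.
interface SweepDeps {
  listInProgress(): Promise<GhostThread[]>;
  markFailed(ghost: GhostThread): Promise<void>;
  notify(ghost: GhostThread): void;
}

interface SweepResult {
  cleaned: string[]; // thread ids that were force-failed
  failed: string[]; // thread ids whose cleanup threw
}

async function sweepGhostRuns(deps: SweepDeps): Promise<SweepResult> {
  const result: SweepResult = { cleaned: [], failed: [] };
  const ghosts = await deps.listInProgress();

  for (const ghost of ghosts) {
    try {
      // Persist first, then notify, so clients never see an event
      // for a status that was not actually written.
      await deps.markFailed(ghost);
      deps.notify(ghost);
      result.cleaned.push(ghost.id);
    } catch {
      // One bad row should not stop the rest of the sweep.
      result.failed.push(ghost.id);
    }
  }
  return result;
}
```

The caller can log result.failed once at the end, which keeps the sweep itself free of logging concerns.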

3. Frontend: auto-cancel on resume failure (fast ghost resolution)

File: apps/mesh/src/web/components/chat/chat-provider.tsx (TaskStreamManager, line ~129)

When tryResumeStream fails (which means /attach returned 204), instead of retrying 3 times with 30s polling, immediately call the cancel endpoint on the first failure:

// In the .catch handler after resume fails:
chatStore.cancelRun(); // triggers ghost detection server-side (routes.ts:391-413)

The cancel endpoint already has ghost detection that force-fails the thread and emits SSE events.
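
The control flow described here can be sketched as a small helper. This is an illustration only: resumeOrCancel and its two callbacks are hypothetical names, and it assumes a failed resume shows up either as a rejected promise or as a false result.

```typescript
type ResumeOutcome = "resumed" | "cancelled";

// tryResume and cancelRun are hypothetical stand-ins for
// tryResumeStream and chatStore.cancelRun().
async function resumeOrCancel(
  tryResume: () => Promise<boolean>,
  cancelRun: () => Promise<void>,
): Promise<ResumeOutcome> {
  let attached = false;
  try {
    attached = await tryResume();
  } catch {
    // A rejected resume is treated the same as "nothing to attach to".
    attached = false;
  }

  if (attached) return "resumed";

  // First failure: cancel immediately instead of retrying, so the
  // server-side ghost detection can force-fail the thread.
  await cancelRun();
  return "cancelled";
}
```

Returning an outcome rather than throwing lets the caller decide whether to update UI state or simply wait for the SSE status event.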

4. "Continue" button in EmptyAssistantState

File: apps/mesh/src/web/components/chat/message/assistant.tsx (line 370)

Replace the static EmptyAssistantState with a component that shows a "Continue" button when the thread was interrupted. The button sends a contextual message like:

"The previous run was interrupted by a server restart. Please continue where you left off. Here's a brief summary of what was being done: [last user message content]"

Implementation:

  • EmptyAssistantState needs access to: whether this is the last pair, the thread status (failed), and the user's last message
  • Pass isLast and the user message from MessagePair props down to MessageAssistant
  • When isLast && message === null && !isLoading && thread.status === "failed":
    • Show "Run was interrupted" text
    • Render a "Continue" button that calls chatStore.sendMessage() with a pre-built continuation prompt
    • The prompt includes the last user message text for context

function EmptyAssistantState({ isLast, userMessage }: { isLast: boolean; userMessage?: ChatMessage }) {
  const threadStatus = useChatStore(s => {
    const thread = s.threads.find(t => t.id === s.activeThreadId);
    return thread?.status;
  });

  // Ghost/interrupted run — show continue button
  if (isLast && threadStatus === "failed" && userMessage) {
    const userText = userMessage.parts
      ?.filter(p => p.type === "text")
      .map(p => p.text)
      .join(" ")
      .slice(0, 200);

    return (
      <div className="flex flex-col gap-2 py-2">
        <div className="text-[14px] text-muted-foreground/60">
          Run was interrupted by a server restart
        </div>
        <button
          className="text-[13px] text-primary hover:underline self-start"
          onClick={() => {
            chatStore.sendMessage({
              parts: [{ type: "text", text: `The previous run was interrupted. Please continue where you left off. The original request was: "${userText}"` }],
            });
          }}
        >
          Continue conversation
        </button>
      </div>
    );
  }

  return (
    <div className="text-[14px] text-muted-foreground/60 py-2">
      No response was generated
    </div>
  );
}

Prop threading:

  • MessagePair component (pair.tsx:59) already has pair.user — pass it to MessageAssistant
  • MessageAssistant passes it to EmptyAssistantState when rendering the empty state
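
The prompt construction could also live in a small pure function, which makes it testable outside React. The sketch below is one possible shape, not the component's actual code: MessagePart is a hypothetical simplification of the real ChatMessage part union, and the "file" variant exists only to show non-text parts being skipped. It uses a type guard so that p.text type-checks, and JSON.stringify so that quotes inside the user's text cannot break the quoted request.

```typescript
interface TextPart {
  type: "text";
  text: string;
}

// Hypothetical part union; the real ChatMessage parts may differ.
type MessagePart = TextPart | { type: "file"; url: string };

const CONTINUATION_BASE =
  "The previous run was interrupted. Please continue where you left off.";

function extractUserText(
  parts: MessagePart[] | undefined,
  maxLength = 200,
): string {
  const text = (parts ?? [])
    .filter((p): p is TextPart => p.type === "text")
    .map((p) => p.text)
    .join(" ")
    .trim();
  // Mark truncation explicitly so the agent knows the request was cut.
  return text.length > maxLength ? `${text.slice(0, maxLength)}...` : text;
}

function buildContinuationPrompt(parts: MessagePart[] | undefined): string {
  const userText = extractUserText(parts);
  if (userText === "") return CONTINUATION_BASE;
  // JSON.stringify adds the surrounding quotes and escapes inner ones.
  return `${CONTINUATION_BASE} The original request was: ${JSON.stringify(userText)}`;
}
```

With this in place, the button's onClick would only need to pass buildContinuationPrompt(userMessage.parts) as the text part of the message.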

5. Pass user message through component tree

File: apps/mesh/src/web/components/chat/message/pair.tsx (line 89)

Add userMessage prop to MessageAssistant:

<MessageAssistant
  message={pair.assistant}
  userMessage={pair.user}  // NEW
  status={status}
  isLast={isLastPair}
  isPlanMode={isPlanMode}
/>

File: apps/mesh/src/web/components/chat/message/assistant.tsx

Add userMessage to MessageAssistant props and pass it to EmptyAssistantState.

Files to modify

File | Change
apps/mesh/src/storage/ports.ts | Add listByStatus() to ThreadStoragePort
apps/mesh/src/storage/threads.ts | Implement listByStatus() query
apps/mesh/src/api/app.ts | Add startup ghost sweep (~line 318)
apps/mesh/src/web/components/chat/chat-provider.tsx | Auto-cancel on first resume failure
apps/mesh/src/web/components/chat/message/assistant.tsx | "Continue" button in EmptyAssistantState
apps/mesh/src/web/components/chat/message/pair.tsx | Pass userMessage to MessageAssistant

Edge cases

  • Multiple ghosts: Startup sweep handles all in one pass
  • Concurrent hot reloads: Force-fail is idempotent (in_progress -> failed transition only)
  • SSE reconnect: EventSource auto-reconnects after restart; ghost sweep SSE events emit after hub is ready
  • Partial messages: Any messages saved at 5-step checkpoints survive; the gap between last checkpoint and crash is lost (acceptable for dev)
  • Non-interrupted failures: The "Continue" button only shows when isLast && message === null && threadStatus === "failed" — regular failures with partial responses won't trigger it (they have content)
  • Claude Code memory: The SDK stores session history at ~/.claude/projects/, so when the user sends the continue message, the new agent instance can load thread history from both our DB and the SDK's session files
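
The idempotency claim for concurrent hot reloads depends on the force-fail being a guarded transition rather than an unconditional write. The sketch below illustrates that idea with an in-memory map. ThreadStatusTable is a hypothetical stand-in, and the SQL in the comment is an assumed equivalent, not a query taken from the codebase.

```typescript
// In-memory illustration of a guarded status transition.
class ThreadStatusTable {
  private readonly statuses = new Map<string, string>();

  set(threadId: string, status: string): void {
    this.statuses.set(threadId, status);
  }

  get(threadId: string): string | undefined {
    return this.statuses.get(threadId);
  }

  // Assumed SQL equivalent:
  //   UPDATE thread SET status = 'failed'
  //   WHERE id = $1 AND status = 'in_progress'
  // Returns true only when this call performed the transition.
  forceFailIfInProgress(threadId: string): boolean {
    if (this.statuses.get(threadId) !== "in_progress") return false;
    this.statuses.set(threadId, "failed");
    return true;
  }
}
```

A second sweep, or a cancel that races with the sweep, then becomes a no-op, and a thread that finished in the meantime is never overwritten. The boolean result could also be used to emit SSE events only when the status actually changed.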

Verification

  1. Start a Claude Code run that takes time (e.g., "search the codebase for all TODO comments")
  2. While streaming, save a file to trigger hot reload
  3. Expected: within 1-2s, the thread transitions to "failed"
  4. UI shows "Run was interrupted by a server restart" + "Continue conversation" button
  5. Click "Continue conversation": the UI sends a message with context, and the agent picks up where it left off

Future: True Resume (out of scope for now)

The Claude Agent SDK supports resume: sessionId + resumeSessionAt: messageUuid. A future enhancement could:

  • Store a unique session UUID per thread (instead of session_id: "chat")
  • On restart, re-spawn the agent with resume to continue from where it left off
  • Re-stream the resumed output to the client

This is complex (duplicate content detection, partial tool state, session file integrity) and better suited as a production feature with proper testing.
