feat(chat): resilient hot-reload recovery for decopilot runs

# Plan: Resilient Hot-Reload Recovery for Decopilot Runs

## Context

When the dev server hot-reloads (`bun --hot`) mid-stream, the Claude Code agent child process is killed, the in-memory RunRegistry is wiped, and the NATS JetStream buffer (memory-only) is lost. The thread nevertheless stays `in_progress` in the DB, because `stopAll()` dispatches `FORCE_FAIL` as fire-and-forget and the async reactor may not complete before the process is replaced.

The frontend detects `isRunInProgress`, tries `/attach` which returns 204 (no run in registry), retries 3 times, then gives up. The user sees "No response was generated" + "Run in progress" stuck forever — the only escape is manually clicking cancel.

**Answer to the question**: No, the agent is NOT still running after hot reload. The child process is killed. But the Claude Code SDK stores conversation history on disk (`~/.claude/projects/`), so thread context survives restarts. We can leverage this by sending a "continue" message with context.

## Approach: Detect Ghost + Auto-Continue

1. **Server startup**: Sweep the DB for ghost threads (`in_progress` with no run in the registry) and mark them as `failed`
2. **Frontend**: When a ghost is detected, replace "No response was generated" with a "Continue" button that sends a contextual resume message

## Changes

### 1. Add `listByStatus()` to thread storage

**Files**: `apps/mesh/src/storage/threads.ts` and `apps/mesh/src/storage/ports.ts`

Add method to find all ghost threads on startup:
```typescript
listByStatus(status: string): Promise<Array<{ id: string; organization_id: string }>>
// SELECT id, organization_id FROM thread WHERE status = $1
```
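For unit tests, the contract of `listByStatus()` can be mirrored by an in-memory stand-in. This is a minimal sketch, not the real storage class: `ThreadRow`, `ThreadRef`, and `InMemoryThreadStorage` are hypothetical names, and only the three columns mentioned in this plan are modeled.

```typescript
// Hypothetical row shape: only the fields this plan relies on.
interface ThreadRow {
  id: string;
  organization_id: string;
  status: string;
}

type ThreadRef = { id: string; organization_id: string };

// In-memory stand-in for the real thread storage (assumption, for tests only).
class InMemoryThreadStorage {
  private readonly rows: ThreadRow[];

  constructor(rows: ThreadRow[]) {
    this.rows = rows;
  }

  // Mirrors: SELECT id, organization_id FROM thread WHERE status = $1
  async listByStatus(status: string): Promise<ThreadRef[]> {
    return this.rows
      .filter((row) => row.status === status)
      .map(({ id, organization_id }) => ({ id, organization_id }));
  }
}

const storage = new InMemoryThreadStorage([
  { id: "t1", organization_id: "org-a", status: "in_progress" },
  { id: "t2", organization_id: "org-a", status: "completed" },
  { id: "t3", organization_id: "org-b", status: "in_progress" },
]);
```

Returning only `id` and `organization_id` keeps the result small, which matters if many threads share a status.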

### 2. Server startup ghost-run sweep

**File**: `apps/mesh/src/api/app.ts` (~line 318, after RunRegistry creation)

After creating the RunRegistry, run an async sweep:
```typescript
// Fire-and-forget: clean up any threads left in_progress by the previous process
threadStorage.listByStatus("in_progress").then(async (ghosts) => {
  for (const ghost of ghosts) {
    try {
      await threadStorage.update(ghost.id, ghost.organization_id, { status: "failed" });
      sseHub.emit(ghost.organization_id, createDecopilotThreadStatusEvent(ghost.id, "failed"));
      sseHub.emit(ghost.organization_id, createDecopilotFinishEvent(ghost.id, "failed"));
      console.warn("[decopilot] Cleaned up ghost run", { threadId: ghost.id });
    } catch (err) {
      // One failed update should not leave the remaining ghosts stuck
      console.error("[decopilot] Failed to clean up ghost run", { threadId: ghost.id, err });
    }
  }
}).catch((err) => console.error("[decopilot] Ghost sweep failed", err));
```

The sweep runs once on startup and does not block server initialization. Because the RunRegistry has just been created and is empty, every thread still marked `in_progress` at that point is a ghost.

### 3. Frontend: auto-cancel on resume failure (fast ghost resolution)

**File**: `apps/mesh/src/web/components/chat/chat-provider.tsx` (TaskStreamManager, line ~129)

When `tryResumeStream` fails (which means `/attach` returned 204), instead of retrying 3 times with 30s polling, immediately call the cancel endpoint on the first failure:

```typescript
// In the .catch handler after resume fails:
chatStore.cancelRun(); // triggers ghost detection server-side (routes.ts:391-413)
```

The cancel endpoint already has ghost detection that force-fails the thread and emits SSE events.
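The resume-then-cancel flow can be sketched as a small helper with both operations injected. This is an illustration under assumptions: `resumeOrCancel` and `ResumeOutcome` are hypothetical names, and the real `tryResumeStream` and `cancelRun` may have different signatures.

```typescript
// Hypothetical outcome type, used only in this sketch.
type ResumeOutcome = "resumed" | "cancelled";

// Both callbacks are injected so the flow can be tested without a server.
// Assumption: tryResumeStream rejects when /attach finds no run to attach to.
async function resumeOrCancel(
  tryResumeStream: () => Promise<void>,
  cancelRun: () => Promise<void>,
): Promise<ResumeOutcome> {
  try {
    await tryResumeStream();
    return "resumed";
  } catch {
    // The first failure is treated as a ghost: no retries, no polling.
    await cancelRun();
    return "cancelled";
  }
}
```

Cancelling on the first failure trades retry tolerance for fast recovery, which seems reasonable here because a 204 from `/attach` already means the registry has no such run.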

### 4. "Continue" button in EmptyAssistantState

**File**: `apps/mesh/src/web/components/chat/message/assistant.tsx` (line 370)

Replace the static `EmptyAssistantState` with a component that shows a "Continue" button when the thread was interrupted. The button sends a contextual message like:

> "The previous run was interrupted by a server restart. Please continue where you left off. Here's a brief summary of what was being done: [last user message content]"

Implementation:
- `EmptyAssistantState` needs access to: whether this is the last pair, the thread status (failed), and the user's last message
- Pass `isLast` and the user message from `MessagePair` props down to `MessageAssistant`
- When `isLast && message === null && !isLoading && thread.status === "failed"`:
  - Show "Run was interrupted" text
  - Render a "Continue" button that calls `chatStore.sendMessage()` with a pre-built continuation prompt
  - The prompt includes the last user message text for context

```tsx
function EmptyAssistantState({ isLast, userMessage }: { isLast: boolean; userMessage?: ChatMessage }) {
  const threadStatus = useChatStore(s => {
    const thread = s.threads.find(t => t.id === s.activeThreadId);
    return thread?.status;
  });

  // Ghost/interrupted run — show continue button
  if (isLast && threadStatus === "failed" && userMessage) {
    // Fall back to an empty list so a message without parts does not render as "undefined"
    const userText = (userMessage.parts ?? [])
      .filter(p => p.type === "text")
      .map(p => p.text)
      .join(" ")
      .slice(0, 200);

    return (
      <div className="flex flex-col gap-2 py-2">
        <div className="text-[14px] text-muted-foreground/60">
          Run was interrupted by a server restart
        </div>
        <button
          className="text-[13px] text-primary hover:underline self-start"
          onClick={() => {
            chatStore.sendMessage({
              parts: [{ type: "text", text: `The previous run was interrupted. Please continue where you left off. The original request was: "${userText}"` }],
            });
          }}
        >
          Continue conversation
        </button>
      </div>
    );
  }

  return (
    <div className="text-[14px] text-muted-foreground/60 py-2">
      No response was generated
    </div>
  );
}
```

**Prop threading**:
- `MessagePair` component (pair.tsx:59) already has `pair.user` — pass it to `MessageAssistant`
- `MessageAssistant` passes it to `EmptyAssistantState` when rendering the empty state
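The prompt construction can also be pulled out of the component into a pure helper, which makes the truncation and the empty-message case easy to test. This is a sketch under assumptions: `MessagePart`, `extractUserText`, and `buildContinuationPrompt` are hypothetical names, and the real `ChatMessage` part type likely has more variants.

```typescript
// Hypothetical minimal part shape; the real type may carry more fields.
interface MessagePart {
  type: string;
  text?: string;
}

const MAX_CONTEXT_CHARS = 200;

// Joins all text parts and truncates the result to keep the prompt short.
function extractUserText(
  parts: MessagePart[] | undefined,
  maxChars: number = MAX_CONTEXT_CHARS,
): string {
  return (parts ?? [])
    .filter((part) => part.type === "text" && typeof part.text === "string")
    .map((part) => part.text as string)
    .join(" ")
    .slice(0, maxChars);
}

// Builds the continuation message; omits the quote when there is no user text.
function buildContinuationPrompt(parts: MessagePart[] | undefined): string {
  const base =
    "The previous run was interrupted. Please continue where you left off.";
  const userText = extractUserText(parts);
  return userText ? `${base} The original request was: "${userText}"` : base;
}
```

With such a helper, the button's `onClick` would only need to pass the result to `chatStore.sendMessage()`.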

### 5. Pass user message through component tree

**File**: `apps/mesh/src/web/components/chat/message/pair.tsx` (line 89)

Add `userMessage` prop to `MessageAssistant`:
```tsx
<MessageAssistant
  message={pair.assistant}
  userMessage={pair.user}  // NEW
  status={status}
  isLast={isLastPair}
  isPlanMode={isPlanMode}
/>
```

**File**: `apps/mesh/src/web/components/chat/message/assistant.tsx`

Add `userMessage` to `MessageAssistant` props and pass it to `EmptyAssistantState`.

## Files to modify

| File | Change |
|------|--------|
| `apps/mesh/src/storage/ports.ts` | Add `listByStatus()` to `ThreadStoragePort` |
| `apps/mesh/src/storage/threads.ts` | Implement `listByStatus()` query |
| `apps/mesh/src/api/app.ts` | Add startup ghost sweep (~line 318) |
| `apps/mesh/src/web/components/chat/chat-provider.tsx` | Auto-cancel on first resume failure |
| `apps/mesh/src/web/components/chat/message/assistant.tsx` | "Continue" button in `EmptyAssistantState` |
| `apps/mesh/src/web/components/chat/message/pair.tsx` | Pass `userMessage` to `MessageAssistant` |

## Edge cases

- **Multiple ghosts**: Startup sweep handles all in one pass
- **Concurrent hot reloads**: Force-fail is idempotent (`in_progress` -> `failed` transition only)
- **SSE reconnect**: EventSource auto-reconnects after restart; ghost sweep SSE events emit after hub is ready
- **Partial messages**: Any messages saved at 5-step checkpoints survive; the gap between last checkpoint and crash is lost (acceptable for dev)
- **Non-interrupted failures**: The "Continue" button only shows when `isLast && message === null && threadStatus === "failed"` — regular failures with partial responses won't trigger it (they have content)
- **Claude Code memory**: The SDK stores session history at `~/.claude/projects/`, so when the user sends the continue message, the new agent instance can load thread history from both our DB and the SDK's session files
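The idempotency claim for concurrent hot reloads can be illustrated with a guarded transition. This is a minimal in-memory sketch, not the real storage code: `ThreadStatus`, `threads`, and `forceFail` are hypothetical, and the `completed` status is assumed for illustration only.

```typescript
// Hypothetical status set; only in_progress and failed appear in this plan.
type ThreadStatus = "in_progress" | "completed" | "failed";

// Hypothetical in-memory table keyed by thread id.
const threads = new Map<string, ThreadStatus>();

// Returns true only when this call performed the in_progress -> failed
// transition; any other starting state is left untouched.
function forceFail(threadId: string): boolean {
  if (threads.get(threadId) !== "in_progress") return false;
  threads.set(threadId, "failed");
  return true;
}

threads.set("t1", "in_progress");
threads.set("t2", "completed");
```

In SQL, the same guard would roughly correspond to adding `AND status = 'in_progress'` to the `UPDATE`, so a second sweep or a late cancel becomes a no-op rather than overwriting a newer status.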

## Verification

1. Start a Claude Code run that takes time (e.g., "search the codebase for all TODO comments")
2. While streaming, save a file to trigger hot reload
3. Expected: within 1-2s, the thread transitions to "failed"
4. UI shows "Run was interrupted by a server restart" + "Continue conversation" button
5. Click "Continue conversation": a message with context is sent and the agent picks up where it left off

## Future: True Resume (out of scope for now)

The Claude Agent SDK supports `resume: sessionId` + `resumeSessionAt: messageUuid`. A future enhancement could:
- Store a unique session UUID per thread (instead of `session_id: "chat"`)
- On restart, re-spawn the agent with `resume` to continue from where it left off
- Re-stream the resumed output to the client

This is complex (duplicate content detection, partial tool state, session file integrity) and better suited as a production feature with proper testing.
