fix: prevent worker memory leaks with recycling, memory logging, and Sentry scope clearing #771

Open
autumnbillinginternalapp[bot] wants to merge 2 commits into main from jj/worker-memory-leak-fixes

Conversation


autumnbillinginternalapp bot commented Feb 19, 2026

What

Prevents the silent worker death we saw today (workers stopped processing SQS messages after hours of uptime due to memory growth).

Changes

  1. Process recycling — workers exit gracefully after 500k messages. The cluster primary automatically respawns them with fresh memory. This is the safety net.

  2. Memory logging — adds rss and heapUsed to the periodic stats log line so we can see memory growth in Axiom and catch leaks before they cause outages.

  3. Sentry scope clearing — calls Sentry.getCurrentScope().clear() after each batch to prevent breadcrumb/tag accumulation from thousands of setSentryTags() calls.
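The three changes can be sketched together as a minimal, self-contained loop. This is an illustrative sketch, not the actual `initWorkers.ts` code: `MAX_MESSAGES_PER_WORKER`, `pollLoop`, and `memoryStats` are assumed names, and `Sentry` is stubbed here so the sketch runs standalone (the real code calls `Sentry.getCurrentScope().clear()` from `@sentry/node`):

```typescript
// Illustrative sketch of the worker loop changes; not the actual initWorkers.ts.
const MAX_MESSAGES_PER_WORKER = 500_000; // assumed constant name

// Stand-in for the @sentry/node scope API so the sketch is self-contained.
const Sentry = {
  getCurrentScope: () => ({ clear: () => {} }),
};

let totalMessagesProcessed = 0;

// Change 2: surface rss/heapUsed in the periodic stats log line.
function memoryStats() {
  const { rss, heapUsed } = process.memoryUsage();
  return { rss, heapUsed, totalMessagesProcessed };
}

// Change 1: recycle once the lifetime message count crosses the threshold.
function shouldRecycle(): boolean {
  return totalMessagesProcessed >= MAX_MESSAGES_PER_WORKER;
}

async function pollLoop(pollBatch: () => Promise<number>) {
  while (true) {
    totalMessagesProcessed += await pollBatch();

    // Change 3: drop accumulated breadcrumbs/tags after each batch.
    Sentry.getCurrentScope().clear();

    console.log(JSON.stringify(memoryStats()));

    if (shouldRecycle()) {
      // Graceful exit; the cluster primary respawns a fresh worker.
      process.exit(0);
    }
  }
}
```

The recycle threshold check and the memory snapshot are kept as small pure-ish helpers so they are easy to exercise in isolation.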

Context

Today at ~09:21 UTC, Worker 52 silently stopped receiving SQS messages. Worker 53 followed at ~11:12 UTC. No errors, no crashes — the workers kept polling but got empty responses for 8 hours until a restart at 19:21 fixed it. ECS metrics showed memory climbing from 33% → 44% max before failure.

Testing

Minimal risk — only adds logging, a scope clear, and a graceful exit path that the existing cluster respawn logic already handles.


Summary by cubic

Prevents worker memory leaks that stalled SQS polling by adding process recycling, memory usage logging, and clearing Sentry scope after batches. Workers now exit and respawn predictably, and memory growth is visible.

  • Bug Fixes
    • Exit with code 0 after 500k messages; cluster primary auto-respawns a fresh worker.
    • Log rss, heapUsed, and total messages in periodic stats to catch growth early.
    • Clear Sentry scope after each batch to prevent breadcrumb/tag buildup.

Written for commit 532b9e8. Summary will update on new commits.

Greptile Summary

Addresses production memory leak causing workers to silently stop processing SQS messages after hours of uptime by adding three defensive mechanisms.

  • Bug Fixes
    • Added process recycling: workers gracefully exit after 500k messages; cluster primary auto-respawns with fresh memory as safety net against leaks
    • Added memory logging: logs rss and heapUsed in periodic stats to catch memory growth early in Axiom before outages occur
    • Added Sentry scope clearing: calls Sentry.getCurrentScope().clear() after each batch to prevent breadcrumb/tag accumulation from thousands of setSentryTags() calls

The changes directly respond to the incident at ~09:21 UTC where Worker 52 stopped receiving messages and Worker 53 followed, with ECS metrics showing memory climbing from 33% to 44% before failure. The worker recycling acts as a safety net, while memory logging provides observability, and Sentry scope clearing prevents a known source of memory accumulation.

Confidence Score: 5/5

  • Safe to merge - implements defensive, low-risk observability and graceful recycling mechanisms
  • All three changes are defensive additions with minimal risk: process recycling reuses existing cluster respawn logic, memory logging is read-only observability, and Sentry scope clearing is a standard cleanup pattern. No logic changes to message processing.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| server/src/queue/initWorkers.ts | Added process recycling after 500k messages, memory logging (rss/heap), and Sentry scope clearing to prevent worker memory leaks |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Worker Polls SQS Queue] --> B{Messages Received?}
    B -->|Yes| C[Process Messages in Batch]
    C --> D[Delete Messages from Queue]
    D --> E[Sentry.getCurrentScope.clear]
    E --> F{totalMessagesProcessed >= 500k?}
    F -->|Yes| G[Set isRunning = false]
    G --> H[Break Polling Loop]
    H --> I[Worker Process Exits]
    I --> J[Cluster Primary Detects Exit]
    J --> K[cluster.fork - Spawn New Worker]
    K --> A
    F -->|No| L[Continue Polling]
    L --> A
    B -->|No| M[Handle Empty Poll]
    M --> A

Last reviewed commit: eb4b0ad

…Sentry scope clearing

- Add process recycling after 500k messages (cluster primary auto-respawns)
- Add memory stats (rss/heap) to periodic log output for leak detection
- Clear Sentry scope after each batch to prevent breadcrumb/tag accumulation
- Track total messages processed per worker lifetime

Context: Workers silently stopped receiving SQS messages after hours of
uptime due to memory growth. Memory climbed from 33% to 44% max before
workers became unresponsive. Only a full restart recovered them.

vercel bot commented Feb 19, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| autumn-vite | Ready | Preview, Comment | Feb 19, 2026 10:59pm |


cubic-dev-ai bot left a comment


1 issue found across 1 file

Confidence score: 3/5

  • There is a concrete lifecycle risk in server/src/queue/initWorkers.ts: recycling stops polling but never exits, so workers can sit idle and not be respawned, undermining the memory-reclaim goal.
  • This is a medium-severity behavior change (6/10) with user-impacting implications for worker availability, so the merge carries some risk.
  • Pay close attention to server/src/queue/initWorkers.ts - ensure recycling triggers the shutdown/exit path so workers actually restart.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/queue/initWorkers.ts">

<violation number="1" location="server/src/queue/initWorkers.ts:261">
P2: Recycling stops the polling loop but never exits the process, so the worker will sit idle and won’t be respawned. Trigger the existing shutdown path (SIGTERM) or exit the process here so recycling actually frees memory.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Addresses review feedback — setting isRunning=false only stops the loop
but leaves the process alive and idle. process.exit(0) ensures the
cluster primary detects the exit and respawns a fresh worker.
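The fix described above can be sketched as a small helper. This is an assumed shape, not the actual code; `exitFn` is injectable here purely so the sketch can be exercised without terminating the process, while the real code calls `process.exit(0)` directly:

```typescript
// Sketch of the review fix: recycling must exit the process so the
// cluster primary's 'exit' handler fires and forks a replacement.
// Merely setting isRunning = false leaves an idle, leaky process alive.
type ExitFn = (code: number) => void;

function maybeRecycle(
  totalMessagesProcessed: number,
  limit: number,
  exitFn: ExitFn = process.exit,
): boolean {
  if (totalMessagesProcessed >= limit) {
    exitFn(0); // exit code 0: graceful recycle still triggers respawn
    return true;
  }
  return false; // keep polling
}
```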
