fix: prevent worker memory leaks with recycling, memory logging, and Sentry scope clearing #771
Open

autumnbillinginternalapp[bot] wants to merge 2 commits into main from
Conversation
fix: prevent worker memory leaks with recycling, memory logging, and Sentry scope clearing

- Add process recycling after 500k messages (cluster primary auto-respawns)
- Add memory stats (rss/heap) to periodic log output for leak detection
- Clear Sentry scope after each batch to prevent breadcrumb/tag accumulation
- Track total messages processed per worker lifetime

Context: Workers silently stopped receiving SQS messages after hours of uptime due to memory growth. Memory climbed from 33% to 44% max before workers became unresponsive. Only a full restart recovered them.
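The respawn half of the recycling mechanism relies on Node's built-in cluster module. A minimal sketch of the primary's exit handling follows; the log format and handler body are assumptions for illustration, not the actual initWorkers.ts code:

```typescript
import cluster from "node:cluster";

// Sketch only: the cluster primary watches for worker exits and forks a
// replacement, so a worker that recycles itself is replaced automatically.
if (cluster.isPrimary) {
  cluster.on("exit", (worker, code, signal) => {
    console.log(
      `worker ${worker.process.pid} exited (code=${code}, signal=${signal}), respawning`
    );
    cluster.fork(); // fresh process, fresh memory
  });
}
```

Because the primary re-forks on any worker exit, a deliberate `process.exit(0)` after 500k messages gets the same respawn treatment as a crash.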
Contributor
1 issue found across 1 file
Confidence score: 3/5
- There is a concrete lifecycle risk in server/src/queue/initWorkers.ts: recycling stops polling but never exits, so workers can sit idle and not be respawned, undermining the memory-reclaim goal.
- This is a medium-severity behavior change (6/10) with user-impacting implications for worker availability, so the merge carries some risk.
- Pay close attention to server/src/queue/initWorkers.ts and ensure recycling triggers the shutdown/exit path so workers actually restart.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/queue/initWorkers.ts">
<violation number="1" location="server/src/queue/initWorkers.ts:261">
P2: Recycling stops the polling loop but never exits the process, so the worker will sit idle and won’t be respawned. Trigger the existing shutdown path (SIGTERM) or exit the process here so recycling actually frees memory.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Addresses review feedback — setting isRunning=false only stops the loop but leaves the process alive and idle. process.exit(0) ensures the cluster primary detects the exit and respawns a fresh worker.
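The fix described in that reply can be sketched roughly as follows. Constant, variable, and function names here are assumptions for illustration; the actual code lives in server/src/queue/initWorkers.ts:

```typescript
// Sketch of the recycling check at the end of a batch (names are assumptions).
const MAX_MESSAGES_PER_WORKER = 500_000;

let totalMessagesProcessed = 0;
let isRunning = true;

function shouldRecycle(processed: number): boolean {
  return processed >= MAX_MESSAGES_PER_WORKER;
}

function onBatchComplete(batchSize: number): void {
  totalMessagesProcessed += batchSize;
  if (shouldRecycle(totalMessagesProcessed)) {
    isRunning = false; // stops the polling loop...
    // ...but ending the loop alone leaves the process alive and idle, so exit
    // explicitly; the cluster primary sees the exit and forks a replacement.
    process.exit(0);
  }
}
```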
What
Prevents the silent worker death we saw today (workers stopped processing SQS messages after hours of uptime due to memory growth).
Changes
Process recycling — workers exit gracefully after 500k messages. The cluster primary automatically respawns them with fresh memory. This is the safety net.
Memory logging — adds rss and heapUsed to the periodic stats log line so we can see memory growth in Axiom and catch leaks before they cause outages.

Sentry scope clearing — calls Sentry.getCurrentScope().clear() after each batch to prevent breadcrumb/tag accumulation from thousands of setSentryTags() calls.

Context
Today at ~09:21 UTC, Worker 52 silently stopped receiving SQS messages. Worker 53 followed at ~11:12 UTC. No errors, no crashes — the workers kept polling but got empty responses for 8 hours until a restart at 19:21 fixed it. ECS metrics showed memory climbing from 33% → 44% max before failure.
Testing
Minimal risk — only adds logging, a scope clear, and a graceful exit path that the existing cluster respawn logic already handles.
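As a rough illustration of the logging change, the periodic stats line can pull its numbers straight from process.memoryUsage(). The helper below is a sketch; the field names and log shape are assumptions rather than the project's actual schema:

```typescript
// Sketch: surface rss and heapUsed in the periodic stats log so memory growth
// is visible in Axiom. Field names here are illustrative assumptions.
function memoryStats(): { rssMb: number; heapUsedMb: number } {
  const { rss, heapUsed } = process.memoryUsage();
  const toMb = (bytes: number) => Math.round((bytes / 1024 / 1024) * 10) / 10;
  return { rssMb: toMb(rss), heapUsedMb: toMb(heapUsed) };
}

// Example periodic log line:
console.log(JSON.stringify({ msg: "worker stats", ...memoryStats() }));
```

Emitting both values matters: heapUsed growing alone points at JS object retention, while rss growing with a flat heap points at native or buffer leaks.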
Summary by cubic
Prevents worker memory leaks that stalled SQS polling by adding process recycling, memory usage logging, and clearing Sentry scope after batches. Workers now exit and respawn predictably, and memory growth is visible.
Written for commit 532b9e8. Summary will update on new commits.
Greptile Summary
Addresses production memory leak causing workers to silently stop processing SQS messages after hours of uptime by adding three defensive mechanisms.
- Adds rss and heapUsed in periodic stats to catch memory growth early in Axiom before outages occur
- Calls Sentry.getCurrentScope().clear() after each batch to prevent breadcrumb/tag accumulation from thousands of setSentryTags() calls

The changes directly respond to the incident at ~09:21 UTC where Worker 52 stopped receiving messages and Worker 53 followed, with ECS metrics showing memory climbing from 33% to 44% before failure. The worker recycling acts as a safety net, while memory logging provides observability, and Sentry scope clearing prevents a known source of memory accumulation.
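To illustrate why the per-batch clear matters, here is a toy stand-in for Sentry's scope (not the real SDK) showing how breadcrumbs and tags would otherwise accumulate across thousands of messages; the real call in the PR is Sentry.getCurrentScope().clear():

```typescript
// Toy stand-in for Sentry's scope, illustrating the accumulation problem.
class ToyScope {
  breadcrumbs: string[] = [];
  tags = new Map<string, string>();
  addBreadcrumb(message: string): void { this.breadcrumbs.push(message); }
  setTag(key: string, value: string): void { this.tags.set(key, value); }
  clear(): void { this.breadcrumbs = []; this.tags.clear(); }
}

const scope = new ToyScope();

function processBatch(messages: string[]): void {
  for (const msg of messages) {
    scope.addBreadcrumb(`processing ${msg}`); // grows with every message
    scope.setTag("lastMessageId", msg);
  }
  scope.clear(); // reset per batch so the scope's memory stays bounded
}
```

Without the final `clear()`, the breadcrumb array grows for the lifetime of the worker, which is exactly the slow accumulation the PR is defending against.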
Confidence Score: 5/5
Important Files Changed
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Worker Polls SQS Queue] --> B{Messages Received?}
    B -->|Yes| C[Process Messages in Batch]
    C --> D[Delete Messages from Queue]
    D --> E[Sentry.getCurrentScope.clear]
    E --> F{totalMessagesProcessed >= 500k?}
    F -->|Yes| G[Set isRunning = false]
    G --> H[Break Polling Loop]
    H --> I[Worker Process Exits]
    I --> J[Cluster Primary Detects Exit]
    J --> K[cluster.fork - Spawn New Worker]
    K --> A
    F -->|No| L[Continue Polling]
    L --> A
    B -->|No| M[Handle Empty Poll]
    M --> A
```

Last reviewed commit: eb4b0ad