
Conversation

@XuPeng-SH
Contributor

@XuPeng-SH XuPeng-SH commented Oct 23, 2025

User description

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #22666

What this PR does / why we need it:

Fix snapshot (ss) leak on main


PR Type

Bug fix, Enhancement


Description

  • Implement unified GC scheduler to consolidate multiple goroutines into single scheduler

  • Replace manual snapshot partition management with new SnapshotManager for better lifecycle control

  • Add LRU eviction and age-based garbage collection for snapshot partitions

  • Introduce comprehensive metrics tracking for snapshot cache performance and GC operations
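
The consolidation in the first bullet follows the standard single-goroutine, multi-ticker pattern. Below is a minimal, self-contained sketch of that pattern; names and intervals are illustrative, and the PR's actual RunGCScheduler (quoted in the code suggestions further down) wires these ticks to TryGC, gcPartitionState, and MaybeStartGC.

package gcsched

import (
	"context"
	"log"
	"time"
)

// runGCScheduler drives every periodic GC task from one goroutine instead of
// one goroutine per task. The 20-minute intervals mirror the diagram below;
// the three callbacks stand in for the unused-table, partition-state, and
// snapshot GC steps.
func runGCScheduler(ctx context.Context, gcUnusedTables, gcPartitionState, gcSnapshots func()) {
	unusedTableTicker := time.NewTicker(20 * time.Minute)
	partitionStateTicker := time.NewTicker(20 * time.Minute)
	snapshotTicker := time.NewTicker(20 * time.Minute)
	defer unusedTableTicker.Stop()
	defer partitionStateTicker.Stop()
	defer snapshotTicker.Stop()

	for {
		select {
		case <-ctx.Done():
			log.Println("GC scheduler exit")
			return
		case <-unusedTableTicker.C:
			gcUnusedTables()
		case <-partitionStateTicker.C:
			gcPartitionState()
		case <-snapshotTicker.C:
			gcSnapshots()
		}
	}
}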


Diagram Walkthrough

flowchart LR
  A["Engine Initialization"] -->|"RunGCScheduler"| B["Unified GC Scheduler"]
  B -->|"Ticker: 20min"| C["Unused Table GC"]
  B -->|"Ticker: 20min"| D["Partition State GC"]
  B -->|"Ticker: 20min"| E["Snapshot GC"]
  F["Snapshot Operations"] -->|"Find/Add"| G["SnapshotManager"]
  G -->|"LRU + Age-based"| H["TrackedPartitions"]
  H -->|"Metrics"| I["SnapshotMetrics"]
  E -->|"MaybeStartGC"| G

File Walkthrough

Relevant files
Enhancement
tracked_partitions.go
New snapshot manager with LRU and age-based GC                     

pkg/vm/engine/disttae/tracked_partitions.go

  • New file implementing SnapshotManager for centralized snapshot
    partition lifecycle management
  • Provides TrackedPartition wrapper with access time tracking for LRU
    eviction
  • Implements Find/Add operations with automatic metrics collection
  • Includes GC with both age-based and count-based (LRU) eviction
    strategies
  • Defines SnapshotGCConfig and SnapshotMetrics for configuration and
    observability
+426/-0 
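
A minimal sketch of how a tracked-partition cache can combine age-based and count-based (LRU) eviction, per the bullets above. Only the last-access tracking and the two eviction strategies are taken from this walkthrough; every type, field, and parameter name below is illustrative rather than the PR's actual definition.

package snapcache

import (
	"sort"
	"sync"
	"time"
)

// trackedPartition pairs a cached snapshot partition with its last access time.
type trackedPartition struct {
	partition  any       // stands in for the real partition state
	lastAccess time.Time // refreshed on every find hit, used for age/LRU checks
}

// trackedPartitions is an illustrative cache keyed by snapshot identity.
type trackedPartitions struct {
	mu      sync.Mutex
	entries map[string]*trackedPartition
}

// find returns a cached partition and refreshes its access time.
func (t *trackedPartitions) find(key string) (any, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	e, ok := t.entries[key]
	if !ok {
		return nil, false
	}
	e.lastAccess = time.Now()
	return e.partition, true
}

// gc first drops entries older than maxAge, then, if the cache is still over
// maxCount, evicts the least recently used entries until it fits.
func (t *trackedPartitions) gc(maxAge time.Duration, maxCount int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	now := time.Now()
	for k, e := range t.entries {
		if now.Sub(e.lastAccess) > maxAge {
			delete(t.entries, k)
		}
	}
	if maxCount <= 0 || len(t.entries) <= maxCount {
		return
	}
	type kv struct {
		key string
		at  time.Time
	}
	rest := make([]kv, 0, len(t.entries))
	for k, e := range t.entries {
		rest = append(rest, kv{k, e.lastAccess})
	}
	sort.Slice(rest, func(i, j int) bool { return rest[i].at.Before(rest[j].at) })
	for _, old := range rest[:len(rest)-maxCount] {
		delete(t.entries, old.key)
	}
}
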
engine.go
Integrate unified GC scheduler into engine                             

pkg/vm/engine/disttae/engine.go

  • Replace manual snapParts map with SnapshotManager initialization
  • Add RunGCScheduler method that consolidates three separate GC
    goroutines into single scheduler
  • Add gcPartitionState helper method for partition state garbage
    collection
  • Remove individual GC ticker goroutines from InitLogTailPushModel
+46/-4   
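
The walkthrough names a gcPartitionState helper but does not show its body. A purely hypothetical shape for such a helper (no identifier below comes from the PR) is to compute a retention watermark and ask each cached partition state to truncate anything older.

package gcsketch

import "time"

// truncatable stands in for whatever a partition state exposes; the real
// method on the engine's partition states is not shown in this walkthrough.
type truncatable interface {
	Truncate(before time.Time)
}

// gcPartitionStateSketch walks the cached partition states and lets each one
// drop entries older than a retention watermark.
func gcPartitionStateSketch(states []truncatable, retention time.Duration) {
	watermark := time.Now().Add(-retention)
	for _, s := range states {
		s.Truncate(watermark)
	}
}
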
logtail_consumer.go
Remove individual GC goroutines from push client                 

pkg/vm/engine/disttae/logtail_consumer.go

  • Add gcSnapshotTicker constant for snapshot GC interval
  • Add TryGC method to PushClient for conditional GC execution
  • Remove individual GC ticker goroutines from InitLogTailPushModel
  • Consolidate GC logic into unified scheduler pattern
+13/-2   
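
TryGC's body is not shown here either; one common way to make a GC entry point "conditional" is to rate-limit it against a last-run timestamp. The sketch below is an assumption about the shape, not the PR's code; only the name TryGC and the idea of conditional execution come from the bullets above.

package gcsketch

import (
	"context"
	"sync/atomic"
	"time"
)

// tryGCSketch is an illustrative conditional GC entry point: the scheduler may
// call it on every tick, and the timestamp check bounds how often the
// expensive work actually runs. lastGC holds Unix nanoseconds.
func tryGCSketch(ctx context.Context, lastGC *atomic.Int64, minInterval time.Duration, gcUnusedTables func(context.Context)) {
	if time.Since(time.Unix(0, lastGC.Load())) < minInterval {
		return
	}
	lastGC.Store(time.Now().UnixNano())
	gcUnusedTables(ctx)
}
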
types.go
Replace manual snapshot map with manager                                 

pkg/vm/engine/disttae/types.go

  • Replace manual snapParts map structure with SnapshotManager pointer
  • Remove nested sync.Mutex and partition slice from Engine struct
  • Simplify snapshot partition management through manager abstraction
+3/-8     
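
In effect the change swaps a hand-rolled, mutex-guarded map for a single manager pointer on the Engine. A rough before/after sketch follows; only snapParts and the SnapshotManager pointer are named in this walkthrough, and the key and value types are guesses for illustration.

package enginesketch

import "sync"

// SnapshotManager is a placeholder for the manager introduced in
// tracked_partitions.go.
type SnapshotManager struct{}

// engineBefore approximates the old Engine shape from the bullets above: a
// nested, mutex-guarded map of snapshot partition slices owned by the Engine
// itself.
type engineBefore struct {
	snapParts struct {
		sync.Mutex
		data map[uint64][]any
	}
}

// engineAfter approximates the new shape: one pointer, with locking, lookup,
// and eviction moved behind the manager abstraction.
type engineAfter struct {
	snapshotMgr *SnapshotManager
}
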
disttae_engine.go
Start GC scheduler in test engine initialization                 

pkg/vm/engine/test/testutil/disttae_engine.go

  • Add RunGCScheduler call after InitLogTailPushModel in test engine
    setup
  • Improve error handling for InitLogTailPushModel initialization
  • Ensure GC scheduler is started in test environment
+8/-2     
distributed_tae.go
Start unified GC scheduler in service                                       

pkg/cnservice/distributed_tae.go

  • Add RunGCScheduler call after InitLogTailPushModel in service
    initialization
  • Start unified GC scheduler as separate goroutine
+3/-0     
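
The wiring in both cnservice and the test utilities amounts to one extra goroutine started right after push-model initialization. A hedged sketch: the method names come from this walkthrough, while the interface below trims InitLogTailPushModel's real argument list for brevity.

package wiring

import "context"

// gcEngine trims the Engine surface to the two calls this wiring needs.
type gcEngine interface {
	InitLogTailPushModel(ctx context.Context) error
	RunGCScheduler(ctx context.Context)
}

// start mirrors the service and test wiring described above: initialize the
// log-tail push model first, then launch the unified GC scheduler in its own
// goroutine so it lives for the engine's lifetime.
func start(ctx context.Context, e gcEngine) error {
	if err := e.InitLogTailPushModel(ctx); err != nil {
		return err
	}
	go e.RunGCScheduler(ctx)
	return nil
}
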
Tests
tracked_partitions_test.go
Complete test coverage for snapshot manager                           

pkg/vm/engine/disttae/tracked_partitions_test.go

  • Comprehensive test suite for TrackedPartition, TrackedPartitions, and
    SnapshotManager
  • Tests for LRU eviction, age-based GC, concurrent operations, and
    metrics tracking
  • Includes benchmark tests for performance validation
  • Tests edge cases like zero maxCount and empty table GC
+678/-0 
Bug fix
db.go
Refactor snapshot partition caching to use manager             

pkg/vm/engine/disttae/db.go

  • Replace manual snapshot partition caching logic with
    SnapshotManager.Find/Add calls
  • Remove sync.Mutex and nested struct for manual snapshot management
  • Update error logging to use zap structured logging
  • Add snapshot GC trigger via MaybeStartGC after adding new snapshots
  • Remove panic path and add proper error handling for snapshot creation
    failures
+31/-24 
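
Put together, the new lookup path in db.go reduces to: ask the manager first, build and Add on a miss, then let MaybeStartGC decide whether cleanup runs. A hedged sketch of that flow; the function name and signatures are assumptions, and only Find, Add, and MaybeStartGC are named in this walkthrough.

package dbsketch

// snapshotCache captures just the three manager calls the new db.go path
// makes; the method names follow this walkthrough, the signatures are
// illustrative.
type snapshotCache interface {
	Find(key string) (part any, ok bool)
	Add(key string, part any)
	MaybeStartGC()
}

// getSnapshotPartition sketches the cache-then-build flow described above:
// consult the manager, build and register on a miss, then give the manager a
// chance to GC. build and key stand in for the real partition construction
// and table identity.
func getSnapshotPartition(mgr snapshotCache, key string, build func() (any, error)) (any, error) {
	if p, ok := mgr.Find(key); ok {
		return p, nil // hit: Find also refreshes the entry's access time
	}
	p, err := build()
	if err != nil {
		return nil, err // error is returned instead of the old panic path
	}
	mgr.Add(key, p)
	mgr.MaybeStartGC() // cheap call; the manager decides whether GC really runs
	return p, nil
}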

@qodo-merge-pro

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢 No security concerns identified
No security vulnerabilities detected by AI analysis; human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

Compliance status legend
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-merge-pro

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Add panic recovery to GC scheduler

Add a deferred panic recovery mechanism to the RunGCScheduler function. This
will prevent any panics within the GC tasks from crashing the service.

pkg/vm/engine/disttae/engine.go [854-882]

 func (e *Engine) RunGCScheduler(ctx context.Context) {
+	defer func() {
+		if r := recover(); r != nil {
+			logutil.Errorf("GC scheduler panic recovered: %v", r)
+		}
+	}()
+
 	unusedTableTicker := time.NewTicker(unsubscribeProcessTicker)
 	partitionStateTicker := time.NewTicker(gcPartitionStateTicker)
 	snapshotTicker := time.NewTicker(gcSnapshotTicker)
 
 	defer unusedTableTicker.Stop()
 	defer partitionStateTicker.Stop()
 	defer snapshotTicker.Stop()
 
 	for {
 		select {
 		case <-ctx.Done():
 			logutil.Infof("GC scheduler exit.")
 			return
 
 		case <-unusedTableTicker.C:
 			// GC unused tables in PushClient
 			e.pClient.TryGC(ctx)
 
 		case <-partitionStateTicker.C:
 			// GC partition states
 			e.gcPartitionState(ctx)
 
 		case <-snapshotTicker.C:
 			// GC snapshot partitions
 			e.snapshotMgr.MaybeStartGC()
 		}
 	}
 }
Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies that the new RunGCScheduler function, which runs in a goroutine, lacks panic recovery, posing a risk of crashing the entire service. Adding a recover block is a critical robustness improvement.

Impact: Medium
Add panic recovery to GC goroutine

In MaybeStartGC, add a panic recovery mechanism to the asynchronously launched
GC goroutine. This ensures that any panic is caught and logged, and that the
gcRunning flag is correctly reset.

pkg/vm/engine/disttae/tracked_partitions.go [289-314]

 func (sm *SnapshotManager) MaybeStartGC() {
 	if !sm.config.Enabled {
 		return
 	}
 
 	// Check if enough time has passed since last GC
 	lastGCTime := time.Unix(0, sm.state.lastGCTime.Load())
 	if time.Since(lastGCTime) < sm.config.GCInterval {
 		return
 	}
 
 	// Check if GC is already running
 	if sm.state.gcRunning.Load() {
 		return
 	}
 
 	// Update last GC time and mark as running
 	sm.state.lastGCTime.Store(time.Now().UnixNano())
 	sm.state.gcRunning.Store(true)
 
 	// Run GC asynchronously
 	go func() {
-		defer sm.state.gcRunning.Store(false)
+		defer func() {
+			sm.state.gcRunning.Store(false)
+			if r := recover(); r != nil {
+				logutil.Error("Snapshot GC panic recovered", zap.Any("panic", r))
+			}
+		}()
 		sm.RunGC()
 	}()
 }
Suggestion importance[1-10]: 5


Why: The suggestion's reasoning that a panic would leave gcRunning as true is incorrect, as the existing defer statement would still execute. However, adding a recover block is still a valid improvement for robustness and logging, so the suggestion has moderate value.

Impact: Low

@mergify mergify bot added the queued label Oct 24, 2025
@mergify mergify bot merged commit 37a81b8 into matrixorigin:main Oct 24, 2025
19 checks passed
@mergify mergify bot removed the queued label Oct 24, 2025

Labels

kind/bug - Something isn't working
Review effort 3/5
size/XL - Denotes a PR that changes [1000, 1999] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants