
Conversation

@XuPeng-SH
Contributor

@XuPeng-SH XuPeng-SH commented Oct 23, 2025

User description

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #22666

What this PR does / why we need it:

Fix snapshot (ss) leak on main


PR Type

Bug fix, Enhancement


Description

  • Implement unified GC scheduler to consolidate multiple goroutines into single scheduler

  • Replace manual snapshot partition management with new SnapshotManager for better lifecycle control

  • Add LRU eviction and age-based garbage collection for snapshot partitions

  • Introduce comprehensive metrics tracking for snapshot cache performance and GC operations
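
The consolidation in the first bullet follows the standard single-goroutine, multi-ticker pattern. Below is a minimal, self-contained sketch of that pattern; names and intervals are illustrative, and the PR's actual RunGCScheduler (quoted in the code suggestions further down) wires these ticks to TryGC, gcPartitionState, and MaybeStartGC.

package gcsched

import (
	"context"
	"log"
	"time"
)

// runGCScheduler drives every periodic GC task from one goroutine instead of
// one goroutine per task. The 20-minute intervals mirror the diagram below;
// the three callbacks stand in for the unused-table, partition-state, and
// snapshot GC steps.
func runGCScheduler(ctx context.Context, gcUnusedTables, gcPartitionState, gcSnapshots func()) {
	unusedTableTicker := time.NewTicker(20 * time.Minute)
	partitionStateTicker := time.NewTicker(20 * time.Minute)
	snapshotTicker := time.NewTicker(20 * time.Minute)
	defer unusedTableTicker.Stop()
	defer partitionStateTicker.Stop()
	defer snapshotTicker.Stop()

	for {
		select {
		case <-ctx.Done():
			log.Println("GC scheduler exit")
			return
		case <-unusedTableTicker.C:
			gcUnusedTables()
		case <-partitionStateTicker.C:
			gcPartitionState()
		case <-snapshotTicker.C:
			gcSnapshots()
		}
	}
}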


Diagram Walkthrough

flowchart LR
  A["Engine Initialization"] -->|"RunGCScheduler"| B["Unified GC Scheduler"]
  B -->|"Ticker: 20min"| C["Unused Table GC"]
  B -->|"Ticker: 20min"| D["Partition State GC"]
  B -->|"Ticker: 20min"| E["Snapshot GC"]
  F["Snapshot Operations"] -->|"Find/Add"| G["SnapshotManager"]
  G -->|"LRU + Age-based"| H["TrackedPartitions"]
  H -->|"Metrics"| I["SnapshotMetrics"]
  E -->|"MaybeStartGC"| G

File Walkthrough

Relevant files
Enhancement
tracked_partitions.go
New snapshot manager with LRU and age-based GC                     

pkg/vm/engine/disttae/tracked_partitions.go

  • New file implementing SnapshotManager for centralized snapshot
    partition lifecycle management
  • Provides TrackedPartition wrapper with access time tracking for LRU
    eviction
  • Implements Find/Add operations with automatic metrics collection
  • Includes GC with both age-based and count-based (LRU) eviction
    strategies
  • Defines SnapshotGCConfig and SnapshotMetrics for configuration and
    observability
+426/-0 
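
A minimal sketch of how a tracked-partition cache can combine age-based and count-based (LRU) eviction, per the bullets above. Only the last-access tracking and the two eviction strategies are taken from this walkthrough; every type, field, and parameter name below is illustrative rather than the PR's actual definition.

package snapcache

import (
	"sort"
	"sync"
	"time"
)

// trackedPartition pairs a cached snapshot partition with its last access time.
type trackedPartition struct {
	partition  any       // stands in for the real partition state
	lastAccess time.Time // refreshed on every find hit, used for age/LRU checks
}

// trackedPartitions is an illustrative cache keyed by snapshot identity.
type trackedPartitions struct {
	mu      sync.Mutex
	entries map[string]*trackedPartition
}

// find returns a cached partition and refreshes its access time.
func (t *trackedPartitions) find(key string) (any, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	e, ok := t.entries[key]
	if !ok {
		return nil, false
	}
	e.lastAccess = time.Now()
	return e.partition, true
}

// gc first drops entries older than maxAge, then, if the cache is still over
// maxCount, evicts the least recently used entries until it fits.
func (t *trackedPartitions) gc(maxAge time.Duration, maxCount int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	now := time.Now()
	for k, e := range t.entries {
		if now.Sub(e.lastAccess) > maxAge {
			delete(t.entries, k)
		}
	}
	if maxCount <= 0 || len(t.entries) <= maxCount {
		return
	}
	type kv struct {
		key string
		at  time.Time
	}
	rest := make([]kv, 0, len(t.entries))
	for k, e := range t.entries {
		rest = append(rest, kv{k, e.lastAccess})
	}
	sort.Slice(rest, func(i, j int) bool { return rest[i].at.Before(rest[j].at) })
	for _, old := range rest[:len(rest)-maxCount] {
		delete(t.entries, old.key)
	}
}
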
engine.go
Integrate unified GC scheduler into engine                             

pkg/vm/engine/disttae/engine.go

  • Replace manual snapParts map with SnapshotManager initialization
  • Add RunGCScheduler method that consolidates three separate GC
    goroutines into single scheduler
  • Add gcPartitionState helper method for partition state garbage
    collection
  • Remove individual GC ticker goroutines from InitLogTailPushModel
+46/-4   
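
The walkthrough names a gcPartitionState helper but does not show its body. A purely hypothetical shape for such a helper (no identifier below comes from the PR) is to compute a retention watermark and ask each cached partition state to truncate anything older.

package gcsketch

import "time"

// truncatable stands in for whatever a partition state exposes; the real
// method on the engine's partition states is not shown in this walkthrough.
type truncatable interface {
	Truncate(before time.Time)
}

// gcPartitionStateSketch walks the cached partition states and lets each one
// drop entries older than a retention watermark.
func gcPartitionStateSketch(states []truncatable, retention time.Duration) {
	watermark := time.Now().Add(-retention)
	for _, s := range states {
		s.Truncate(watermark)
	}
}
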
logtail_consumer.go
Remove individual GC goroutines from push client                 

pkg/vm/engine/disttae/logtail_consumer.go

  • Add gcSnapshotTicker constant for snapshot GC interval
  • Add TryGC method to PushClient for conditional GC execution
  • Remove individual GC ticker goroutines from InitLogTailPushModel
  • Consolidate GC logic into unified scheduler pattern
+13/-2   
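
TryGC's body is not shown here either; one common way to make a GC entry point "conditional" is to rate-limit it against a last-run timestamp. The sketch below is an assumption about the shape, not the PR's code; only the name TryGC and the idea of conditional execution come from the bullets above.

package gcsketch

import (
	"context"
	"sync/atomic"
	"time"
)

// tryGCSketch is an illustrative conditional GC entry point: the scheduler may
// call it on every tick, and the timestamp check bounds how often the
// expensive work actually runs. lastGC holds Unix nanoseconds.
func tryGCSketch(ctx context.Context, lastGC *atomic.Int64, minInterval time.Duration, gcUnusedTables func(context.Context)) {
	if time.Since(time.Unix(0, lastGC.Load())) < minInterval {
		return
	}
	lastGC.Store(time.Now().UnixNano())
	gcUnusedTables(ctx)
}
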
types.go
Replace manual snapshot map with manager                                 

pkg/vm/engine/disttae/types.go

  • Replace manual snapParts map structure with SnapshotManager pointer
  • Remove nested sync.Mutex and partition slice from Engine struct
  • Simplify snapshot partition management through manager abstraction
+3/-8     
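
In effect the change swaps a hand-rolled, mutex-guarded map for a single manager pointer on the Engine. A rough before/after sketch follows; only snapParts and the SnapshotManager pointer are named in this walkthrough, and the key and value types are guesses for illustration.

package enginesketch

import "sync"

// SnapshotManager is a placeholder for the manager introduced in
// tracked_partitions.go.
type SnapshotManager struct{}

// engineBefore approximates the old Engine shape from the bullets above: a
// nested, mutex-guarded map of snapshot partition slices owned by the Engine
// itself.
type engineBefore struct {
	snapParts struct {
		sync.Mutex
		data map[uint64][]any
	}
}

// engineAfter approximates the new shape: one pointer, with locking, lookup,
// and eviction moved behind the manager abstraction.
type engineAfter struct {
	snapshotMgr *SnapshotManager
}
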
disttae_engine.go
Start GC scheduler in test engine initialization                 

pkg/vm/engine/test/testutil/disttae_engine.go

  • Add RunGCScheduler call after InitLogTailPushModel in test engine
    setup
  • Improve error handling for InitLogTailPushModel initialization
  • Ensure GC scheduler is started in test environment
+8/-2     
distributed_tae.go
Start unified GC scheduler in service                                       

pkg/cnservice/distributed_tae.go

  • Add RunGCScheduler call after InitLogTailPushModel in service
    initialization
  • Start unified GC scheduler as separate goroutine
+3/-0     
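
The wiring in both cnservice and the test utilities amounts to one extra goroutine started right after push-model initialization. A hedged sketch: the method names come from this walkthrough, while the interface below trims InitLogTailPushModel's real argument list for brevity.

package wiring

import "context"

// gcEngine trims the Engine surface to the two calls this wiring needs.
type gcEngine interface {
	InitLogTailPushModel(ctx context.Context) error
	RunGCScheduler(ctx context.Context)
}

// start mirrors the service and test wiring described above: initialize the
// log-tail push model first, then launch the unified GC scheduler in its own
// goroutine so it lives for the engine's lifetime.
func start(ctx context.Context, e gcEngine) error {
	if err := e.InitLogTailPushModel(ctx); err != nil {
		return err
	}
	go e.RunGCScheduler(ctx)
	return nil
}
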
Tests
tracked_partitions_test.go
Complete test coverage for snapshot manager                           

pkg/vm/engine/disttae/tracked_partitions_test.go

  • Comprehensive test suite for TrackedPartition, TrackedPartitions, and
    SnapshotManager
  • Tests for LRU eviction, age-based GC, concurrent operations, and
    metrics tracking
  • Includes benchmark tests for performance validation
  • Tests edge cases like zero maxCount and empty table GC
+678/-0 
Bug fix
db.go
Refactor snapshot partition caching to use manager             

pkg/vm/engine/disttae/db.go

  • Replace manual snapshot partition caching logic with
    SnapshotManager.Find/Add calls
  • Remove sync.Mutex and nested struct for manual snapshot management
  • Update error logging to use zap structured logging
  • Add snapshot GC trigger via MaybeStartGC after adding new snapshots
  • Remove panic path and add proper error handling for snapshot creation
    failures
+31/-24 
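
Put together, the new lookup path in db.go reduces to: ask the manager first, build and Add on a miss, then let MaybeStartGC decide whether cleanup runs. A hedged sketch of that flow; the function name and signatures are assumptions, and only Find, Add, and MaybeStartGC are named in this walkthrough.

package dbsketch

// snapshotCache captures just the three manager calls the new db.go path
// makes; the method names follow this walkthrough, the signatures are
// illustrative.
type snapshotCache interface {
	Find(key string) (part any, ok bool)
	Add(key string, part any)
	MaybeStartGC()
}

// getSnapshotPartition sketches the cache-then-build flow described above:
// consult the manager, build and register on a miss, then give the manager a
// chance to GC. build and key stand in for the real partition construction
// and table identity.
func getSnapshotPartition(mgr snapshotCache, key string, build func() (any, error)) (any, error) {
	if p, ok := mgr.Find(key); ok {
		return p, nil // hit: Find also refreshes the entry's access time
	}
	p, err := build()
	if err != nil {
		return nil, err // error is returned instead of the old panic path
	}
	mgr.Add(key, p)
	mgr.MaybeStartGC() // cheap call; the manager decides whether GC really runs
	return p, nil
}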

@qodo-merge-pro

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢 No security concerns identified
No security vulnerabilities detected by AI analysis; human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

Compliance status legend
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-merge-pro

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Add panic recovery to GC scheduler

Add a deferred panic recovery mechanism to the RunGCScheduler function. This
will prevent any panics within the GC tasks from crashing the service.

pkg/vm/engine/disttae/engine.go [854-882]

 func (e *Engine) RunGCScheduler(ctx context.Context) {
+	defer func() {
+		if r := recover(); r != nil {
+			logutil.Errorf("GC scheduler panic recovered: %v", r)
+		}
+	}()
+
 	unusedTableTicker := time.NewTicker(unsubscribeProcessTicker)
 	partitionStateTicker := time.NewTicker(gcPartitionStateTicker)
 	snapshotTicker := time.NewTicker(gcSnapshotTicker)
 
 	defer unusedTableTicker.Stop()
 	defer partitionStateTicker.Stop()
 	defer snapshotTicker.Stop()
 
 	for {
 		select {
 		case <-ctx.Done():
 			logutil.Infof("GC scheduler exit.")
 			return
 
 		case <-unusedTableTicker.C:
 			// GC unused tables in PushClient
 			e.pClient.TryGC(ctx)
 
 		case <-partitionStateTicker.C:
 			// GC partition states
 			e.gcPartitionState(ctx)
 
 		case <-snapshotTicker.C:
 			// GC snapshot partitions
 			e.snapshotMgr.MaybeStartGC()
 		}
 	}
 }
Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies that the new RunGCScheduler function, which runs in a goroutine, lacks panic recovery, posing a risk of crashing the entire service. Adding a recover block is a critical robustness improvement.

Impact: Medium
Add panic recovery to GC goroutine

In MaybeStartGC, add a panic recovery mechanism to the asynchronously launched
GC goroutine. This ensures that any panic is caught and logged, and that the
gcRunning flag is correctly reset.

pkg/vm/engine/disttae/tracked_partitions.go [289-314]

 func (sm *SnapshotManager) MaybeStartGC() {
 	if !sm.config.Enabled {
 		return
 	}
 
 	// Check if enough time has passed since last GC
 	lastGCTime := time.Unix(0, sm.state.lastGCTime.Load())
 	if time.Since(lastGCTime) < sm.config.GCInterval {
 		return
 	}
 
 	// Check if GC is already running
 	if sm.state.gcRunning.Load() {
 		return
 	}
 
 	// Update last GC time and mark as running
 	sm.state.lastGCTime.Store(time.Now().UnixNano())
 	sm.state.gcRunning.Store(true)
 
 	// Run GC asynchronously
 	go func() {
-		defer sm.state.gcRunning.Store(false)
+		defer func() {
+			sm.state.gcRunning.Store(false)
+			if r := recover(); r != nil {
+				logutil.Error("Snapshot GC panic recovered", zap.Any("panic", r))
+			}
+		}()
 		sm.RunGC()
 	}()
 }
Suggestion importance[1-10]: 5


Why: The suggestion's reasoning that a panic would leave gcRunning as true is incorrect, as the existing defer statement would still execute. However, adding a recover block is still a valid improvement for robustness and logging, so the suggestion has moderate value.

Impact: Low

@mergify mergify bot added the queued label Oct 24, 2025
@mergify mergify bot merged commit 37a81b8 into matrixorigin:main Oct 24, 2025
19 checks passed
@mergify mergify bot removed the queued label Oct 24, 2025

Labels

kind/bug - Something isn't working
Review effort 3/5
size/XL - Denotes a PR that changes [1000, 1999] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants