fix: bound inference validation goroutines and add HTTP timeouts #828

Open
ouicate wants to merge 2 commits into gonka-ai:upgrade-v0.2.11 from ouicate:fix/inference-validation-goroutines

Conversation

@ouicate ouicate commented Feb 28, 2026

Unbounded goroutine fan-out in inference validation could cause memory exhaustion after validator downtime. Each goroutine blocks for up to ~53 minutes (payload retrieval retries + ML node lock retries) while holding stack memory, HTTP connections, and file descriptors.

Root causes addressed:

1. Unbounded goroutines in `ExecuteRecoveryValidations`:
   - Replaced the unconditional `go func()` per missed inference with a bounded worker pool (`maxConcurrentValidations` = 10 workers)
   - Workers consume from a buffered channel, capping concurrent goroutines regardless of how many validations were missed

2. Unbounded goroutines in `SampleInferenceToValidate`:
   - Same bounded worker pool pattern (max 10 workers)
   - Wrapped in a background goroutine to preserve the fire-and-forget behavior required by the event-handler caller

3. Cross-path mutual exclusion:
   - Added a `recoveryRunning` `atomic.Bool` to `InferenceValidator` to prevent concurrent recovery executions across all three trigger paths (epoch-transition dispatcher, startup auto-recovery, and the admin HTTP endpoint). All paths share the same `*InferenceValidator` instance (verified via the `main.go` wiring).

4. Missing HTTP timeout on payload retrieval:
   - `payloadRetrievalClient` now has a 30-second timeout instead of none
   - Prevents goroutines from hanging forever on unresponsive executors

5. Missing HTTP timeout on ML node inference:
   - Replaced the bare `http.Post()` (no timeout) with `validationHTTPClient` (5-minute timeout) for ML node inference calls
   - Prevents permanent goroutine leaks on slow or overloaded ML nodes
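The bounded worker pool from points 1 and 2 can be sketched as below. `maxConcurrentValidations` matches the constant named in this PR; `runBoundedValidations` and the `validate` callback are illustrative stand-ins for the real validation code, not the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// maxConcurrentValidations caps concurrent validation goroutines (value from the PR).
const maxConcurrentValidations = 10

// runBoundedValidations drains the missed-inference IDs with a fixed-size
// worker pool instead of spawning one goroutine per inference. It returns
// the number of inferences processed.
func runBoundedValidations(inferenceIDs []string, validate func(string)) int64 {
	jobs := make(chan string, len(inferenceIDs))
	var processed int64
	var wg sync.WaitGroup

	// Start a fixed number of workers; memory use is bounded no matter
	// how many validations were missed during downtime.
	for w := 0; w < maxConcurrentValidations; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				validate(id)
				atomic.AddInt64(&processed, 1)
			}
		}()
	}

	for _, id := range inferenceIDs {
		jobs <- id
	}
	close(jobs)
	wg.Wait()
	return processed
}

func main() {
	n := runBoundedValidations([]string{"inf-1", "inf-2", "inf-3"}, func(id string) {})
	fmt.Println(n) // 3
}
```

The buffered channel lets the producer enqueue all IDs without blocking, while the worker count, not the queue length, bounds concurrency.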
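The cross-path mutual exclusion in point 3 amounts to a compare-and-swap guard. `InferenceValidator`, `recoveryRunning`, and `ExecuteRecoveryValidations` follow the names in this PR; the method body here is a minimal sketch, not the real recovery logic.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// InferenceValidator is shared by all three trigger paths (epoch-transition
// dispatcher, startup auto-recovery, admin HTTP endpoint), so one flag on it
// serializes recovery across all of them.
type InferenceValidator struct {
	recoveryRunning atomic.Bool
}

// ExecuteRecoveryValidations proceeds only if no other path is already
// running recovery; CompareAndSwap makes the check-and-set atomic, so two
// concurrent callers cannot both pass the guard.
func (v *InferenceValidator) ExecuteRecoveryValidations() bool {
	if !v.recoveryRunning.CompareAndSwap(false, true) {
		return false // another trigger path is already recovering
	}
	defer v.recoveryRunning.Store(false)
	// ... perform recovery validations ...
	return true
}

func main() {
	v := &InferenceValidator{}
	fmt.Println(v.ExecuteRecoveryValidations()) // true
}
```

Unlike a plain boolean check followed by an assignment, the single `CompareAndSwap` call leaves no window in which two paths can both observe "not running".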
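The timeout fixes in points 4 and 5 come down to constructing `http.Client` values with explicit `Timeout` fields instead of relying on the zero default (no timeout). The client variable names match the PR; the durations are the ones stated above.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// A zero Timeout on http.Client (and the default client behind bare
// http.Post) means requests can block forever, pinning the goroutine.
var (
	// 30s cap for payload retrieval from executors.
	payloadRetrievalClient = &http.Client{Timeout: 30 * time.Second}
	// 5-minute cap for ML node inference calls, which legitimately run long.
	validationHTTPClient = &http.Client{Timeout: 5 * time.Minute}
)

func main() {
	fmt.Println(payloadRetrievalClient.Timeout, validationHTTPClient.Timeout) // 30s 5m0s
}
```

`Client.Timeout` covers the whole request, including connection, redirects, and reading the body, so a hung executor or overloaded ML node can delay a validation goroutine by at most the configured duration.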

@tcharchian tcharchian requested a review from patimen March 2, 2026 23:49