fix: bound inference validation goroutines and add HTTP timeouts #828

Open
ouicate wants to merge 2 commits into gonka-ai:upgrade-v0.2.11 from ouicate:fix/inference-validation-goroutines

Conversation

@ouicate ouicate commented Feb 28, 2026

Unbounded goroutine fan-out in inference validation could cause memory exhaustion after validator downtime. Each goroutine blocks for up to ~53 minutes (payload retrieval retries + ML node lock retries) while holding stack memory, HTTP connections, and file descriptors.

Root causes addressed:

1. Unbounded goroutines in `ExecuteRecoveryValidations`:
   - Replaced the unconditional `go func()` per missed inference with a bounded worker pool (`maxConcurrentValidations` = 10 workers)
   - Workers consume from a buffered channel, capping concurrent goroutines regardless of how many validations were missed

2. Unbounded goroutines in `SampleInferenceToValidate`:
   - Same bounded worker pool pattern (max 10 workers)
   - Wrapped in a background goroutine to preserve the fire-and-forget behavior required by the event-handler caller

3. Cross-path mutual exclusion:
   - Added a `recoveryRunning` `atomic.Bool` to `InferenceValidator` to prevent concurrent recovery executions across all three trigger paths (epoch-transition dispatcher, startup auto-recovery, and the admin HTTP endpoint). All paths share the same `*InferenceValidator` instance (verified via the `main.go` wiring).

4. Missing HTTP timeout on payload retrieval:
   - `payloadRetrievalClient` now has a 30-second timeout instead of none
   - Prevents goroutines from hanging forever on unresponsive executors

5. Missing HTTP timeout on ML node inference:
   - Replaced the bare `http.Post()` (no timeout) with `validationHTTPClient` (5-minute timeout) for ML node inference calls
   - Prevents permanent goroutine leaks on slow or overloaded ML nodes
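The bounded worker pool from points 1 and 2 can be sketched as below. `maxConcurrentValidations` matches the constant named in this PR; `runBoundedValidations` and the `validate` callback are illustrative stand-ins for the real validation code, not the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// maxConcurrentValidations caps concurrent validation goroutines (value from the PR).
const maxConcurrentValidations = 10

// runBoundedValidations drains the missed-inference IDs with a fixed-size
// worker pool instead of spawning one goroutine per inference. It returns
// the number of inferences processed.
func runBoundedValidations(inferenceIDs []string, validate func(string)) int64 {
	jobs := make(chan string, len(inferenceIDs))
	var processed int64
	var wg sync.WaitGroup

	// Start a fixed number of workers; memory use is bounded no matter
	// how many validations were missed during downtime.
	for w := 0; w < maxConcurrentValidations; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				validate(id)
				atomic.AddInt64(&processed, 1)
			}
		}()
	}

	for _, id := range inferenceIDs {
		jobs <- id
	}
	close(jobs)
	wg.Wait()
	return processed
}

func main() {
	n := runBoundedValidations([]string{"inf-1", "inf-2", "inf-3"}, func(id string) {})
	fmt.Println(n) // 3
}
```

The buffered channel lets the producer enqueue all IDs without blocking, while the worker count, not the queue length, bounds concurrency.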
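The cross-path mutual exclusion in point 3 amounts to a compare-and-swap guard. `InferenceValidator`, `recoveryRunning`, and `ExecuteRecoveryValidations` follow the names in this PR; the method body here is a minimal sketch, not the real recovery logic.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// InferenceValidator is shared by all three trigger paths (epoch-transition
// dispatcher, startup auto-recovery, admin HTTP endpoint), so one flag on it
// serializes recovery across all of them.
type InferenceValidator struct {
	recoveryRunning atomic.Bool
}

// ExecuteRecoveryValidations proceeds only if no other path is already
// running recovery; CompareAndSwap makes the check-and-set atomic, so two
// concurrent callers cannot both pass the guard.
func (v *InferenceValidator) ExecuteRecoveryValidations() bool {
	if !v.recoveryRunning.CompareAndSwap(false, true) {
		return false // another trigger path is already recovering
	}
	defer v.recoveryRunning.Store(false)
	// ... perform recovery validations ...
	return true
}

func main() {
	v := &InferenceValidator{}
	fmt.Println(v.ExecuteRecoveryValidations()) // true
}
```

Unlike a plain boolean check followed by an assignment, the single `CompareAndSwap` call leaves no window in which two paths can both observe "not running".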
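The timeout fixes in points 4 and 5 come down to constructing `http.Client` values with explicit `Timeout` fields instead of relying on the zero default (no timeout). The client variable names match the PR; the durations are the ones stated above.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// A zero Timeout on http.Client (and the default client behind bare
// http.Post) means requests can block forever, pinning the goroutine.
var (
	// 30s cap for payload retrieval from executors.
	payloadRetrievalClient = &http.Client{Timeout: 30 * time.Second}
	// 5-minute cap for ML node inference calls, which legitimately run long.
	validationHTTPClient = &http.Client{Timeout: 5 * time.Minute}
)

func main() {
	fmt.Println(payloadRetrievalClient.Timeout, validationHTTPClient.Timeout) // 30s 5m0s
}
```

`Client.Timeout` covers the whole request, including connection, redirects, and reading the body, so a hung executor or overloaded ML node can delay a validation goroutine by at most the configured duration.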

@tcharchian tcharchian requested a review from patimen March 2, 2026 23:49