fix: bound inference validation goroutines and add HTTP timeouts #828
Open
ouicate wants to merge 2 commits into gonka-ai:upgrade-v0.2.11 from
Unbounded goroutine fan-out in inference validation could cause memory
exhaustion after validator downtime. Each goroutine blocks for up to
~53 minutes (payload retrieval retries + ML node lock retries) while
holding stack memory, HTTP connections, and file descriptors.

Root causes addressed:

1. Unbounded goroutines in ExecuteRecoveryValidations:
   - Replaced the unconditional go func() per missed inference with a
     bounded worker pool (maxConcurrentValidations=10 workers)
   - Workers consume from a buffered channel, capping concurrent
     goroutines regardless of how many validations were missed

2. Unbounded goroutines in SampleInferenceToValidate:
   - Same bounded worker pool pattern (max 10 workers)
   - Wrapped in a background goroutine to preserve the fire-and-forget
     behavior required by the event handler caller

3. Cross-path mutual exclusion:
   - Added recoveryRunning atomic.Bool on InferenceValidator to
     prevent concurrent recovery executions across all three trigger
     paths (epoch-transition dispatcher, startup auto-recovery, and
     admin HTTP endpoint). All paths share the same *InferenceValidator
     instance (verified via main.go wiring).

4. Missing HTTP timeouts on payload retrieval:
   - payloadRetrievalClient now has a 30s timeout instead of zero
     (a zero timeout means no timeout)
   - Prevents goroutines from hanging forever on unresponsive executors

5. Missing HTTP timeout on ML node inference:
   - Replaced bare http.Post() (no timeout) with validationHTTPClient
     (5-minute timeout) for ML node inference calls
   - Prevents permanent goroutine leaks on slow/overloaded ML nodes