Open
Description
Enhancement Description
Introduce exponential backoff with jitter and a retry limit so that failing controller operations no longer spin in tight retry loops that exhaust CPU and flood the logs.
Background
Failed operations are immediately re-queued, creating tight loops that can consume CPU, spam logs, and hide the root cause. Retry policies need backoff, jitter, and limits.
Scope
Implement exponential backoff with jitter
- Files: internal/daemon/controller/queue.go
- Files: internal/daemon/controller/manager.go
Backoff policy
- Delay doubles on each attempt: 1s, 2s, 4s, 8s, 16s, 32s, then capped at 60s
- Add jitter to avoid thundering herd
- Enforce max retry count
- Mark permanent failures clearly (no further retries)
Improve observability
- Log retry count and next delay
- Log terminal failure state with context
Non-Goals
- Redesigning the controller architecture or queue model
- Changing business logic of controller tasks beyond retry behavior
- Implementing distributed scheduling
Risks and Open Questions
- Avoid over-delaying truly transient errors; tune the backoff parameters carefully
- Ensure backoff does not break time-sensitive operations
- Confirm behavior under concurrent task loads
Validation Plan
Unit and Integration Checks
- go test ./... for controller packages
- Unit tests for backoff computation (including jitter bounds)
- Tests for max retry enforcement and terminal failure behavior
End-to-End Checks
- Run daemon with induced failure and verify CPU remains stable
- Confirm logs show retry scheduling and terminal failures properly
Evidence Required in Issue Updates
- Before/after CPU/log excerpts under induced failure
- Example log lines showing retry delay and attempt count
- Test output verifying backoff schedule and max retries
Acceptance Criteria
- Failed operations retry with backoff and jitter
- Max retry count is enforced
- CPU usage during failure scenarios remains low (target <5% sustained)
- Logs are informative and not spammy
- Terminal failures stop retrying and are clearly marked
Deliverables
- PR implementing backoff + tests
- Notes describing chosen parameters and any config knobs (if added)