[fix] Add bounded exponential backoff and jitter to controller queue retries #117

@kangeunchan

Enhancement Description

Introduce exponential backoff with jitter and a retry limit so that failing controller tasks no longer spin in tight retry loops that exhaust CPU and flood the logs.

Background

Failed operations are currently re-queued immediately, creating tight loops that consume CPU, spam logs, and hide the root cause. The retry policy needs backoff, jitter, and limits.

Scope

Implement exponential backoff with jitter

  • Files: internal/daemon/controller/queue.go
  • Files: internal/daemon/controller/manager.go

Backoff policy

  • Base delay doubles per attempt (1s, 2s, 4s, 8s, 16s, 32s), capped at 60s
  • Add jitter to avoid thundering herd
  • Enforce max retry count
  • Mark permanent failures clearly (no further retries)

Improve observability

  • Log retry count and next delay
  • Log terminal failure state with context

Non-Goals

  • Redesigning the controller architecture or queue model
  • Changing business logic of controller tasks beyond retry behavior
  • Implementing distributed scheduling

Risks and Open Questions

  • Must avoid delaying truly transient errors too much; tune backoff carefully
  • Ensure backoff does not break time-sensitive operations
  • Confirm behavior under concurrent task loads

Validation Plan

Unit and Integration Checks

  • go test ./... for controller packages
  • Unit tests for backoff computation (including jitter bounds)
  • Tests for max retry enforcement and terminal failure behavior

End-to-End Checks

  • Run daemon with induced failure and verify CPU remains stable
  • Confirm logs show retry scheduling and terminal failures properly

Evidence Required in Issue Updates

  • Before/after CPU/log excerpts under induced failure
  • Example log lines showing retry delay and attempt count
  • Test output verifying backoff schedule and max retries

Acceptance Criteria

  • Failed operations retry with backoff and jitter
  • Max retry count is enforced
  • CPU usage during failure scenarios remains low (target <5% sustained)
  • Logs are informative and not spammy
  • Terminal failures stop retrying and are clearly marked

Deliverables

  • PR implementing backoff + tests
  • Notes describing chosen parameters and any config knobs (if added)
