fix(controller): add bounded exponential backoff and jitter to queue retries #128

Open
kangeunchan wants to merge 2 commits into altuslabsxyz:main from kangeunchan:fix/controller-backoff-policy

@kangeunchan
Collaborator

Enhancement Description

Add bounded exponential backoff and jitter to controller retries to prevent hot failure loops and noisy recovery behavior.

Summary

This PR replaces immediate retry loops with controlled delayed retries, retry caps, and terminal handling in daemon controller queue processing.

Previously, repeated reconcile failures could requeue too aggressively:

  • immediate retries increased CPU/log churn
  • persistent failures lacked clear terminal behavior
  • retry observability was limited

Now retries are delayed, bounded, and explicitly tracked.

What Changed

1) Added backoff-aware retry scheduling to the queue

Updated:

  • internal/daemon/controller/queue.go

Includes:

  • exponential backoff
  • jitter
  • max-delay cap
  • max-retry enforcement

2) Integrated policy into manager reconcile flow

Updated:

  • internal/daemon/controller/manager.go

Behavior:

  • reconcile errors use delayed backoff requeue
  • successful reconcile clears retry state
  • max retry exhaustion transitions to terminal handling

3) Improved failure-path observability

  • logs now include retry count and next delay
  • terminal exhaustion is explicitly logged

4) Tests

Added/updated tests:

  • internal/daemon/controller/queue_test.go
  • internal/daemon/controller/manager_test.go

Why This Is Needed

This is a resilience improvement.
Bounded retries reduce hot loops and make failure recovery behavior predictable.

Behavior Notes / Regression Impact

Intentional failure-path behavior change:

  • retries are delayed and bounded
  • persistent failures eventually stop retrying

No change to successful reconcile semantics.

Validation Performed

  • go test ./internal/daemon/controller -count=1
  • go test ./internal/daemon/...
  • go test ./...
  • golangci-lint run ./internal/daemon/controller/...

Signed-off-by: kangeunchan <kangeunchan080310@gmail.com>