Skip to content

fix(controller): requeue reconciliation for readiness gate polling and master election#511

Open
davidcharbonnier wants to merge 1 commit intodragonflydb:mainfrom
davidcharbonnier:fix/readiness-gate-requeue-and-master-election
Open

fix(controller): requeue reconciliation for readiness gate polling and master election#511
davidcharbonnier wants to merge 1 commit intodragonflydb:mainfrom
davidcharbonnier:fix/readiness-gate-requeue-and-master-election

Conversation

@davidcharbonnier
Copy link
Copy Markdown

Fix three related bugs in the pod lifecycle controller where pods were dropped from reconciliation, all stemming from missing requeue returns in the Reconcile method:

  1. (Bug: Readiness gate condition dragonflydb.io/replication-ready never transitions to True after replica sync completes #503) The readiness gate condition dragonflydb.io/replication-ready never transitions to True because the controller exits early when the pod is not yet ready, preventing the readiness gate polling code from ever running.

  2. (enableReplicationReadinessGate: true causes deadlock during master election #497) Master election deadlocks on initial startup because no pod passes the health check (which requires the readiness gate to be satisfied), and the controller does not requeue when no healthy candidate is available.

  3. (Pod left unconfigured (no role, no replication, no readiness gate condition) after rolling restart #508) Pods are left completely unconfigured after rolling restart because the controller does not re-enqueue pods that come up while loading RDB snapshots, missing the initial configuration window.

The fix:

  • Changes the !podReady bail-out to requeue after 5s instead of silently dropping the reconciliation
  • Adds a requeue when no healthy master candidate is found and the triggering pod is not yet ready (breaks the deadlock)
  • Adds a requeue after initial replication config if the triggering pod is still loading

Fixes #503, #497, #508

…d master election

Fix three related bugs in the pod lifecycle controller where pods were
dropped from reconciliation, all stemming from missing requeue returns
in the Reconcile method:

1. (dragonflydb#503) The readiness gate condition `dragonflydb.io/replication-ready`
   never transitions to True because the controller exits early when the
   pod is not yet ready, preventing the readiness gate polling code from
   ever running.

2. (dragonflydb#497) Master election deadlocks on initial startup because no pod
   passes the health check (which requires the readiness gate to be
   satisfied), and the controller does not requeue when no healthy
   candidate is available.

3. (dragonflydb#508) Pods are left completely unconfigured after rolling restart
   because the controller does not re-enqueue pods that come up while
   loading RDB snapshots, missing the initial configuration window.

The fix:
- Changes the `!podReady` bail-out to requeue after 5s instead of
  silently dropping the reconciliation
- Adds a requeue when no healthy master candidate is found and the
  triggering pod is not yet ready (breaks the deadlock)
- Adds a requeue after initial replication config if the triggering
  pod is still loading

Fixes dragonflydb#503, dragonflydb#497, dragonflydb#508

Signed-off-by: David Charbonnier <david.charbonnier@cloudimperiumgames.com>
@davidcharbonnier davidcharbonnier force-pushed the fix/readiness-gate-requeue-and-master-election branch from 7735031 to 10a3e2e Compare April 1, 2026 07:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Readiness gate condition dragonflydb.io/replication-ready never transitions to True after replica sync completes

1 participant