fix(controller): requeue reconciliation for readiness gate polling and master election#511
Open
davidcharbonnier wants to merge 1 commit intodragonflydb:mainfrom
Conversation
…d master election Fix three related bugs in the pod lifecycle controller where pods were dropped from reconciliation, all stemming from missing requeue returns in the Reconcile method: 1. (dragonflydb#503) The readiness gate condition `dragonflydb.io/replication-ready` never transitions to True because the controller exits early when the pod is not yet ready, preventing the readiness gate polling code from ever running. 2. (dragonflydb#497) Master election deadlocks on initial startup because no pod passes the health check (which requires the readiness gate to be satisfied), and the controller does not requeue when no healthy candidate is available. 3. (dragonflydb#508) Pods are left completely unconfigured after rolling restart because the controller does not re-enqueue pods that come up while loading RDB snapshots, missing the initial configuration window. The fix: - Changes the `!podReady` bail-out to requeue after 5s instead of silently dropping the reconciliation - Adds a requeue when no healthy master candidate is found and the triggering pod is not yet ready (breaks the deadlock) - Adds a requeue after initial replication config if the triggering pod is still loading Fixes dragonflydb#503, dragonflydb#497, dragonflydb#508 Signed-off-by: David Charbonnier <david.charbonnier@cloudimperiumgames.com>
7735031 to
10a3e2e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix three related bugs in the pod lifecycle controller where pods were dropped from reconciliation, all stemming from missing requeue returns in the Reconcile method:
(Bug: Readiness gate condition
dragonflydb.io/replication-readynever transitions toTrueafter replica sync completes #503) The readiness gate conditiondragonflydb.io/replication-readynever transitions to True because the controller exits early when the pod is not yet ready, preventing the readiness gate polling code from ever running.(enableReplicationReadinessGate: true causes deadlock during master election #497) Master election deadlocks on initial startup because no pod passes the health check (which requires the readiness gate to be satisfied), and the controller does not requeue when no healthy candidate is available.
(Pod left unconfigured (no role, no replication, no readiness gate condition) after rolling restart #508) Pods are left completely unconfigured after rolling restart because the controller does not re-enqueue pods that come up while loading RDB snapshots, missing the initial configuration window.
The fix:
!podReadybail-out to requeue after 5s instead of silently dropping the reconciliationFixes #503, #497, #508