feat: replica-only snapshots, per-pod services, DragonflyCluster CRD, PodLifecycle fix#481
Open
bocharov wants to merge 6 commits into dragonflydb:main from
Conversation
fix(controller): requeue on no-healthy-pod and pod-not-ready in PodLifecycle
When getHealthyPod() finds no healthy pod (e.g. all pods still loading
data from S3 snapshots), the PodLifecycle controller returned
ctrl.Result{}, nil — silently dropping the event with no requeue. If no
future pod events arrive (pods already stable/Running), the controller
never retries and master election never completes.
Similarly, when a pod is not ready yet, the controller drops the event
without requeuing. During rolling updates, the DragonflyReconciler's
allPodsHealthyAndHaveRole() waits forever for a role label that never
gets set, causing a deadlock.
Fix both cases by requeuing after 5 seconds.
feat(snapshot): add enableOnReplicaOnly mode with staggerInterval

Add a new snapshot mode that offloads snapshot I/O from master to replicas, preventing snapshot serialization from blocking write-path latency on the master.

New API fields on the Snapshot spec:
- enableOnReplicaOnly: when true, only replicas run snapshot_cron; the master never saves. On master restart it loads the latest replica snapshot from S3, then re-syncs via Dragonfly replication.
- staggerInterval: staggers snapshot schedules across replicas so they don't all snapshot at the same moment. Each replica's cron is offset by (rank * interval) from the base cron schedule.

Controller changes:
- Two-pass reconciliation in checkAndConfigureReplicas: pass 1 handles masters and unassigned pods; pass 2 processes replicas in sorted name order so each gets a stable rank for snapshot_cron staggering.
- ensureMasterSnapshotCron / ensureReplicaSnapshotCron: defensive checks that correct snapshot_cron drift on every reconciliation (guards against transient CONFIG SET failures or operator restarts).
- replicaOf: when enableOnReplicaOnly is set, defer snapshot_cron assignment to ensureReplicaSnapshotCron (the rank is not yet known at SLAVE OF time).
- replicaOfNoOne: clear snapshot_cron on the master in replica-only mode.
- replTakeover: update snapshot_cron when roles switch (re-enable on the new master for enableOnMasterOnly; clear for enableOnReplicaOnly).

Resource generation:
- Skip the --snapshot_cron container arg when enableOnMasterOnly or enableOnReplicaOnly is set; the operator sets it dynamically via CONFIG SET to eliminate the startup window where pods could snapshot before the operator configures them.

Validation:
- enableOnMasterOnly and enableOnReplicaOnly are mutually exclusive; staggerInterval requires enableOnReplicaOnly; enableOnReplicaOnly requires at least 2 replicas.

Includes unit tests for staggerCron and replicaCronForRank.
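The (rank * interval) staggering described above can be sketched as a pure function. This is a hypothetical sketch, assuming the simplest case where the base cron's minute field is a fixed number; the function name mirrors the `replicaCronForRank` mentioned in the tests, but the operator's real implementation may handle richer cron syntax:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// replicaCronForRank offsets the minute field of a 5-field cron
// expression by rank*staggerMinutes, wrapping at 60, so each replica
// snapshots at a different minute. Sketch only: fields like "*/5" or
// lists in the minute position are not handled here.
func replicaCronForRank(baseCron string, rank, staggerMinutes int) (string, error) {
	fields := strings.Fields(baseCron)
	if len(fields) != 5 {
		return "", fmt.Errorf("expected 5-field cron, got %q", baseCron)
	}
	minute, err := strconv.Atoi(fields[0])
	if err != nil {
		return "", fmt.Errorf("only fixed minute fields supported in this sketch: %w", err)
	}
	fields[0] = strconv.Itoa((minute + rank*staggerMinutes) % 60)
	return strings.Join(fields, " "), nil
}

func main() {
	// Base schedule "0 */4 * * *" with a 5-minute stagger interval.
	for rank := 0; rank < 3; rank++ {
		cron, _ := replicaCronForRank("0 */4 * * *", rank, 5)
		fmt.Println(cron)
	}
	// Output:
	// 0 */4 * * *
	// 5 */4 * * *
	// 10 */4 * * *
}
```

Sorting replicas by pod name before assigning ranks (the two-pass reconciliation above) is what keeps each replica's offset stable across reconciliations.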
feat(resources): add per-pod ClusterIP services and type transition handling

Add per-pod ClusterIP services for cross-cluster routing:
- Each Dragonfly pod gets its own ClusterIP service named after the pod (e.g. df-0, df-1) using the statefulset.kubernetes.io/pod-name label selector. ClusterIPs are routable cross-cluster, unlike pod IPs, making them suitable for CLUSTER SLOTS responses and cross-cluster clients.

Handle headless <-> ClusterIP service type transitions:
- spec.clusterIP is an immutable field in Kubernetes. When the desired and existing service disagree on headless vs ClusterIP (e.g. migrating from headless to per-pod ClusterIP services), the operator now detects this and does a delete+recreate instead of failing on update.
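The two behaviors above reduce to small decisions. A minimal sketch with hypothetical helper names (the operator's actual code works on `corev1.Service` objects rather than raw strings): a per-pod Service selects exactly one pod via the label the StatefulSet controller stamps on it, and a recreate is needed only when the headless/ClusterIP shape flips, since `spec.clusterIP` cannot be updated in place:

```go
package main

import "fmt"

// perPodSelector returns the label selector a per-pod ClusterIP Service
// would use to target exactly one StatefulSet pod. The label key is set
// automatically by the StatefulSet controller on every pod it creates.
func perPodSelector(podName string) map[string]string {
	return map[string]string{"statefulset.kubernetes.io/pod-name": podName}
}

// isHeadless reports whether a Service's clusterIP denotes a headless
// Service (spec.clusterIP == "None").
func isHeadless(clusterIP string) bool { return clusterIP == "None" }

// needsRecreate reports whether reconciling the desired Service onto the
// existing one requires delete+recreate: spec.clusterIP is immutable, so
// flipping between headless and ClusterIP cannot be done with an update.
func needsRecreate(existingClusterIP, desiredClusterIP string) bool {
	return isHeadless(existingClusterIP) != isHeadless(desiredClusterIP)
}

func main() {
	fmt.Println(perPodSelector("df-0"))        // selects only pod df-0
	fmt.Println(needsRecreate("None", ""))     // headless -> ClusterIP: recreate
	fmt.Println(needsRecreate("10.0.0.5", "")) // ClusterIP stays: normal update
}
```

Comparing the headless/non-headless shape, rather than the allocated IP itself, avoids spurious recreates when the desired spec leaves `clusterIP` empty for the API server to fill in.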
feat(controller): add DragonflyCluster CRD and controller

Add a DragonflyCluster CRD and controller to manage Dragonfly cluster mode (multi-shard + replicas), including slot allocation and scale-out rebalancing.

DragonflyCluster API:
- spec.shards: desired number of primary/master shards
- spec.replicasPerShard: replicas per shard (excluding the master)
- spec.template: DragonflySpec applied to each shard
- spec.rebalance: controls automatic slot rebalancing on scale-out
- status tracks per-shard slot ranges, conditions, and active migrations

Controller features:
- Provisions per-shard Dragonfly CRs with cluster_mode=yes
- Assigns stable cluster node IDs (UUIDs) per pod
- Builds and pushes DFLYCLUSTER CONFIG to all shard masters
- Implements slot migration via DFLYCLUSTER SLOT-MIGRATION-STATUS
- Per-shard snapshot dir to avoid S3 filename collisions
- Configurable service DNS suffix via the DRAGONFLY_CLUSTER_SERVICE_SUFFIX env var for cross-cluster routing
- Tolerates unready replicas during topology collection (only masters are required to be ready)
- Advertises per-pod ClusterIP service DNS in the cluster config for cross-cluster client compatibility
- Better error wrapping throughout for debuggability

Also includes:
- RBAC rules for dragonflyclusters and dragonflyclusters/status
- CRD kustomization and sample manifests
- DragonflyCluster controller registration in cmd/main.go
- DeepCopy methods for all new types
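Slot allocation across shards amounts to splitting the Redis-cluster-compatible 16384-slot keyspace into contiguous per-shard ranges. The sketch below assumes a fresh allocation with an even split (remainder spread over the first shards); the controller's real allocator also has to preserve existing assignments and migrate slots on scale-out, so treat the strategy here as illustrative:

```go
package main

import "fmt"

const totalSlots = 16384 // Redis-cluster-compatible slot count

// slotRange is a contiguous [Start, End] slot assignment for one shard.
type slotRange struct{ Start, End int }

// allocateSlots splits the keyspace into contiguous ranges, one per
// shard, distributing the remainder across the first shards so sizes
// differ by at most one slot.
func allocateSlots(shards int) []slotRange {
	ranges := make([]slotRange, 0, shards)
	base, rem := totalSlots/shards, totalSlots%shards
	start := 0
	for i := 0; i < shards; i++ {
		size := base
		if i < rem {
			size++ // first `rem` shards absorb one extra slot each
		}
		ranges = append(ranges, slotRange{Start: start, End: start + size - 1})
		start += size
	}
	return ranges
}

func main() {
	fmt.Println(allocateSlots(3)) // [{0 5461} {5462 10922} {10923 16383}]
}
```

These ranges are what a controller would push to each shard master via DFLYCLUSTER CONFIG and record in status as per-shard slot ranges.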
- Regenerate config/crd/bases/ CRDs via controller-gen to include new
enableOnReplicaOnly and staggerInterval fields in the Dragonfly CRD
and the full DragonflyCluster CRD.
- Regenerate manifests/crd.yaml and manifests/dragonfly-operator.yaml
via kustomize to include both CRDs and updated RBAC.
- Add e2e tests for snapshot configuration validation:
- Mutual exclusivity of enableOnMasterOnly and enableOnReplicaOnly
- enableOnReplicaOnly: verifies --snapshot_cron is NOT in StatefulSet
container args (operator sets it dynamically via CONFIG SET)
- enableOnMasterOnly: same verification
Update README.md with documentation for the new features:
- Snapshot modes: enableOnMasterOnly and enableOnReplicaOnly with staggerInterval, including YAML examples and constraints
- DragonflyCluster CRD for multi-shard cluster mode with slot allocation, rebalancing, and cross-cluster routing configuration
- Per-pod ClusterIP services for cross-cluster DNS routing
- Updated feature list in the introduction
shitaoxai approved these changes on Mar 15, 2026

Contributor:
Hi @bocharov - thanks for contributing! Can you please split this PR into 4 PRs, one for each of the issues/features you mentioned? It will be easier to review and discuss each one separately. Thanks!
Contributor:
Hi @bocharov - also curious about "battle-tested in production": how long have you been running Dragonfly in production, and what use case made you require a multi-sharded cluster?
Summary
This PR adds several features and a bug fix developed and battle-tested in production:
1. fix(controller): requeue on no-healthy-pod and pod-not-ready in PodLifecycle
When `getHealthyPod()` finds no healthy pod (e.g. all pods still loading data from S3 snapshots), the PodLifecycle controller returned `ctrl.Result{}, nil` — silently dropping the event with no requeue. If no future pod events arrive, the controller never retries and master election never completes.

Similarly, when a pod is not ready yet, the controller drops the event without requeuing. During rolling updates, `allPodsHealthyAndHaveRole()` waits forever for a role label that never gets set, causing a deadlock.

Fix: requeue after 5 seconds in both cases.
2. feat(snapshot): add enableOnReplicaOnly mode with staggerInterval

Adds a new snapshot mode that offloads snapshot I/O from master to replicas, preventing snapshot serialization from blocking write-path latency on the master.

New API fields on the Snapshot spec:
- `enableOnReplicaOnly` — when true, only replicas run `snapshot_cron`; the master never saves. On master restart it loads the latest replica snapshot from S3, then re-syncs via Dragonfly replication.
- `staggerInterval` — staggers snapshot schedules across replicas so they do not all snapshot at the same moment. Each replica's cron is offset by `(rank × interval)` from the base cron schedule.

Controller changes:
- Two-pass reconciliation in `checkAndConfigureReplicas`: pass 1 handles masters and unassigned pods; pass 2 processes replicas in sorted name order so each gets a stable rank for `snapshot_cron` staggering.
- `ensureMasterSnapshotCron` / `ensureReplicaSnapshotCron`: defensive checks that correct `snapshot_cron` drift on every reconciliation (guards against transient `CONFIG SET` failures or operator restarts).
- `replicaOf`: when `enableOnReplicaOnly` is set, defer `snapshot_cron` assignment to `ensureReplicaSnapshotCron` (rank not yet known at `SLAVE OF` time).
- `replicaOfNoOne`: clear `snapshot_cron` on the master in replica-only mode.
- `replTakeover`: update `snapshot_cron` when roles switch.

Resource generation:
- Skip the `--snapshot_cron` container arg when `enableOnMasterOnly` or `enableOnReplicaOnly` is set; the operator sets it dynamically via `CONFIG SET` to eliminate the startup window where pods could snapshot before the operator configures them.

Validation: `enableOnMasterOnly` and `enableOnReplicaOnly` are mutually exclusive; `staggerInterval` requires `enableOnReplicaOnly`; `enableOnReplicaOnly` requires ≥ 2 replicas.

Tests: unit tests for `staggerCron` and `replicaCronForRank`, e2e tests for snapshot configuration validation.
3. feat(resources): add per-pod ClusterIP services and type transition handling
- Each Dragonfly pod gets its own ClusterIP service named after the pod (e.g. `df-0`, `df-1`) using the `statefulset.kubernetes.io/pod-name` label selector. ClusterIPs are routable cross-cluster, unlike pod IPs, making them suitable for `CLUSTER SLOTS` responses and cross-cluster clients.
- `spec.clusterIP` is immutable in Kubernetes, so when the desired and existing service disagree on headless vs ClusterIP, the operator detects the mismatch and does a delete+recreate instead of failing on update.
4. feat(controller): add DragonflyCluster CRD and controller
Adds a `DragonflyCluster` CRD and controller to manage Dragonfly cluster mode (multi-shard + replicas), including slot allocation and scale-out rebalancing.

DragonflyCluster API:
- `spec.shards`: desired number of primary/master shards
- `spec.replicasPerShard`: replicas per shard (excluding the master)
- `spec.template`: `DragonflySpec` applied to each shard
- `spec.rebalance`: controls automatic slot rebalancing on scale-out

Controller features:
- Provisions per-shard Dragonfly CRs with `cluster_mode=yes`
- Builds and pushes `DFLYCLUSTER CONFIG` to all shard masters
- Implements slot migration via `DFLYCLUSTER SLOT-MIGRATION-STATUS`
- Configurable service DNS suffix via the `DRAGONFLY_CLUSTER_SERVICE_SUFFIX` env var

Also includes RBAC rules, CRD manifests, sample YAMLs, and README documentation.