
feat: replica-only snapshots, per-pod services, DragonflyCluster CRD, PodLifecycle fix #481

Open
bocharov wants to merge 6 commits into dragonflydb:main from bocharov:feat/snapshot-cluster-improvements

Conversation

@bocharov

Summary

This PR adds several features and a bug fix developed and battle-tested in production:

  1. PodLifecycle deadlock fix — requeue on no-healthy-pod and pod-not-ready
  2. Replica-only snapshot mode with staggered cron scheduling
  3. Per-pod ClusterIP services for cross-cluster routing
  4. DragonflyCluster CRD and controller for multi-shard cluster mode

1. fix(controller): requeue on no-healthy-pod and pod-not-ready in PodLifecycle

When getHealthyPod() finds no healthy pod (e.g. all pods still loading data from S3
snapshots), the PodLifecycle controller returned ctrl.Result{}, nil — silently
dropping the event with no requeue. If no future pod events arrive, the controller
never retries and master election never completes.

Similarly, when a pod is not ready yet, the controller drops the event without
requeuing. During rolling updates, allPodsHealthyAndHaveRole() waits forever for a
role label that never gets set, causing a deadlock.

Fix: requeue after 5 seconds in both cases.

2. feat(snapshot): add enableOnReplicaOnly mode with staggerInterval

Adds a new snapshot mode that offloads snapshot I/O from master to replicas,
preventing snapshot serialization from blocking write-path latency on the master.

New API fields on the Snapshot spec:

  • enableOnReplicaOnly — when true, only replicas run snapshot_cron; the master
    never saves. On master restart it loads the latest replica snapshot from S3 then
    re-syncs via Dragonfly replication.
  • staggerInterval — staggers snapshot schedules across replicas so they do not
    all snapshot at the same moment. Each replica's cron is offset by
    (rank × interval) from the base Cron schedule.
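The rank-based offset can be illustrated with a small helper. This is a hypothetical sketch of what a `staggerCron`-style function might do (the PR mentions unit tests for `staggerCron` and `replicaCronForRank`, but the actual implementation is not shown here); it only handles a plain numeric minute field and wraps at 60 without carrying into the hour field, which a full implementation would need to handle.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// staggerCron shifts the minute field of a 5-field cron expression by
// rank*intervalMinutes, wrapping at 60. Illustrative only: it does not
// carry overflow into the hour field or handle non-numeric minute fields.
func staggerCron(baseCron string, rank, intervalMinutes int) (string, error) {
	fields := strings.Fields(baseCron)
	if len(fields) != 5 {
		return "", fmt.Errorf("expected 5 cron fields, got %d", len(fields))
	}
	minute, err := strconv.Atoi(fields[0])
	if err != nil {
		return "", fmt.Errorf("minute field %q is not a plain number: %w", fields[0], err)
	}
	fields[0] = strconv.Itoa((minute + rank*intervalMinutes) % 60)
	return strings.Join(fields, " "), nil
}

func main() {
	// Base schedule 03:00 daily with a 10-minute stagger:
	// replica 0 saves at :00, replica 1 at :10, replica 2 at :20.
	for rank := 0; rank < 3; rank++ {
		cron, _ := staggerCron("0 3 * * *", rank, 10)
		fmt.Println(cron)
	}
}
```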

Controller changes:

  • Two-pass reconciliation in checkAndConfigureReplicas: pass 1 handles masters
    and unassigned pods; pass 2 processes replicas in sorted name order so each gets
    a stable rank for snapshot_cron staggering.
  • ensureMasterSnapshotCron / ensureReplicaSnapshotCron: defensive checks that
    correct snapshot_cron drift on every reconciliation (guards against transient
    CONFIG SET failures or operator restarts).
  • replicaOf: when enableOnReplicaOnly, defer snapshot_cron assignment to
    ensureReplicaSnapshotCron (rank not yet known at SLAVE OF time).
  • replicaOfNoOne: clear snapshot_cron on master in replica-only mode.
  • replTakeover: update snapshot_cron when roles switch (re-enable on the
    new master for enableOnMasterOnly; clear it for enableOnReplicaOnly).

Resource generation:

  • Skip --snapshot_cron container arg when enableOnMasterOnly or
    enableOnReplicaOnly is set; the operator sets it dynamically via CONFIG SET
    to eliminate the startup window where pods could snapshot before the operator
    configures them.
  • Validation: mutual exclusivity, staggerInterval requires enableOnReplicaOnly,
    enableOnReplicaOnly requires ≥ 2 replicas.
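The three validation rules above are straightforward to encode. A minimal sketch, with a stand-in `snapshotSpec` type whose field names follow the PR's API fields (the operator's actual types and validation hook are not reproduced here):

```go
package main

import (
	"errors"
	"fmt"
)

// snapshotSpec mirrors only the fields relevant to the described validation.
type snapshotSpec struct {
	EnableOnMasterOnly  bool
	EnableOnReplicaOnly bool
	StaggerInterval     int // minutes; 0 means unset
}

// validateSnapshot encodes the three rules: mutual exclusivity,
// staggerInterval requires enableOnReplicaOnly, and replica-only mode
// needs at least 2 replicas (otherwise nothing would ever snapshot).
func validateSnapshot(s snapshotSpec, replicas int) error {
	if s.EnableOnMasterOnly && s.EnableOnReplicaOnly {
		return errors.New("enableOnMasterOnly and enableOnReplicaOnly are mutually exclusive")
	}
	if s.StaggerInterval > 0 && !s.EnableOnReplicaOnly {
		return errors.New("staggerInterval requires enableOnReplicaOnly")
	}
	if s.EnableOnReplicaOnly && replicas < 2 {
		return errors.New("enableOnReplicaOnly requires at least 2 replicas")
	}
	return nil
}

func main() {
	fmt.Println(validateSnapshot(snapshotSpec{EnableOnReplicaOnly: true, StaggerInterval: 10}, 3))
	fmt.Println(validateSnapshot(snapshotSpec{EnableOnMasterOnly: true, EnableOnReplicaOnly: true}, 3))
}
```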

Tests: unit tests for staggerCron and replicaCronForRank, e2e tests for
snapshot configuration validation.

3. feat(resources): add per-pod ClusterIP services and type transition handling

  • Each Dragonfly pod gets its own ClusterIP service named after the pod (e.g.
    df-0, df-1) using the statefulset.kubernetes.io/pod-name label selector.
    ClusterIPs, unlike pod IPs, are routable cross-cluster, which makes them
    suitable for CLUSTER SLOTS responses and cross-cluster clients.
  • Handle headless ↔ ClusterIP service type transitions: spec.clusterIP is
    immutable in Kubernetes, so the operator detects the mismatch and does a
    delete+recreate instead of failing on update.
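The headless vs. ClusterIP detection reduces to comparing whether each side is headless (in Kubernetes, a headless service has `spec.clusterIP` set to `"None"`). A self-contained sketch with a stand-in `serviceSpec` type in place of `corev1.ServiceSpec`; `needsRecreate` is a hypothetical name, not the operator's actual function:

```go
package main

import "fmt"

// serviceSpec stands in for the relevant corev1.ServiceSpec field.
type serviceSpec struct {
	ClusterIP string // "None" for a headless service
}

// needsRecreate reports whether the desired and existing services disagree
// on headless vs ClusterIP. Because spec.clusterIP is immutable, such a
// mismatch cannot be fixed by an update; the service must be deleted and
// recreated.
func needsRecreate(desired, existing serviceSpec) bool {
	desiredHeadless := desired.ClusterIP == "None"
	existingHeadless := existing.ClusterIP == "None"
	return desiredHeadless != existingHeadless
}

func main() {
	// Migrating from headless to a per-pod ClusterIP service: recreate.
	fmt.Println(needsRecreate(serviceSpec{ClusterIP: ""}, serviceSpec{ClusterIP: "None"}))
	// Both are normal ClusterIP services: a plain update suffices.
	fmt.Println(needsRecreate(serviceSpec{ClusterIP: ""}, serviceSpec{ClusterIP: "10.0.0.5"}))
}
```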

4. feat(controller): add DragonflyCluster CRD and controller

Adds a DragonflyCluster CRD and controller to manage Dragonfly cluster-mode
(multi-shard + replicas), including slot allocation and scale-out rebalancing.

DragonflyCluster API:

  • spec.shards: desired number of primary/master shards
  • spec.replicasPerShard: replicas per shard (excluding master)
  • spec.template: DragonflySpec applied to each shard
  • spec.rebalance: controls automatic slot rebalancing on scale-out
  • status: tracks per-shard slot ranges, conditions, and active migrations
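A hypothetical manifest sketch showing how these fields might fit together. The field names come from the PR description, but the `apiVersion`, the shape of `rebalance`, and the `image` value are assumptions, not taken from the PR:

```yaml
# Illustrative only: apiVersion and the rebalance field shape are assumed.
apiVersion: dragonflydb.io/v1alpha1
kind: DragonflyCluster
metadata:
  name: example-cluster
spec:
  shards: 3              # three primary/master shards
  replicasPerShard: 2    # two replicas per shard, excluding the master
  template:              # DragonflySpec applied to each shard
    image: docker.dragonflydb.io/dragonflydb/dragonfly:latest
  rebalance:
    enabled: true        # automatic slot rebalancing on scale-out
```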

Controller features:

  • Provisions per-shard Dragonfly CRs with cluster_mode=yes
  • Assigns stable cluster node IDs (UUIDs) per pod
  • Builds and pushes DFLYCLUSTER CONFIG to all shard masters
  • Implements slot migration via DFLYCLUSTER SLOT-MIGRATION-STATUS
  • Per-shard snapshot dir to avoid S3 filename collisions
  • Configurable service DNS suffix via DRAGONFLY_CLUSTER_SERVICE_SUFFIX env var
  • Tolerates unready replicas during topology collection (only masters
    are required to be ready)
  • Advertises per-pod ClusterIP service DNS in cluster config for
    cross-cluster client compatibility
  • Improves error wrapping throughout for debuggability
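Dragonfly's cluster mode uses the Redis-compatible 16384-slot hash space, so slot allocation across shards amounts to partitioning the range 0..16383. The following is an illustrative even split, one contiguous range per shard; the operator's actual allocation and rebalancing logic may differ:

```go
package main

import "fmt"

const totalSlots = 16384 // Redis-compatible hash-slot space (0..16383)

type slotRange struct{ Start, End int }

// allocateSlots evenly partitions the slot space across `shards` masters,
// giving one contiguous range per shard and spreading the remainder over
// the first shards so every slot is owned exactly once.
func allocateSlots(shards int) []slotRange {
	ranges := make([]slotRange, shards)
	per, rem := totalSlots/shards, totalSlots%shards
	start := 0
	for i := 0; i < shards; i++ {
		size := per
		if i < rem {
			size++ // first `rem` shards absorb the remainder
		}
		ranges[i] = slotRange{Start: start, End: start + size - 1}
		start += size
	}
	return ranges
}

func main() {
	for i, r := range allocateSlots(3) {
		fmt.Printf("shard %d: slots %d-%d\n", i, r.Start, r.End)
	}
}
```

On scale-out, the rebalancer would compute a new partition like this and migrate the slots that changed owner, which is where the DFLYCLUSTER SLOT-MIGRATION-STATUS polling described above comes in.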

Also includes RBAC rules for dragonflyclusters and dragonflyclusters/status,
CRD manifests regenerated via controller-gen and kustomize (config/crd/bases/,
manifests/crd.yaml, manifests/dragonfly-operator.yaml), kustomization and
sample YAMLs, DragonflyCluster controller registration in cmd/main.go,
DeepCopy methods for all new types, e2e tests for snapshot configuration
validation, and README documentation for all new features.

@ashotland
Contributor

Hi @bocharov - thanks for contributing!

Can you please split this PR into 4 PRs, one for each of the issues/features you mentioned?

It will be easier to review and discuss each one separately.

Thanks!

@ashotland
Contributor

Hi @bocharov - also curious about "battle-tested in production": for how long have you been running Dragonfly in production?

What is the use case that made you require a multi-sharded cluster?
