Redis backups #269

Merged
merged 43 commits into from
Sep 28, 2023

Conversation

roivaz
Member

@roivaz roivaz commented Sep 18, 2023

This PR adds Redis backups to the saas-operator as a new API with its own controller. Below is an example of the custom resource, including its status:

apiVersion: saas.3scale.net/v1alpha1
kind: ShardedRedisBackup
metadata:
  finalizers:
    - saas.3scale.net
  name: backup
  namespace: default
spec:
  dbFile: /data/dump.rdb
  historyLimit: 2
  pause: false
  pollInterval: 10s
  s3Options:
    bucket: my-bucket
    credentialsSecretRef:
      name: aws-credentials
    path: backups
    region: us-east-1
    serviceEndpoint: http://minio.default.svc.cluster.local:9000
  schedule: '* * * * *'
  sentinelRef: sentinel
  sshOptions:
    port: 2222
    privateKeySecretRef:
      name: redis-backup-ssh-private-key
    sudo: true
    user: docker
  timeout: 5m0s
status:
  backups:
    - message: backup scheduled
      scheduledFor: "2023-09-20T08:33:00Z"
      shard: shard02
      state: Pending
    - message: backup scheduled
      scheduledFor: "2023-09-20T08:33:00Z"
      shard: shard01
      state: Pending
    - backupFile: s3://my-bucket/backups/redis-backup_shard02_2023-09-20T08:32:01Z.rdb.gz
      backupSize: 234
      finishedAt: "2023-09-20T08:32:12Z"
      message: backup complete
      scheduledFor: "2023-09-20T08:32:00Z"
      serverAlias: redis-shard-shard02-1
      serverID: 10.244.0.11:6379
      shard: shard02
      startedAt: "2023-09-20T08:32:01Z"
      state: Completed
    - backupFile: s3://my-bucket/backups/redis-backup_shard01_2023-09-20T08:32:01Z.rdb.gz
      backupSize: 232
      finishedAt: "2023-09-20T08:32:12Z"
      message: backup complete
      scheduledFor: "2023-09-20T08:32:00Z"
      serverAlias: redis-shard-shard01-1
      serverID: 10.244.0.8:6379
      shard: shard01
      startedAt: "2023-09-20T08:32:01Z"
      state: Completed

Features:

  • Backups are triggered at the same time for all Redis shards. This does not guarantee perfect consistency across shards, but it is a clear improvement over the current setup.
  • Backups are executed entirely from the operator. The Redis instance only needs the ssh, gzip and awscli tools installed. An ssh user and private key must be provided to access the instance.
  • The S3 service endpoint can be set in the custom resource to point to a MinIO instance or any other S3-compatible API.
  • The following metrics are published for backups: size, failure_count, success_count, duration.
  • Backup status is stored in the custom resource status. Errors are reported in the logs, in the status and in the metrics, which greatly improves backup observability.
  • An automated e2e test verifies that backups work.
  • The BGSAVE command is executed directly instead of being scheduled (as we have done until now), which removes the need to manually configure the RO slave in each shard.
  • The controller automatically finds the RO slave in each shard by inspecting the Sentinel status.
  • Backups can be paused using the pause flag in the custom resource.

Limitations:

  • Each backup runs in its own goroutine. The problem with this approach is that anything that kills or terminates the saas-operator pod instantly kills all running backups. I decided not to wait for running backups to finish before the controller shuts down, as this could mean waiting for a long time. I don't think this limitation has a huge impact: the saas-operator is very stable in production, and if a backup is terminated early, the next one will trigger soon enough. The way to overcome this limitation would be to run each backup in a Job, but this greatly complicates things, especially reporting the backup status back to the controller.

/kind feature
/priority important-soon
/assign

@3scale-robot 3scale-robot added needs-kind Indicates a PR or issue lacks a `kind/foo` label and requires one. needs-priority Indicates a PR or issue lacks a `priority/foo` label and requires one. needs-size Indicates a PR or issue lacks a `size/foo` label and requires one. size/XL Requires about a week to complete the PR or the issue. and removed needs-size Indicates a PR or issue lacks a `size/foo` label and requires one. labels Sep 18, 2023
@roivaz roivaz added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 20, 2023
@3scale-robot 3scale-robot removed the needs-kind Indicates a PR or issue lacks a `kind/foo` label and requires one. label Sep 20, 2023
@roivaz roivaz added needs-kind Indicates a PR or issue lacks a `kind/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next sprint. labels Sep 20, 2023
@3scale-robot 3scale-robot removed the needs-priority Indicates a PR or issue lacks a `priority/foo` label and requires one. label Sep 20, 2023
@roivaz
Member Author

roivaz commented Sep 27, 2023

@slopezz @raelga the final commits fixing some bugs are in place and have been tested in the staging environment:

  • fixed memory issue (not closing ssh connections properly)
  • fixed cleanup of redis backup files in redis instances
  • added some validation to the required Secrets that the backup uses
  • improved the AWS client configuration (extracted functionality to util package)
  • fixed a minor problem with the schedule of backups when the schedule is changed

@3scale-robot 3scale-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 28, 2023
@3scale-robot
Contributor

LGTM label has been added.

Git tree hash: 69390d9c0adb6672aa8847e235d4caddd6c565e6

@slopezz
Member

slopezz commented Sep 28, 2023

/approve

@3scale-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: slopezz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@3scale-robot 3scale-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 28, 2023
@3scale-robot 3scale-robot merged commit e1c07ad into main Sep 28, 2023
@3scale-robot 3scale-robot deleted the redis-backups branch September 28, 2023 09:46
@roivaz
Member Author

roivaz commented Nov 3, 2023

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. needs-kind Indicates a PR or issue lacks a `kind/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next sprint. size/XL Requires about a week to complete the PR or the issue.

4 participants