Feature Request: Add Persistence (with Backup and Restore) for Vault and Workflow Platform Components
Summary
Add production-grade persistence and backup/restore capabilities for:
- The secrets platform (Vault)
- The workflow platform components (e.g. workflow metadata database, controller state, configuration)
so that:
- Secrets and workflow state survive node restarts, upgrades, and cluster failures
- We can recover from accidental data loss and migrate clusters safely
- The platform is aligned with standard DevOps / SRE practices for durability and DR
Background / Problem
Right now, the stack assumes ephemeral or best-effort storage for:
- Vault data (running in a dev-style configuration or on non-backed storage)
- Workflow metadata (e.g. workflow executions, logs, and configuration)
- Per-component configuration that should be backed up (ConfigMaps, Secrets, CRDs)
This is acceptable for PoCs, but it is not sufficient for:
- Long-running multi-tenant environments
- Regulated or audited research settings
- Disaster recovery (cluster loss, accidental deletion, etc.)
We need first-class, documented persistence and backup/restore flows for both Vault and the workflow components.
Goals
- Persist Vault state on a durable backend suitable for production.
- Persist workflow platform state (e.g. metadata DB, application configuration).
- Provide a documented backup and restore process for:
  - Vault (secrets, policies, tokens)
  - Workflow metadata and configuration (e.g. database, CRDs, ConfigMaps, Secrets)
- Make everything configurable via values, with reasonable defaults and the ability to plug into existing infra (e.g. external databases, S3).
Non-Goals
- Designing a full enterprise DR strategy for the entire organization (this feature is scoped to this stack).
- Implementing proprietary backup tooling; we will integrate with existing ecosystem tools (e.g. Velero, snapshots, Vault’s snapshot APIs).
Proposal
1. Vault Persistence
a. Storage backend
Support a configurable storage backend for Vault, for example:
- Default (and recommended): Integrated Storage (Raft) with PVCs
- Optional: external storage (Consul, Postgres, etc.) via values
Values example:

    vault:
      enabled: true
      storage:
        backend: raft # or consul, postgres, etc.
        raft:
          storageClass: gp2
          accessModes: ["ReadWriteOnce"]
          size: 20Gi

b. Persistent Volumes
- Each Vault pod (or Raft node) gets a PersistentVolumeClaim.
- PVCs use a configurable `storageClass` and size.
- No more emptyDir or node-local ephemeral storage for Vault data in production mode.
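
As a rough illustration of what the chart could render for the Raft default, the fragment below shows a StatefulSet `volumeClaimTemplates` entry built from the values above. The claim name `data` is an assumption for illustration, not a name taken from the existing chart.

```yaml
# Fragment of a Vault StatefulSet spec (sketch only).
# Each replica gets its own PVC created from this template.
volumeClaimTemplates:
  - metadata:
      name: data                       # assumed claim name
    spec:
      storageClassName: gp2            # from vault.storage.raft.storageClass
      accessModes: ["ReadWriteOnce"]   # from vault.storage.raft.accessModes
      resources:
        requests:
          storage: 20Gi                # from vault.storage.raft.size
```

Because the claims come from a template, deleting a pod leaves its PVC in place, which is what lets Raft data survive restarts.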
c. Unseal / Auto-unseal
- Support integration with a KMS provider (e.g. AWS KMS, GCP KMS) for auto-unseal.
- Document how to keep unseal keys and root tokens separate from Vault data backups, so that a leaked backup alone cannot be used to open Vault.
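
A partial sketch of how Raft storage and AWS KMS auto-unseal might be combined in the Vault server configuration, shipped here as a ConfigMap. The ConfigMap name, data path, region, and key ID are placeholders, and the listener and other required stanzas are left out.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vault-config                 # hypothetical name
data:
  vault.hcl: |
    # Integrated Storage (Raft) on the PVC-backed data directory
    storage "raft" {
      path = "/vault/data"           # assumed mount path of the PVC
    }

    # Auto-unseal through AWS KMS; the key lives outside Vault and its backups
    seal "awskms" {
      region     = "us-east-1"                # placeholder
      kms_key_id = "REPLACE_WITH_KMS_KEY_ID"  # placeholder
    }
```

With a KMS seal, initialization yields recovery keys rather than unseal keys; those still need to be stored away from the snapshot destination.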
2. Workflow Platform Persistence
a. Metadata Database
Support a durable metadata store for workflow platform components, for example:
- Default: in-cluster Postgres with a PVC-backed data directory.
- Optional: external Postgres endpoint via values.
Values example:

    workflows:
      metadataDatabase:
        type: postgres
        inCluster:
          enabled: true
          storageClass: gp2
          size: 10Gi
        external:
          enabled: false
          host: ""
          port: 5432
          database: ""
          userSecretName: ""

b. Configuration Persistence
- Ensure ConfigMaps and Secrets associated with workflow controllers (e.g. controller config, artifact repository config) are covered by the backup plan (see Backup section).
- Optionally, add a small utility Job or CLI to export workflow configuration for manual backup (e.g. `kubectl get ... -o yaml` baked into docs/examples).
3. Backup and Restore
3.1 Vault Backup / Restore
Backup
- Expose and document a Vault snapshot Job that:
  - Runs on a scheduled basis (CronJob).
  - Uses Vault’s `/sys/storage/raft/snapshot` API (or relevant backend snapshot mechanism).
  - Writes snapshot files to a configured object store (e.g. S3 bucket).
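
One possible shape for the snapshot CronJob is sketched below. The container image, the Vault address, and the token Secret are hypothetical; the sketch also assumes the S3 credentials Secret exposes the standard AWS environment variables and that the token's policy allows reading `sys/storage/raft/snapshot`.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-backup
spec:
  schedule: "0 3 * * *"
  concurrencyPolicy: Forbid            # never run two snapshots at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              # Hypothetical image bundling the vault and aws CLIs
              image: example.org/vault-backup:latest
              env:
                - name: VAULT_ADDR
                  value: "https://vault.vault.svc:8200"   # assumed Service address
                - name: VAULT_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: vault-backup-token            # hypothetical Secret
                      key: token
              envFrom:
                - secretRef:
                    name: vault-backup-s3-credentials
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -eu
                  SNAP="/tmp/vault-$(date -u +%Y%m%dT%H%M%SZ).snap"
                  # Take a Raft snapshot through the Vault API
                  vault operator raft snapshot save "$SNAP"
                  # Upload it under the configured bucket and prefix
                  aws s3 cp "$SNAP" "s3://vault-backups/cluster-a/"
```

A short-lived token obtained through Kubernetes auth would be preferable to a static token Secret; the static Secret is used here only to keep the sketch small.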
Values example:

    vault:
      backup:
        enabled: true
        schedule: "0 3 * * *"
        destination:
          type: s3
          bucket: vault-backups
          prefix: cluster-a/
          secretRef: vault-backup-s3-credentials

Restore
- Provide a manual restore Job template plus documentation:
  - How to stop Vault pods safely.
  - How to use `vault operator raft snapshot restore` (or backend equivalent).
  - How to validate restore success and rejoin cluster.
Deliverables:
- `vault-backup-cronjob.yaml`
- `vault-restore-job.yaml` example
- Documentation: “Vault Backup and Restore”
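
The restore Job template could look roughly like the following. As with the backup sketch, the image, Vault address, token Secret, and the `SNAPSHOT_KEY` variable are assumptions made for illustration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: vault-restore
spec:
  backoffLimit: 0                      # a failed restore should be inspected, not retried
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore
          # Hypothetical image bundling the vault and aws CLIs
          image: example.org/vault-backup:latest
          env:
            - name: VAULT_ADDR
              value: "https://vault.vault.svc:8200"   # assumed Service address
            - name: VAULT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: vault-restore-token           # hypothetical Secret
                  key: token
            - name: SNAPSHOT_KEY                      # object key chosen by the operator
              value: "cluster-a/REPLACE_WITH_SNAPSHOT_FILE"
          envFrom:
            - secretRef:
                name: vault-backup-s3-credentials
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -eu
              aws s3 cp "s3://vault-backups/${SNAPSHOT_KEY}" /tmp/restore.snap
              vault operator raft snapshot restore /tmp/restore.snap
```

The restore command goes through the Vault API, so the target node has to be running and unsealed when the Job starts. Restoring into a cluster other than the one that produced the snapshot typically also requires the `-force` flag and the original cluster's unseal or recovery keys, which the documentation should spell out.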
3.2 Workflow Metadata Backup / Restore
Backup
- If using in-cluster Postgres:
  - Provide a CronJob to run `pg_dump` on the metadata database and push dumps to object storage (S3, GCS, etc.).
- If using external DB:
  - Document required backup strategy and provide example scripts for `pg_dump` / `pgBackRest` usage.
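
For the in-cluster case, a minimal CronJob sketch is shown below. The image, Service name, database name, and the credentials Secret are hypothetical; `PGHOST`, `PGPORT`, `PGDATABASE`, `PGUSER`, and `PGPASSWORD` are the standard libpq environment variables that `pg_dump` reads.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: workflow-metadata-backup
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              # Hypothetical image bundling the Postgres client tools and the aws CLI
              image: example.org/postgres-backup:latest
              env:
                - name: PGHOST
                  value: workflow-postgres              # assumed Service name
                - name: PGPORT
                  value: "5432"
                - name: PGDATABASE
                  value: workflows                      # assumed database name
                - name: PGUSER
                  valueFrom:
                    secretKeyRef:
                      name: workflow-postgres-credentials   # hypothetical Secret
                      key: username
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: workflow-postgres-credentials
                      key: password
              envFrom:
                - secretRef:
                    name: workflow-backup-s3-credentials
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -eu
                  DUMP="/tmp/workflows-$(date -u +%Y%m%dT%H%M%SZ).dump"
                  # Custom format keeps the dump compressed and usable with pg_restore
                  pg_dump --format=custom --file="$DUMP"
                  aws s3 cp "$DUMP" "s3://workflow-metadata-backups/cluster-a/"
```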
Values example:

    workflows:
      metadataDatabase:
        backup:
          enabled: true
          schedule: "0 2 * * *"
          destination:
            type: s3
            bucket: workflow-metadata-backups
            prefix: cluster-a/
            secretRef: workflow-backup-s3-credentials

Restore
- Provide a restore Job template for in-cluster Postgres (with `psql` / `pg_restore`).
- Document how to:
  - Scale down the workflow controllers.
  - Restore the DB.
  - Scale controllers back up.
  - Verify that historical workflows and configuration are restored.
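
The database step of that procedure might be templated as follows, assuming a custom-format dump produced by `pg_dump --format=custom`. The image, Service name, database name, Secrets, and `DUMP_KEY` are hypothetical, and the controllers are expected to be scaled to zero before the Job is applied.

```yaml
# Run only after the workflow controllers have been scaled down;
# scale them back up once this Job has completed successfully.
apiVersion: batch/v1
kind: Job
metadata:
  name: workflow-metadata-restore
spec:
  backoffLimit: 0                      # inspect failures instead of retrying
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pg-restore
          # Hypothetical image bundling the Postgres client tools and the aws CLI
          image: example.org/postgres-backup:latest
          env:
            - name: PGHOST
              value: workflow-postgres                  # assumed Service name
            - name: PGDATABASE
              value: workflows                          # assumed database name
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: workflow-postgres-credentials   # hypothetical Secret
                  key: username
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: workflow-postgres-credentials
                  key: password
            - name: DUMP_KEY                            # object key chosen by the operator
              value: "cluster-a/REPLACE_WITH_DUMP_FILE"
          envFrom:
            - secretRef:
                name: workflow-backup-s3-credentials
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -eu
              aws s3 cp "s3://workflow-metadata-backups/${DUMP_KEY}" /tmp/restore.dump
              # --clean/--if-exists drop existing objects before recreating them
              pg_restore --clean --if-exists --no-owner \
                --dbname="$PGDATABASE" /tmp/restore.dump
```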
3.3 Cluster-level Backup (Optional but Recommended)
Optionally integrate with Velero or a similar tool and document:
- Which namespaces should be included (Vault, workflow controllers, tenant namespaces).
- Any custom resource types that need CRD-aware backup (e.g. RepoRegistration CRs).
- Example `Backup` and `Schedule` resources.
This provides a second layer of DR for:
- ConfigMaps
- Secrets (excluding or including Vault data depending on policy)
- CRDs and their instances
- ServiceAccounts, Roles, RoleBindings, etc.
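
A sketch of the Velero resources the docs could include is given below. The namespace names (`vault`, `workflows`), the schedule, and the 30-day retention are assumptions; tenant namespaces would be appended to the list per deployment.

```yaml
# One-off backup of the platform namespaces
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: platform-manual
  namespace: velero
spec:
  includedNamespaces:
    - vault          # assumed namespace name
    - workflows      # assumed namespace name
  includeClusterResources: true   # pick up cluster-scoped objects such as CRDs
  ttl: 720h0m0s                   # keep for 30 days
---
# Recurring backup with the same scope
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: platform-nightly
  namespace: velero
spec:
  schedule: "0 4 * * *"
  template:
    includedNamespaces:
      - vault
      - workflows
    includeClusterResources: true
    ttl: 720h0m0s
```

Running this after the Vault and Postgres jobs (03:00 and 02:00 in the examples above) keeps the three backups from competing for I/O, though the exact ordering is a tuning decision.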
4. New Screens / CLI Hooks (Optional UX Enhancements)
- UI
  - “Backup & Restore” tab in an Admin section to:
    - Show latest backup timestamps and status for Vault and workflow metadata DB.
    - Optionally trigger an on-demand backup.
- CLI (`calypr admin`)
  - `calypr admin backup vault`
  - `calypr admin backup workflows`
  - `calypr admin restore vault --snapshot <file>`
  - `calypr admin restore workflows --dump <file>`
These commands would be thin wrappers calling the same backend API and/or Jobs.
Implementation Details (High Level)
- Helm / chart changes
  - Add PVC definitions and `persistence.enabled` flags for Vault and workflow metadata DB.
  - Add CronJob templates for backups (Vault, Postgres).
  - Add Job templates for restore.
  - Wire values into templates with sensible defaults.
- Docs
  - New doc: `docs/vault-backup-restore.md`
  - New doc: `docs/workflows-backup-restore.md`
  - Update admin guide to reference persistence and DR.
- Security
  - Configure all backup destinations using dedicated IAM roles or scoped credentials.
  - Ensure secrets used for backups are not themselves part of Vault’s backup snapshot (to avoid circular dependencies).
Risks / Considerations
- Misconfiguration of backups can cause a false sense of safety; we must include restore drills in docs.
- Storage costs for regular database and snapshot backups.
- Latency/size of Vault and DB snapshots for large deployments; need tuning of schedules and retention policies.
- For Vault, incorrect restore procedures can lead to cluster split-brain or data loss; guidance must be precise.
Definition of Done
- Vault uses a configurable, persistent storage backend (raft + PVC by default).
- Workflow metadata database is persisted with PVC or external DB.
- Backup CronJobs for Vault and metadata DB implemented and controlled via values.
- Restore Job templates for Vault and metadata DB added and documented.
- Admin documentation added:
  - How to enable persistence
  - How to configure backup schedules and destinations
  - How to perform a restore (step-by-step)
- Optional: `calypr admin` commands documented for backup/restore orchestration.
- Example configurations tested in at least one non-production environment:
  - Node restart
  - Cluster recreation + restore
  - Vault and metadata DB backup/restore flows verified.