feature/persistence #89

@bwalsh

Feature Request: Add Persistence (with Backup and Restore) for Vault and Workflow Platform Components

Summary

Add production-grade persistence and backup/restore capabilities for:

  • The secrets platform (Vault)
  • The workflow platform components (e.g. workflow metadata database, controller state, configuration)

so that:

  • Secrets and workflow state survive node restarts, upgrades, and cluster failures
  • We can recover from accidental data loss and migrate clusters safely
  • The platform is aligned with standard DevOps / SRE practices for durability and DR

Background / Problem

Right now, the stack assumes ephemeral or best-effort storage for:

  • Vault data (running in a dev-style configuration or on non-backed storage)
  • Workflow metadata (e.g. workflow executions, logs, and configuration)
  • Per-component configuration that should be backed up (ConfigMaps, Secrets, CRDs)

This is acceptable for PoCs, but it is not sufficient for:

  • Long-running multi-tenant environments
  • Regulated or audited research settings
  • Disaster recovery (cluster loss, accidental deletion, etc.)

We need first-class, documented persistence and backup/restore flows for both Vault and the workflow components.


Goals

  1. Persist Vault state on a durable backend suitable for production.

  2. Persist workflow platform state (e.g. metadata DB, application configuration).

  3. Provide a documented backup and restore process for:

    • Vault (secrets, policies, tokens)
    • Workflow metadata and configuration (e.g. database, CRDs, ConfigMaps, Secrets)
  4. Make everything configurable via values, with reasonable defaults and the ability to plug into existing infra (e.g. external databases, S3).


Non-Goals

  • Designing a full enterprise DR strategy for the entire organization (this feature is scoped to this stack).
  • Implementing proprietary backup tooling; we will integrate with existing ecosystem tools (e.g. Velero, snapshots, Vault’s snapshot APIs).

Proposal

1. Vault Persistence

a. Storage backend

Support a configurable storage backend for Vault, for example:

  • Default (and recommended): Integrated Storage (Raft) with PVCs
  • Optional: external storage (Consul, Postgres, etc.) via values

Values example:

vault:
  enabled: true
  storage:
    backend: raft    # or consul, postgres, etc.
    raft:
      storageClass: gp2
      accessModes: ["ReadWriteOnce"]
      size: 20Gi

b. Persistent Volumes

  • Each Vault pod (or Raft node) gets a PersistentVolumeClaim.
  • PVCs use a configurable storageClass and size.
  • No more emptyDir or node-local ephemeral storage for Vault data in production mode.
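As an illustration of the per-pod PVCs described above, a StatefulSet would typically declare them through `volumeClaimTemplates`. The fragment below is a sketch of what the chart might render from the raft values shown earlier; the StatefulSet name `vault` and the claim name `data` are assumptions, not names fixed by this proposal.

```yaml
# Sketch only: rendered fragment of a Vault StatefulSet.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vault                      # assumed name
spec:
  # serviceName, replicas, selector and the pod template are omitted here.
  volumeClaimTemplates:
    - metadata:
        name: data                 # assumed claim name, mounted as the Raft data dir
      spec:
        accessModes: ["ReadWriteOnce"]   # from vault.storage.raft.accessModes
        storageClassName: gp2            # from vault.storage.raft.storageClass
        resources:
          requests:
            storage: 20Gi                # from vault.storage.raft.size
```

Each replica then gets its own claim (for example `data-vault-0`), which survives pod rescheduling.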

c. Unseal / Auto-unseal

  • Support integration with a KMS provider (e.g. AWS KMS, GCP KMS) for auto-unseal.
  • Document that unseal (or recovery) keys and root tokens must be stored separately from Vault data backups, and how to do so.
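A possible shape for the auto-unseal values is sketched below. Every key under `vault.seal` is hypothetical and only meant to show how a KMS provider could be selected; the chart would be expected to render these into Vault's `seal "awskms"` configuration stanza.

```yaml
# Sketch only: all keys under vault.seal are hypothetical.
vault:
  seal:
    type: awskms                   # e.g. awskms or gcpckms
    awskms:
      region: us-east-1            # placeholder region
      kmsKeyId: ""                 # ID or alias of the KMS key used for unsealing
      credentialsSecretRef: ""     # optional; leave empty when using an IAM role
```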

2. Workflow Platform Persistence

a. Metadata Database

Support a durable metadata store for workflow platform components, for example:

  • Default: in-cluster Postgres with a PVC-backed data directory.
  • Optional: external Postgres endpoint via values.

Values example:

workflows:
  metadataDatabase:
    type: postgres
    inCluster:
      enabled: true
      storageClass: gp2
      size: 10Gi
    external:
      enabled: false
      host: ""
      port: 5432
      database: ""
      userSecretName: ""

b. Configuration Persistence

  • Ensure ConfigMaps and Secrets associated with workflow controllers (e.g. controller config, artifact repository config) are covered by the backup plan (see Backup section).
  • Optionally, add a small utility Job or CLI to export workflow configuration for manual backup (e.g. kubectl get ... -o yaml baked into docs/examples).

3. Backup and Restore

3.1 Vault Backup / Restore

Backup

  • Expose and document a Vault snapshot Job that:

    • Runs on a scheduled basis (CronJob).
    • Uses Vault’s Raft snapshot API (GET /v1/sys/storage/raft/snapshot) or the relevant backend's snapshot mechanism.
    • Writes snapshot files to a configured object store (e.g. S3 bucket).

Values example:

vault:
  backup:
    enabled: true
    schedule: "0 3 * * *"
    destination:
      type: s3
      bucket: vault-backups
      prefix: cluster-a/
      secretRef: vault-backup-s3-credentials
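A CronJob rendered from these values could look roughly like the sketch below. It assumes a container image that ships both the `vault` and `aws` CLIs (the image name is a placeholder), a Secret `vault-backup-token` holding a token allowed to take snapshots (hypothetical), and that `vault-backup-s3-credentials` exposes the usual AWS credential environment variables.

```yaml
# Sketch only: scheduled Vault Raft snapshot pushed to S3.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-backup
spec:
  schedule: "0 3 * * *"              # from vault.backup.schedule
  concurrencyPolicy: Forbid          # never run two snapshots at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: example.org/vault-backup:latest   # placeholder image
              env:
                - name: VAULT_ADDR
                  value: https://vault:8200            # assumed Service address
                - name: VAULT_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: vault-backup-token         # hypothetical Secret
                      key: token
              envFrom:
                - secretRef:
                    name: vault-backup-s3-credentials  # from vault.backup.destination.secretRef
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -eu
                  f="/tmp/vault-$(date -u +%Y%m%dT%H%M%SZ).snap"
                  # Take a consistent snapshot of the Raft storage.
                  vault operator raft snapshot save "$f"
                  # Upload to the configured bucket and prefix.
                  aws s3 cp "$f" "s3://vault-backups/cluster-a/"
```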

Restore

  • Provide a manual restore Job template plus documentation:

    • How to stop Vault pods safely.
    • How to use vault operator raft snapshot restore (or backend equivalent).
    • How to validate restore success and rejoin cluster.
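The manual restore Job could follow the sketch below. It makes the same assumptions as a backup job would (a placeholder image with the `vault` and `aws` CLIs, a hypothetical token Secret), and `SNAPSHOT_KEY` is a hypothetical variable the operator sets to the object to restore.

```yaml
# Sketch only: one-off Job that restores a Raft snapshot from S3.
apiVersion: batch/v1
kind: Job
metadata:
  name: vault-restore
spec:
  backoffLimit: 0                    # do not retry a restore automatically
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore
          image: example.org/vault-backup:latest   # placeholder image
          env:
            - name: VAULT_ADDR
              value: https://vault:8200            # assumed Service address
            - name: VAULT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: vault-restore-token        # hypothetical Secret
                  key: token
            - name: SNAPSHOT_KEY
              value: ""                            # set by the operator, e.g. cluster-a/<file>.snap
          envFrom:
            - secretRef:
                name: vault-backup-s3-credentials
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -eu
              aws s3 cp "s3://vault-backups/${SNAPSHOT_KEY}" /tmp/restore.snap
              # Restoring into a different cluster than the one that produced
              # the snapshot may additionally require the -force flag.
              vault operator raft snapshot restore /tmp/restore.snap
              # Basic post-restore check: Vault answers and reports its state.
              vault status
```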

Deliverables:

  • vault-backup-cronjob.yaml
  • vault-restore-job.yaml example
  • Documentation: “Vault Backup and Restore”

3.2 Workflow Metadata Backup / Restore

Backup

  • If using in-cluster Postgres:

    • Provide a CronJob to run pg_dump on the metadata database and push dumps to object storage (S3, GCS, etc.).
  • If using external DB:

    • Document required backup strategy and provide example scripts for pg_dump/pgBackRest usage.

Values example:

workflows:
  metadataDatabase:
    backup:
      enabled: true
      schedule: "0 2 * * *"
      destination:
        type: s3
        bucket: workflow-metadata-backups
        prefix: cluster-a/
        secretRef: workflow-backup-s3-credentials
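A matching CronJob for the in-cluster Postgres case might look like the sketch below. The image (one providing `pg_dump` and the `aws` CLI), the Service name `workflow-postgres`, the database name, and the Secret `workflow-db-user` with `username`/`password` keys are all assumptions for illustration.

```yaml
# Sketch only: scheduled pg_dump of the workflow metadata DB pushed to S3.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: workflow-metadata-backup
spec:
  schedule: "0 2 * * *"              # from workflows.metadataDatabase.backup.schedule
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: example.org/pg-backup:latest   # placeholder image
              env:
                - name: PGHOST
                  value: workflow-postgres          # assumed Service name
                - name: PGDATABASE
                  value: workflows                  # assumed database name
                - name: PGUSER
                  valueFrom:
                    secretKeyRef: {name: workflow-db-user, key: username}
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef: {name: workflow-db-user, key: password}
              envFrom:
                - secretRef:
                    name: workflow-backup-s3-credentials
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -eu
                  f="/tmp/workflows-$(date -u +%Y%m%dT%H%M%SZ).dump"
                  # Custom-format archive, restorable with pg_restore.
                  pg_dump --format=custom --file="$f"
                  aws s3 cp "$f" "s3://workflow-metadata-backups/cluster-a/"
```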

Restore

  • Provide a restore Job template for in-cluster Postgres (using psql for plain-SQL dumps or pg_restore for custom-format archives).

  • Document how to:

    • Scale down the workflow controllers.
    • Restore the DB.
    • Scale controllers back up.
    • Verify that historical workflows and configuration are restored.
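The restore steps above could be captured in a Job along the lines of the sketch below, with the scale-down and scale-up done by the operator around it. The image, Service name, database name, Secret names, and the `DUMP_KEY` variable are hypothetical, and the dump is assumed to be a custom-format archive.

```yaml
# Sketch only: one-off Job that restores a custom-format dump.
# Before applying: scale the workflow controllers to zero, e.g.
#   kubectl scale deployment <controller> --replicas=0
# After it completes: scale them back up and verify workflow history.
apiVersion: batch/v1
kind: Job
metadata:
  name: workflow-metadata-restore
spec:
  backoffLimit: 0                    # do not retry a restore automatically
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pg-restore
          image: example.org/pg-backup:latest   # placeholder image
          env:
            - name: PGHOST
              value: workflow-postgres          # assumed Service name
            - name: PGDATABASE
              value: workflows                  # assumed database name
            - name: PGUSER
              valueFrom:
                secretKeyRef: {name: workflow-db-user, key: username}
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef: {name: workflow-db-user, key: password}
            - name: DUMP_KEY
              value: ""                         # set by the operator, e.g. cluster-a/<file>.dump
          envFrom:
            - secretRef:
                name: workflow-backup-s3-credentials
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -eu
              aws s3 cp "s3://workflow-metadata-backups/${DUMP_KEY}" /tmp/restore.dump
              # Drop and recreate objects found in the archive, then load data.
              pg_restore --clean --if-exists --dbname="$PGDATABASE" /tmp/restore.dump
```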

3.3 Cluster-level Backup (Optional but Recommended)

Optionally integrate with Velero or a similar tool and document:

  • Which namespaces should be included (Vault, workflow controllers, tenant namespaces).
  • Any custom resource types that need CRD-aware backup (e.g. RepoRegistration CRs).
  • Example Backup and Schedule resources.

This provides a second layer of DR for:

  • ConfigMaps
  • Secrets (excluding or including Vault data depending on policy)
  • CRDs and their instances
  • ServiceAccounts, Roles, RoleBindings, etc.
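If Velero is used, the example `Schedule` resource mentioned above might look like the sketch below. The namespace names `vault` and `workflows`, the schedule time, and the 30-day retention are assumptions; tenant namespaces would be added to the list as needed.

```yaml
# Sketch only: daily Velero backup of the platform namespaces.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: platform-daily
  namespace: velero                  # namespace where Velero is installed
spec:
  schedule: "0 4 * * *"              # after the Vault and Postgres dumps have run
  template:
    includedNamespaces:
      - vault                        # assumed namespace names
      - workflows
    includeClusterResources: true    # picks up CRDs and other cluster-scoped objects
    ttl: 720h0m0s                    # keep each backup for 30 days
```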

4. New Screens / CLI Hooks (Optional UX Enhancements)

  • UI

    • “Backup & Restore” tab in an Admin section to:

      • Show latest backup timestamps and status for Vault and workflow metadata DB.
      • Optionally trigger an on-demand backup.
  • CLI (calypr admin)

    • calypr admin backup vault
    • calypr admin backup workflows
    • calypr admin restore vault --snapshot <file>
    • calypr admin restore workflows --dump <file>

These commands would be thin wrappers calling the same backend API and/or Jobs.


Implementation Details (High Level)

  1. Helm / chart changes

    • Add PVC definitions and persistence.enabled flags for Vault and workflow metadata DB.
    • Add CronJob templates for backups (Vault, Postgres).
    • Add Job templates for restore.
    • Wire values into templates with sensible defaults.
  2. Docs

    • New doc: docs/vault-backup-restore.md
    • New doc: docs/workflows-backup-restore.md
    • Update admin guide to reference persistence and DR.
  3. Security

    • Configure all backup destinations using dedicated IAM roles or scoped credentials.
    • Ensure the credentials needed to run or restore backups are not stored only inside Vault (and therefore only inside its snapshots), since they would be unavailable exactly when Vault has to be restored.

Risks / Considerations

  • Misconfiguration of backups can cause a false sense of safety; we must include restore drills in docs.
  • Storage costs for regular database and snapshot backups.
  • Vault and DB snapshots can become large and slow to take in big deployments; schedules and retention policies will need tuning.
  • For Vault, incorrect restore procedures can lead to cluster split-brain or data loss; guidance must be precise.

Definition of Done

  • Vault uses a configurable, persistent storage backend (Raft + PVC by default).

  • Workflow metadata database is persisted with PVC or external DB.

  • Backup CronJobs for Vault and metadata DB implemented and controlled via values.

  • Restore Job templates for Vault and metadata DB added and documented.

  • Admin documentation added:

    • How to enable persistence
    • How to configure backup schedules and destinations
    • How to perform a restore (step-by-step)
  • Optional: calypr admin commands documented for backup/restore orchestration.

  • Example configurations tested in at least one non-production environment, covering:

    • Node restart
    • Cluster recreation followed by restore
    • Vault and metadata DB backup/restore flows

Metadata

Labels

enhancement (New feature or request)
