feature/persistence #89

# Feature Request: Add Persistence (with Backup and Restore) for Vault and Workflow Platform Components

## Summary

Add **production-grade persistence** and **backup/restore** capabilities for:

* The **secrets platform** (Vault)
* The **workflow platform components** (e.g. workflow metadata database, controller state, configuration)

so that:

* Secrets and workflow state survive node restarts, upgrades, and cluster failures
* We can **recover** from accidental data loss and migrate clusters safely
* The platform is aligned with standard DevOps / SRE practices for durability and DR

---

## Background / Problem

Right now, the stack assumes **ephemeral or best-effort storage** for:

* Vault data (running in a dev-style configuration or on storage that is not backed up)
* Workflow metadata (e.g. workflow executions, logs, and configuration)
* Per-component configuration that should be backed up (ConfigMaps, Secrets, CRDs)

This is acceptable for PoCs, but it is **not sufficient** for:

* Long-running multi-tenant environments
* Regulated or audited research settings
* Disaster recovery (cluster loss, accidental deletion, etc.)

We need **first-class, documented persistence** and **backup/restore** flows for both Vault and the workflow components.

---

## Goals

1. Persist Vault state on a **durable backend** suitable for production.
2. Persist workflow platform state (e.g. metadata DB, application configuration).
3. Provide a **documented backup and restore process** for:

   * Vault (secrets, policies, tokens)
   * Workflow metadata and configuration (e.g. database, CRDs, ConfigMaps, Secrets)
4. Make everything **configurable via values**, with reasonable defaults and the ability to plug into existing infra (e.g. external databases, S3).

---

## Non-Goals

* Designing a full enterprise DR strategy for the entire organization (this feature is scoped to this stack).
* Implementing proprietary backup tooling; we will integrate with existing ecosystem tools (e.g. Velero, snapshots, Vault’s snapshot APIs).

---

## Proposal

### 1. Vault Persistence

**a. Storage backend**

Support a **configurable storage backend** for Vault, for example:

* Default (and recommended): **Integrated Storage (Raft) with PVCs**
* Optional: external storage (Consul, Postgres, etc.) via values

Values example:

```yaml
vault:
  enabled: true
  storage:
    backend: raft    # or consul, postgres, etc.
    raft:
      storageClass: gp2
      accessModes: ["ReadWriteOnce"]
      size: 20Gi
```

**b. Persistent Volumes**

* Each Vault pod (or Raft node) gets a **PersistentVolumeClaim**.
* PVCs use a configurable `storageClass` and size.
* No more emptyDir or node-local ephemeral storage for Vault data in production mode.
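
As a rough illustration of the per-pod claim, the fragment below assumes Vault runs as a StatefulSet with one claim per replica. The resource names, labels, image tag, and mount path are hypothetical placeholders rather than the chart's actual template; the storage class, access mode, and size mirror the values example above.

```yaml
# Sketch only: names, labels, image tag, and mount path are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vault
spec:
  serviceName: vault-internal        # hypothetical headless Service
  replicas: 3                        # one Raft node per replica
  selector:
    matchLabels:
      app: vault
  template:
    metadata:
      labels:
        app: vault
    spec:
      containers:
        - name: vault
          image: hashicorp/vault:1.15.6   # illustrative tag; pin a version you have validated
          volumeMounts:
            - name: data
              mountPath: /vault/data      # must match the Raft storage path in Vault's config
  # Each replica gets its own PVC (data-vault-0, data-vault-1, ...),
  # which survives pod rescheduling and node restarts.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp2
        resources:
          requests:
            storage: 20Gi
```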

**c. Unseal / Auto-unseal**

* Support integration with a KMS provider (e.g. AWS KMS, GCP KMS) for **auto-unseal**.
* Document how to keep unseal keys and root tokens separate from Vault data backups, so that a leaked backup alone is not enough to decrypt the data.
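
One possible shape for the auto-unseal values is sketched below. The `seal` block and all of its keys are hypothetical and not an existing chart interface; the intent is that they would be rendered into Vault's `seal "awskms"` server configuration stanza, whose `region` and `kms_key_id` parameters they mirror.

```yaml
# Hypothetical values layout; key names are assumptions for discussion.
vault:
  seal:
    type: awskms                 # would select the seal "awskms" stanza
    awskms:
      region: us-east-1          # maps to the stanza's `region`
      kmsKeyId: "<kms-key-id>"   # maps to `kms_key_id`; placeholder, not a real key
      # Credentials are assumed to come from the pod's IAM role
      # rather than from static keys stored in values.
```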

---

### 2. Workflow Platform Persistence

**a. Metadata Database**

Support a **durable metadata store** for workflow platform components, for example:

* Default: in-cluster Postgres with a PVC-backed data directory.
* Optional: external Postgres endpoint via values.

Values example:

```yaml
workflows:
  metadataDatabase:
    type: postgres
    inCluster:
      enabled: true
      storageClass: gp2
      size: 10Gi
    external:
      enabled: false
      host: ""
      port: 5432
      database: ""
      userSecretName: ""
```

**b. Configuration Persistence**

* Ensure ConfigMaps and Secrets associated with workflow controllers (e.g. controller config, artifact repository config) are covered by the backup plan (see Backup section).
* Optionally, add a small utility Job or CLI to export workflow configuration for manual backup (e.g. `kubectl get ... -o yaml` baked into docs/examples).

---

### 3. Backup and Restore

#### 3.1 Vault Backup / Restore

**Backup**

* Expose and document a **Vault snapshot Job** that:

  * Runs on a scheduled basis (CronJob).
  * Uses Vault’s `/sys/storage/raft/snapshot` API (or relevant backend snapshot mechanism).
  * Writes snapshot files to a configured object store (e.g. S3 bucket).

Values example:

```yaml
vault:
  backup:
    enabled: true
    schedule: "0 3 * * *"
    destination:
      type: s3
      bucket: vault-backups
      prefix: cluster-a/
      secretRef: vault-backup-s3-credentials
```
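
A minimal sketch of the snapshot CronJob described above, assuming Integrated Storage (Raft) and an S3 destination. The Service name `vault-active`, the Secret `vault-backup-token`, the image tags, and the assumption that `vault-backup-s3-credentials` holds standard AWS environment variables are all illustrative. An init container saves the snapshot with `vault operator raft snapshot save`, and the main container uploads it only if that step succeeded.

```yaml
# Sketch only: Service name, token Secret, and image tags are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-backup
spec:
  schedule: "0 3 * * *"
  concurrencyPolicy: Forbid            # never run two snapshots at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          initContainers:
            # Step 1: take the Raft snapshot into a scratch volume.
            - name: snapshot
              image: hashicorp/vault:1.15.6
              command: ["vault", "operator", "raft", "snapshot", "save", "/backup/vault.snap"]
              env:
                - name: VAULT_ADDR
                  value: "http://vault-active:8200"   # hypothetical Service
                - name: VAULT_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: vault-backup-token        # hypothetical, narrowly scoped token
                      key: token
              volumeMounts:
                - name: scratch
                  mountPath: /backup
          containers:
            # Step 2: upload the snapshot under a timestamped key.
            - name: upload
              image: amazon/aws-cli:2.15.30
              command: ["/bin/sh", "-c"]
              args:
                # The $(date ...) substitution is evaluated by the shell;
                # Kubernetes leaves it untouched because it is not an env var name.
                - |
                  aws s3 cp /backup/vault.snap \
                    "s3://vault-backups/cluster-a/vault-$(date -u +%Y%m%dT%H%M%SZ).snap"
              envFrom:
                - secretRef:
                    name: vault-backup-s3-credentials  # assumed to hold AWS_* variables
              volumeMounts:
                - name: scratch
                  mountPath: /backup
          volumes:
            - name: scratch
              emptyDir: {}             # transient staging only, removed with the pod
```

The `emptyDir` here is only a hand-off area between the two containers; the durable copy is the object in the bucket.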

**Restore**

* Provide a **manual restore Job template** plus documentation:

  * How to stop Vault pods safely.
  * How to use `vault operator raft snapshot restore` (or backend equivalent).
  * How to validate restore success and rejoin the cluster.
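
The restore Job template could look roughly like the following sketch. Note that `vault operator raft snapshot restore` is sent to a running, unsealed Vault, so the documentation needs to cover that state as well. The snapshot object name is a placeholder to fill in at restore time, and the Service name, token Secret, and image tags are assumptions.

```yaml
# Sketch only: applied manually by an operator, never on a schedule.
apiVersion: batch/v1
kind: Job
metadata:
  name: vault-restore
spec:
  backoffLimit: 0                      # do not retry a restore automatically
  template:
    spec:
      restartPolicy: Never
      initContainers:
        # Step 1: fetch the chosen snapshot from object storage.
        - name: download
          image: amazon/aws-cli:2.15.30
          args: ["s3", "cp", "s3://vault-backups/cluster-a/<snapshot-file>", "/restore/vault.snap"]
          envFrom:
            - secretRef:
                name: vault-backup-s3-credentials
          volumeMounts:
            - name: scratch
              mountPath: /restore
      containers:
        # Step 2: restore into the running, unsealed cluster.
        # Add "-force" only when restoring a snapshot taken from a different cluster.
        - name: restore
          image: hashicorp/vault:1.15.6
          command: ["vault", "operator", "raft", "snapshot", "restore", "/restore/vault.snap"]
          env:
            - name: VAULT_ADDR
              value: "http://vault-active:8200"   # hypothetical Service
            - name: VAULT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: vault-restore-token       # hypothetical Secret
                  key: token
          volumeMounts:
            - name: scratch
              mountPath: /restore
      volumes:
        - name: scratch
          emptyDir: {}
```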

Deliverables:

* `vault-backup-cronjob.yaml`
* `vault-restore-job.yaml` example
* Documentation: “Vault Backup and Restore”

---

#### 3.2 Workflow Metadata Backup / Restore

**Backup**

* If using in-cluster Postgres:

  * Provide a **CronJob** to run `pg_dump` on the metadata database and push dumps to object storage (S3, GCS, etc.).
* If using external DB:

  * Document required backup strategy and provide example scripts for `pg_dump`/`pgBackRest` usage.

Values example:

```yaml
workflows:
  metadataDatabase:
    backup:
      enabled: true
      schedule: "0 2 * * *"
      destination:
        type: s3
        bucket: workflow-metadata-backups
        prefix: cluster-a/
        secretRef: workflow-backup-s3-credentials
```
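
A possible shape for the in-cluster `pg_dump` CronJob, as a sketch under stated assumptions: the Postgres Service `workflows-postgres`, the Secret `workflows-postgres-credentials`, the database name, and the image tags are hypothetical. The dump uses the custom format so that it can later be restored with `pg_restore`.

```yaml
# Sketch only: Service, Secret, database name, and image tags are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: workflow-metadata-backup
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          initContainers:
            # Step 1: dump the metadata database in custom format.
            - name: dump
              image: postgres:16       # client version should not be older than the server
              command: ["pg_dump", "--format=custom", "--file=/backup/metadata.dump"]
              env:                     # standard libpq connection variables
                - name: PGHOST
                  value: workflows-postgres
                - name: PGDATABASE
                  value: workflows
                - name: PGUSER
                  valueFrom:
                    secretKeyRef:
                      name: workflows-postgres-credentials
                      key: username
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: workflows-postgres-credentials
                      key: password
              volumeMounts:
                - name: scratch
                  mountPath: /backup
          containers:
            # Step 2: upload the dump under a timestamped key.
            - name: upload
              image: amazon/aws-cli:2.15.30
              command: ["/bin/sh", "-c"]
              args:
                - |
                  aws s3 cp /backup/metadata.dump \
                    "s3://workflow-metadata-backups/cluster-a/metadata-$(date -u +%Y%m%dT%H%M%SZ).dump"
              envFrom:
                - secretRef:
                    name: workflow-backup-s3-credentials
              volumeMounts:
                - name: scratch
                  mountPath: /backup
          volumes:
            - name: scratch
              emptyDir: {}
```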

**Restore**

* Provide a **restore Job** template for in-cluster Postgres (with `psql` / `pg_restore`).
* Document how to:

  * Scale down the workflow controllers.
  * Restore the DB.
  * Scale controllers back up.
  * Verify that historical workflows and configuration are restored.
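
The restore steps above might be captured in a Job template along these lines. This is a sketch, assuming a custom-format dump and the same hypothetical Service and Secret names used for illustration elsewhere; the dump object name is a placeholder. The controllers are expected to be scaled down before the Job is applied, for example with `kubectl scale deployment/<workflow-controller> --replicas=0`.

```yaml
# Sketch only: apply after the workflow controllers have been scaled to zero.
apiVersion: batch/v1
kind: Job
metadata:
  name: workflow-metadata-restore
spec:
  backoffLimit: 0                      # a failed restore should be inspected, not retried
  template:
    spec:
      restartPolicy: Never
      initContainers:
        # Step 1: fetch the chosen dump from object storage.
        - name: download
          image: amazon/aws-cli:2.15.30
          args: ["s3", "cp", "s3://workflow-metadata-backups/cluster-a/<dump-file>", "/restore/metadata.dump"]
          envFrom:
            - secretRef:
                name: workflow-backup-s3-credentials
          volumeMounts:
            - name: scratch
              mountPath: /restore
      containers:
        # Step 2: drop and recreate the dumped objects in the target database.
        - name: restore
          image: postgres:16
          command: ["/bin/sh", "-c"]
          args:
            - |
              pg_restore --clean --if-exists --no-owner \
                --dbname="$PGDATABASE" /restore/metadata.dump
          env:
            - name: PGHOST
              value: workflows-postgres            # hypothetical Service
            - name: PGDATABASE
              value: workflows                     # hypothetical database name
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: workflows-postgres-credentials
                  key: username
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: workflows-postgres-credentials
                  key: password
          volumeMounts:
            - name: scratch
              mountPath: /restore
      volumes:
        - name: scratch
          emptyDir: {}
```

Once the Job completes, the controllers can be scaled back up and historical workflows spot-checked.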

---

#### 3.3 Cluster-level Backup (Optional but Recommended)

Optionally integrate with **Velero** or a similar tool and document:

* Which namespaces should be included (Vault, workflow controllers, tenant namespaces).
* Any custom resource types that need CRD-aware backup (e.g. RepoRegistration CRs).
* Example `Backup` and `Schedule` resources.

This provides a second layer of DR for:

* ConfigMaps
* Secrets (excluding or including Vault data depending on policy)
* CRDs and their instances
* ServiceAccounts, Roles, RoleBindings, etc.
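
For reference, the example `Schedule` and `Backup` resources mentioned above could look like the sketch below. It assumes Velero is installed in the `velero` namespace; the namespace names, schedule, and retention period are placeholders to adapt.

```yaml
# Sketch only: namespace names, schedule, and TTL are assumptions.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: platform-daily
  namespace: velero
spec:
  schedule: "0 4 * * *"            # after the Vault and metadata DB backups
  template:
    includedNamespaces:
      - vault                      # hypothetical namespace names
      - workflows
    includeClusterResources: true  # picks up CRDs, ClusterRoles, and similar
    ttl: 720h0m0s                  # keep each backup for 30 days
---
# On-demand backup with the same scope, e.g. before an upgrade.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: platform-manual
  namespace: velero
spec:
  includedNamespaces:
    - vault
    - workflows
  includeClusterResources: true
  ttl: 720h0m0s
```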

---

### 4. New Screens / CLI Hooks (Optional UX Enhancements)

* **UI**

  * “Backup & Restore” tab in an Admin section to:

    * Show latest backup timestamps and status for Vault and workflow metadata DB.
    * Optionally trigger an on-demand backup.
* **CLI (`calypr admin`)**

  * `calypr admin backup vault`
  * `calypr admin backup workflows`
  * `calypr admin restore vault --snapshot <file>`
  * `calypr admin restore workflows --dump <file>`

These commands would be thin wrappers calling the same backend API and/or Jobs.

---

## Implementation Details (High Level)

1. **Helm / chart changes**

   * Add PVC definitions and `persistence.enabled` flags for Vault and workflow metadata DB.
   * Add CronJob templates for backups (Vault, Postgres).
   * Add Job templates for restore.
   * Wire values into templates with sensible defaults.

2. **Docs**

   * New doc: `docs/vault-backup-restore.md`
   * New doc: `docs/workflows-backup-restore.md`
   * Update admin guide to reference persistence and DR.

3. **Security**

   * Configure all backup destinations using dedicated IAM roles or scoped credentials.
   * Ensure secrets used for backups are not themselves part of Vault’s backup snapshot (to avoid circular dependencies).

---

## Risks / Considerations

* Misconfiguration of backups can cause a **false sense of safety**; we must include **restore drills** in docs.
* Storage costs for regular database and snapshot backups.
* Latency/size of Vault and DB snapshots for large deployments; need tuning of schedules and retention policies.
* For Vault, incorrect restore procedures can lead to cluster split-brain or data loss; guidance must be precise.

---

## Definition of Done

* [ ] Vault uses a configurable, persistent storage backend (raft + PVC by default).
* [ ] Workflow metadata database is persisted with PVC or external DB.
* [ ] Backup CronJobs for Vault and metadata DB implemented and controlled via values.
* [ ] Restore Job templates for Vault and metadata DB added and documented.
* [ ] Admin documentation added:

  * How to enable persistence
  * How to configure backup schedules and destinations
  * How to perform a restore (step-by-step)
* [ ] Optional: `calypr admin` commands documented for backup/restore orchestration.
* [ ] Example configurations tested in at least one non-production environment:

  * Node restart
  * Cluster recreation + restore
  * Vault and metadata DB backup/restore flows verified.

