Automated Backup and Disaster Recovery #167

@rucka

Description

Story Statement

As a platform operations engineer
I want automated backup and disaster recovery for the knowledge service
So that KB content is protected against data loss and can be rapidly restored after failures

Where: Knowledge service infrastructure — backup and recovery automation

Epic Context

Parent Epic: Platform Hardening & Enterprise Readiness #68
Status: Refined
Priority: P1 (Should-Have)

Status Workflow

  • Refined: Story is detailed, estimated, and ready for development
  • In Progress: Story is actively being developed
  • Done: Story delivered and accepted

Acceptance Criteria

Functional Requirements

  1. Given the knowledge service is running in production
    When the daily backup cron runs at 02:00 UTC
    Then a full PostgreSQL backup is created, compressed, encrypted, and stored in the backup S3 bucket with timestamp naming

  2. Given continuous WAL archiving is enabled
    When any DB transaction commits
    Then WAL segments are streamed to the backup S3 bucket, enabling point-in-time recovery within the RPO window (<1 hour)

  3. Given a disaster occurs and data is lost
    When an ops engineer runs the restore procedure
    Then the service is restored to a specific point-in-time within RTO (<4 hours): DB restored from backup + WAL replay, S3 packages restored from cross-region replica

  4. Given automated backup verification runs weekly
    When the verification job executes
    Then it restores the latest backup to a test environment, runs validation checks (table counts, sample data integrity), and reports pass/fail

  5. Given a backup job fails
    When the failure is detected
    Then an alert fires immediately (critical severity); ops engineer is notified via all configured channels

  6. Given backup retention policy (30 days daily, 12 months monthly)
    When the cleanup job runs
    Then expired backups are deleted from the backup bucket; active backups are never deleted
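
Criterion 3 (with the corrupted-WAL fallback under Edge Cases) implies a selection step before any restore: choose the newest base backup taken at or before the requested point in time, then work out how much data is lost if archived WAL stops short of that point. The following is a minimal sketch of that logic only. The function names and the in-memory list of backup timestamps are hypothetical; a real script would read them from the backup catalog.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RPO = timedelta(hours=1)  # from the story: RPO < 1 hour


def pick_base_backup(backups: list[datetime], target: datetime) -> Optional[datetime]:
    """Return the newest base backup taken at or before the target time."""
    candidates = [b for b in backups if b <= target]
    return max(candidates) if candidates else None


def data_loss_window(target: datetime, last_archived_wal: datetime) -> timedelta:
    """Gap between the last archived WAL and the target (zero if WAL covers it)."""
    return max(target - last_archived_wal, timedelta(0))


def within_rpo(target: datetime, last_archived_wal: datetime) -> bool:
    """True when the unrecoverable window is smaller than the RPO."""
    return data_loss_window(target, last_archived_wal) < RPO
```

If `pick_base_backup` returns `None`, no backup predates the target and the restore cannot proceed; the runbook would need to say what happens then.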

Business Rules

  • RPO: <1 hour (WAL archiving); RTO: <4 hours (documented restore procedure)
  • Backup encryption: uses the same KMS key as production data (from Data Encryption at Rest and in Transit #166)
  • Daily full backup + continuous WAL archiving
  • S3 cross-region replication for KB packages (separate from DB backup)
  • Backup verification: weekly automated restore test to isolated environment
  • Retention: 30 daily + 12 monthly backups
  • Restore procedure: documented runbook with step-by-step commands
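
The retention rule (30 daily + 12 monthly) leaves room for interpretation. The sketch below assumes one reading: keep every backup from the last 30 days, and additionally keep the earliest backup of each of the last 12 calendar months. The story does not say which backup of a month counts as "monthly", so that choice and the function names are assumptions.

```python
from datetime import date, timedelta

DAILY_RETENTION_DAYS = 30      # from the story
MONTHLY_RETENTION_MONTHS = 12  # from the story


def month_index(d: date) -> int:
    """Map a date to a running month number so months can be subtracted."""
    return d.year * 12 + d.month


def expired_backups(backup_dates: list[date], today: date) -> list[date]:
    """Return the backup dates that fall outside both retention tiers."""
    daily_cutoff = today - timedelta(days=DAILY_RETENTION_DAYS)
    keep = {d for d in backup_dates if d > daily_cutoff}

    # Assumption: the "monthly" backup is the earliest one in its calendar month.
    first_in_month: dict[int, date] = {}
    for d in sorted(backup_dates):
        first_in_month.setdefault(month_index(d), d)

    current = month_index(today)
    for idx, d in first_in_month.items():
        if 0 <= current - idx < MONTHLY_RETENTION_MONTHS:
            keep.add(d)

    return sorted(d for d in backup_dates if d not in keep)
```

Computing the keep-set first and deleting only what is outside it makes it harder for a cleanup bug to remove an active backup.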

Edge Cases and Error Handling

  • Backup storage full: Alert before reaching 90% capacity; auto-cleanup of oldest expired backups
  • WAL archiving lag: Alert if lag exceeds 15 minutes; investigate the archiver (a failing archive_command, or the replication slot if WAL is streamed with pg_receivewal)
  • Restore to point-in-time with corrupted WAL: Fall back to nearest daily backup; report data loss window
  • Cross-region replication delay: Monitor replication lag; alert if >1 hour
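
The thresholds in this list can be collected into a single evaluation step, sketched below. The 15-minute and 1-hour limits come from the list above. The 80% storage warning level is an assumption, chosen only because the alert must fire before usage reaches 90%. The `BackupHealth` type and the alert strings are hypothetical, and how the metrics are collected is out of scope here.

```python
from dataclasses import dataclass
from datetime import timedelta

STORAGE_WARN_FRACTION = 0.80  # assumption: warn early, before the 90% mark
WAL_LAG_LIMIT = timedelta(minutes=15)        # from the story
REPLICATION_LAG_LIMIT = timedelta(hours=1)   # from the story


@dataclass(frozen=True)
class BackupHealth:
    """Hypothetical snapshot of backup-related metrics."""
    storage_used_fraction: float
    wal_archive_lag: timedelta
    replication_lag: timedelta


def evaluate_alerts(health: BackupHealth) -> list[str]:
    """Return one alert message per threshold that is currently breached."""
    alerts = []
    if health.storage_used_fraction >= STORAGE_WARN_FRACTION:
        alerts.append("backup storage approaching capacity")
    if health.wal_archive_lag > WAL_LAG_LIMIT:
        alerts.append("WAL archiving lag exceeds 15 minutes")
    if health.replication_lag > REPLICATION_LAG_LIMIT:
        alerts.append("cross-region replication lag exceeds 1 hour")
    return alerts
```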

Definition of Done Checklist

Development Completion

  • All 6 acceptance criteria implemented and verified
  • Daily backup cron job (pg_dump or WAL-G)
  • Continuous WAL archiving to S3
  • S3 cross-region replication for packages
  • Restore procedure script (one-command)
  • Backup verification weekly job
  • Backup retention cleanup job
  • Backup monitoring alerts
  • DR runbook documentation
  • Integration test: backup → tamper → restore → verify

Quality Assurance

  • Backup completes within 1-hour maintenance window
  • Restore tested and verified <4 hours RTO
  • WAL archiving lag <15 minutes under normal load
  • Backup verification catches corruption

Deployment and Release

  • Backup S3 bucket configured with encryption and versioning
  • Cross-region replication enabled for packages bucket
  • Cron jobs scheduled and monitored
  • DR runbook reviewed by ops team

Story Sizing and Sprint Readiness

Refined Story Points

Final Story Points: XL(10)
Confidence Level: Medium
Sizing Justification: DB backup automation, WAL archiving, S3 replication, restore scripting, verification automation, monitoring. Significant infrastructure work. Complexity depends on managed vs self-hosted DB.

Sprint Capacity Validation

Sprint Fit Assessment: Tight for single sprint
Total Effort Assessment: Borderline

Story Splitting Recommendations

  1. Automated Backup and Disaster Recovery #167-A: Daily backup + WAL archiving + restore script + DR runbook (XL(8))
  2. Automated Backup and Disaster Recovery #167-B: Backup verification + retention + monitoring + S3 replication (M(3))

Dependencies and Coordination

Story Dependencies

Prerequisite Stories: #166 (Encryption — backups must be encrypted), Epic #66 #149 (DB schema established)
Dependent Stories: None

External Dependencies

Infrastructure Requirements: Backup S3 bucket (separate from production), secondary region for replication

Validation and Testing Strategy

Acceptance Testing Approach

Testing Methods: Full DR drill: create data → backup → destroy → restore → verify data; WAL replay test; backup verification automation test
Test Data Requirements: Representative dataset for restore timing verification
Environment Requirements: Isolated restore environment, backup S3 bucket
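
The drill sequence (create data → backup → destroy → restore → verify) can be rehearsed in miniature before any infrastructure exists. The sketch below uses an in-memory SQLite database from the Python standard library purely as a stand-in for PostgreSQL; it shows the shape of the test, not the real backup tooling. Only the `organizations` table name comes from this story; the rows and function names are made up.

```python
import sqlite3


def row_count(conn: sqlite3.Connection, table: str) -> int:
    """Count rows in a table (table name is trusted test input here)."""
    return conn.execute(f"SELECT count(*) FROM {table}").fetchone()[0]


def run_drill() -> bool:
    """Create data, back it up, destroy the source, restore, and verify."""
    # 1. Create data in the "production" database.
    prod = sqlite3.connect(":memory:")
    prod.execute("CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT)")
    prod.executemany(
        "INSERT INTO organizations (name) VALUES (?)", [("acme",), ("globex",)]
    )
    prod.commit()
    expected = row_count(prod, "organizations")

    # 2. Back up to a separate database.
    backup = sqlite3.connect(":memory:")
    prod.backup(backup)

    # 3. Destroy the source data.
    prod.execute("DROP TABLE organizations")
    prod.commit()

    # 4. Restore into a fresh database and 5. verify the row count.
    restored = sqlite3.connect(":memory:")
    backup.backup(restored)
    return row_count(restored, "organizations") == expected
```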

Notes

Refinement Insights: If using a managed DB (RDS), automated backups and WAL archiving are built in, so the work focuses on configuration, monitoring, and restore scripting.

Technical Analysis

Implementation Approach

Technical Strategy: Use WAL-G (or native RDS backups for a managed DB) for DB backup and WAL archiving. Configure S3 cross-region replication with a replication rule on the packages bucket. Restore script: stop service → restore DB with WAL-G → verify → restart. Verification: weekly cron that restores to a test DB and runs integrity checks.
Key Components: Backup cron (WAL-G or pg_dump), WAL archiver, restore script, verification job, retention cleanup, backup monitoring alerts
Data Flow: DB → WAL archiving → S3 (continuous) | Daily cron → full backup → S3 | Weekly → restore to test → validate → report
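
The last leg of the data flow (restore to test → validate → report) reduces to comparing what the restored database contains against what was recorded when the backup was taken. A minimal sketch of the pass/fail reporting follows. It assumes per-table row counts are captured at backup time and stored alongside the backup, which this story does not specify; the `CheckResult` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_table_counts(
    recorded: dict[str, int], restored: dict[str, int]
) -> list[CheckResult]:
    """Compare row counts recorded at backup time with the restored database."""
    results = []
    for table, expected in sorted(recorded.items()):
        actual = restored.get(table)
        if actual is None:
            results.append(CheckResult(table, False, "table missing after restore"))
        else:
            detail = f"expected {expected}, found {actual}"
            results.append(CheckResult(table, actual == expected, detail))
    return results


def overall_status(results: list[CheckResult]) -> str:
    """Collapse individual checks into the pass/fail the weekly job reports."""
    return "PASS" if results and all(r.passed for r in results) else "FAIL"
```

An empty result list is reported as FAIL on purpose: a verification run that checked nothing should not look like a healthy backup.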

Technical Requirements

  • WAL-G for PostgreSQL backup and WAL archiving (or RDS automated backups)
  • Backup bucket: s3://pair-backups/{org}/{type}/{timestamp}/ with SSE-KMS encryption
  • Cross-region replication: S3 replication rule on packages bucket
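  • Restore script: bash/TypeScript script that automates: fetch base backup → WAL replay to the target time → health check. pg_restore applies only when the backup is a logical pg_dump archive, which cannot be combined with WAL replay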
  • Verification: cron that runs restore → SELECT count(*) FROM organizations + sample checks → report
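
The bucket layout `s3://pair-backups/{org}/{type}/{timestamp}/` can be produced by a small helper, sketched below. The timestamp format is not defined in this story; the compact UTC form used here (`YYYYMMDDTHHMMSSZ`) is an assumption, picked because it sorts lexicographically in chronological order. The function name and the example `org` and `type` values are also hypothetical.

```python
from datetime import datetime, timezone

BACKUP_BUCKET = "pair-backups"  # from the story


def backup_prefix(org: str, backup_type: str, taken_at: datetime) -> str:
    """Build the S3 prefix for one backup; taken_at must be timezone-aware."""
    if taken_at.tzinfo is None:
        raise ValueError("taken_at must be timezone-aware")
    # Assumed format: compact ISO 8601 in UTC, so string order equals time order.
    stamp = taken_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"s3://{BACKUP_BUCKET}/{org}/{backup_type}/{stamp}/"
```

Rejecting naive datetimes avoids silently mixing local time with the 02:00 UTC schedule.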

Technical Risks and Mitigation

Risk | Impact | Probability | Mitigation Strategy
Restore takes longer than RTO under large data | High | Medium | Regular DR drills; optimize pg_restore parallelism
WAL archiving causes DB performance degradation | Medium | Low | Monitor WAL lag; tune checkpoint frequency
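
The first risk can be tracked between drills with a rough estimate: restore time is approximately the time to fetch the base backup, plus the time to replay WAL, plus fixed overhead. The sketch below encodes that arithmetic. Every throughput figure and the 15-minute overhead are placeholders to be replaced with numbers measured in a DR drill; only the 4-hour RTO comes from this story.

```python
from datetime import timedelta

RTO = timedelta(hours=4)  # from the story


def estimated_restore_time(
    base_backup_gib: float,
    fetch_gib_per_min: float,
    wal_gib: float,
    replay_gib_per_min: float,
    fixed_overhead: timedelta = timedelta(minutes=15),  # placeholder value
) -> timedelta:
    """Estimate restore duration as fetch time + WAL replay time + overhead."""
    minutes = base_backup_gib / fetch_gib_per_min + wal_gib / replay_gib_per_min
    return fixed_overhead + timedelta(minutes=minutes)


def fits_rto(estimate: timedelta) -> bool:
    """True when the estimate is strictly below the RTO."""
    return estimate < RTO
```

For example, a 500 GiB base backup fetched at 5 GiB/min plus 20 GiB of WAL replayed at 0.5 GiB/min works out to 100 + 40 + 15 = 155 minutes, inside the 4-hour target under these assumed rates.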

Spike Requirements

Required Spikes: None

Metadata

Assignees: No one assigned
Labels: user story
Type: No type
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests