[C4GT Community] Automate anonymized production DB snapshot for UAT #112

@drtechie

Description

🎯 Problem Statement

We want to periodically refresh our UAT environment database using production data, but in a fully anonymized form so that no real personally identifiable information (PII) or sensitive health data is exposed in non-production environments.

Right now, there is no standardized, automated way to:

  1. Take a production DB backup,
  2. Anonymize sensitive data,
  3. Restore the anonymized backup into UAT.

This makes it harder to:

  • Test against realistic datasets,
  • Reproduce production bugs,
  • Share UAT access safely with contributors, testers, and external partners.

🎯 Goal / Expected Outcome

Create a repeatable, script-driven workflow to:

  1. Take a backup of the production database (triggered manually or via CI, but always from a secure environment).
  2. Anonymize all sensitive fields (name, phone, address, ABHA ID, Beneficiary ID, etc.) using deterministic or realistic fake data.
  3. Create a SQL dump that can be restored to UAT.
  4. Make this process documented, auditable, and safe to run periodically.

Outcome: We should be able to run one command or pipeline (preferably a Jenkins job) and end up with a fresh, anonymized UAT database that is safe for testing and community contributions.
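
For illustration, that could look something like the following; the script name matches the one proposed under "Proposed Approach" below, and the flags are hypothetical:

```bash
# Hypothetical one-command UAT refresh; run only from secure infra (bastion / CI agent).
# The script name is the one proposed below; the flags are illustrative, not implemented.
./anonymize_and_refresh_uat.sh --source-env prod --target-env uat
```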


🔐 Data Privacy & Security Constraints

  • Anonymization must cover all PII / sensitive fields, including:
    • Patient names
    • Phone numbers
    • Addresses / location info (as required)
    • ABHA ID / other identifiers
    • Any other PII in free-text or notes fields (as feasible)
  • IDs/keys should remain structurally consistent (a sketch follows this list) so that:
    • Foreign key relationships still work,
    • Workflows & analytics can still be tested.
  • The script must not be run from untrusted machines; only from secure infra (e.g., bastion / CI job with proper permissions).
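
A minimal sketch of the "structurally consistent" point, assuming MySQL and hypothetical table/column names: a salted, deterministic hash maps the same source identifier to the same masked value in every table it appears in, so joins and foreign keys keep working.

```bash
# Deterministic masking sketch (MySQL assumed; table/column names are hypothetical).
# The same salt + SHA2 yields the same masked value wherever abha_id appears,
# so rows that joined before anonymization still join afterwards.
# Truncate/format the hash to the column's expected shape if needed.
# $TMP_DB is the temporary working copy described under "Proposed Approach";
# connection flags/credentials are omitted for brevity.
mysql "$TMP_DB" <<'SQL'
SET @salt = 'rotate-this-per-refresh';

UPDATE t_beneficiary
   SET abha_id = SHA2(CONCAT(@salt, abha_id), 256)
 WHERE abha_id IS NOT NULL;

UPDATE t_health_record
   SET abha_id = SHA2(CONCAT(@salt, abha_id), 256)
 WHERE abha_id IS NOT NULL;
SQL
```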

🛠️ Proposed Approach

This is a starting point and can be refined during implementation; rough sketches for each step follow the numbered list.

  1. Backup Step

    • Use existing DB tooling (e.g., mysqldump, pg_dump or equivalent for our DB) to take a production backup.
  2. Anonymization Step

    • Write a script (e.g., Python / shell + SQL) that:
      • Restores the dump into a temporary database.
      • Runs a series of UPDATE statements / functions to:
        • Replace names with fake but realistic ones.
        • Replace phone numbers with dummy patterns.
        • Mask or hash national IDs / ABHA IDs / Beneficiary IDs as required.
        • Scrub or partially mask addresses.
      • Optionally, log a summary of what was anonymized (count of rows per table).
  3. Export Anonymized Dump

    • Export the anonymized DB into a new dump file ready for UAT.
  4. Restore into UAT

    • Drop/replace UAT DB (or use a separate schema depending on our infra).
    • Restore anonymized dump.
    • Run any post-migration scripts (migrations, seeds, etc.).
  5. Automate

    • Wrap the above into:
      • A single script (anonymize_and_refresh_uat.sh / anonymize_refresh.py), and/or
      • A CI/CD pipeline job (GitHub Actions / Jenkins / etc.) with a manual trigger.
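
Sketch of step 1, assuming MySQL (swap in pg_dump if we are on Postgres); the environment variables are placeholders supplied by the CI credential store:

```bash
# Step 1 sketch: consistent production backup (MySQL assumed; variables are placeholders).
set -euo pipefail

export MYSQL_PWD="$PROD_DB_PASSWORD"   # keeps the password off the command line and out of logs
DUMP_FILE="prod_$(date +%Y%m%d_%H%M%S).sql"

# --single-transaction gives a consistent InnoDB snapshot without locking tables.
mysqldump --single-transaction --routines --triggers \
  -h "$PROD_DB_HOST" -u "$PROD_DB_USER" "$PROD_DB_NAME" > "$DUMP_FILE"
```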
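
Sketch of step 2: restore the dump into a throwaway database and run the masking SQL there. Table and column names below are hypothetical; the real list comes from the sensitive-data inventory in the task breakdown.

```bash
# Step 2 sketch: restore into a temp DB and anonymize in place (names are hypothetical).
# Work-DB credentials are placeholders; supply them the same way as in step 1.
TMP_DB="amrit_anon_tmp"

mysql -h "$WORK_DB_HOST" -u "$WORK_DB_USER" \
  -e "DROP DATABASE IF EXISTS ${TMP_DB}; CREATE DATABASE ${TMP_DB};"
mysql -h "$WORK_DB_HOST" -u "$WORK_DB_USER" "$TMP_DB" < "$DUMP_FILE"

mysql -h "$WORK_DB_HOST" -u "$WORK_DB_USER" "$TMP_DB" <<'SQL'
-- Fake but realistic-looking names, derived from the row id so reruns are deterministic.
UPDATE t_beneficiary
   SET first_name = CONCAT('Test', beneficiary_id),
       last_name  = 'User';

-- Dummy phone numbers that still match a valid 10-digit pattern.
UPDATE t_beneficiary
   SET phone_no = CONCAT('99999', LPAD(beneficiary_id MOD 100000, 5, '0'));

-- Scrub free-text address fields.
UPDATE t_beneficiary
   SET address_line1 = 'REDACTED', address_line2 = NULL;

-- Log a rough summary of what was touched.
SELECT 't_beneficiary' AS table_name, COUNT(*) AS rows_anonymized FROM t_beneficiary;
SQL
```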
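
Sketch of steps 3 and 4 together: dump the anonymized temp database, replace the UAT schema with it, then run whatever post-restore migrations/seeds the app expects (commands are illustrative).

```bash
# Steps 3-4 sketch: export the anonymized DB and replace the UAT schema with it.
ANON_DUMP="uat_anonymized_$(date +%Y%m%d).sql"

mysqldump --single-transaction --routines --triggers \
  -h "$WORK_DB_HOST" -u "$WORK_DB_USER" "$TMP_DB" > "$ANON_DUMP"

mysql -h "$UAT_DB_HOST" -u "$UAT_DB_USER" \
  -e "DROP DATABASE IF EXISTS ${UAT_DB_NAME}; CREATE DATABASE ${UAT_DB_NAME};"
mysql -h "$UAT_DB_HOST" -u "$UAT_DB_USER" "$UAT_DB_NAME" < "$ANON_DUMP"

# Post-restore migrations/seeds are project-specific, e.g.:
# ./run_migrations.sh --env uat   # hypothetical
```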
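
Sketch of step 5: a thin wrapper chaining the pieces, which a Jenkins job with a manual trigger (or a GitHub Actions workflow_dispatch) could call. Everything here is illustrative, including the file names.

```bash
#!/usr/bin/env bash
# anonymize_and_refresh_uat.sh -- illustrative wrapper chaining the steps above.
set -euo pipefail

./01_backup_prod.sh        # step 1: backup from production
./02_anonymize_tmp_db.sh   # step 2: restore into temp DB + masking SQL
./03_export_anon_dump.sh   # step 3: export the anonymized dump
./04_restore_uat.sh        # step 4: drop/recreate UAT and load the dump

echo "UAT refresh complete at $(date -u +%FT%TZ)"
```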

✅ Acceptance Criteria

  • A script or set of scripts exists to:
    • Take a production backup (or work from an existing backup).
    • Anonymize all defined PII fields.
    • Restore the anonymized DB into UAT.
  • Clear documentation is added under docs/ (or a README section) of the AMRIT-DB repo explaining:
    • Prerequisites & environment variables.
    • How to run the anonymization+restore flow step by step.
    • Which tables/fields are anonymized and how.
  • The process is tested on a non-production copy of the DB first.
  • The UAT DB after refresh:
    • Works for core application flows.
    • Contains no real PII in inspected tables.
  • Production credentials or raw dumps are never exposed in logs or public repos.

🧩 Breakdown of Tasks

  1. Inventory of Sensitive Data

    • Identify all tables and fields with PII/sensitive data.
    • Document them in the issue or a linked doc.
  2. Design Anonymization Rules

    • Decide on masking strategies:
      • Names → fake names
      • Phone numbers → dummy format but valid pattern
      • IDs → hashed or random but unique
      • Addresses → generalized or fake
    • Ensure referential integrity is preserved.
  3. Implement Anonymization Script

    • Implement SQL-based or language-based (Python, etc.) anonymization functions.
    • Make it configurable via environment variables (DB host, user, etc.); a sketch follows this list.
  4. Implement Backup & Restore Flow

    • Script backup & restore steps around the anonymization step.
    • Add safety checks (e.g., refuse to run if the target DB is production); a sketch follows this list.
  5. Document the Process

    • Add a documentation page, e.g. docs/db-anonymization-and-uat-refresh.md.
    • Include examples of commands, environment configuration, and safety notes.
  6. Dry Runs & Validation

    • Test on a staging copy of the DB.
    • Verify application behavior and the anonymization manually (a scripted spot-check sketch follows this list).
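
For task 3, a minimal sketch of env-driven configuration (variable names are placeholders; real values come from the CI credential store, never the repo):

```bash
# Task 3 sketch: fail fast when required configuration is missing (names are placeholders).
set -euo pipefail

: "${SOURCE_DB_HOST:?SOURCE_DB_HOST is required}"
: "${SOURCE_DB_USER:?SOURCE_DB_USER is required}"
: "${SOURCE_DB_NAME:?SOURCE_DB_NAME is required}"
: "${TARGET_DB_HOST:?TARGET_DB_HOST is required}"
: "${TARGET_DB_NAME:?TARGET_DB_NAME is required}"
# Passwords should be injected by the CI job (e.g., as MYSQL_PWD) and never echoed or logged.
```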
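
For task 4, one possible guard (the host/name patterns are hypothetical and should match how our production instances are actually named):

```bash
# Task 4 sketch: refuse to restore into anything that looks like production.
if [[ "${TARGET_DB_HOST}" == *prod* || "${TARGET_DB_NAME}" == *prod* ]]; then
  echo "ERROR: target ${TARGET_DB_HOST}/${TARGET_DB_NAME} looks like production; aborting." >&2
  exit 1
fi
```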
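
For task 6, part of the validation could be scripted as a spot-check that fails the run if obviously real-looking data survives (table/column names are hypothetical; manual review of core application flows is still needed):

```bash
# Task 6 sketch: fail if any phone number survived outside the dummy 99999xxxxx range.
LEFTOVER=$(mysql -N -h "$TARGET_DB_HOST" -u "$TARGET_DB_USER" "$TARGET_DB_NAME" \
  -e "SELECT COUNT(*) FROM t_beneficiary WHERE phone_no IS NOT NULL AND phone_no NOT LIKE '99999%';")

if [[ "$LEFTOVER" -ne 0 ]]; then
  echo "ERROR: ${LEFTOVER} rows still contain non-anonymized phone numbers." >&2
  exit 1
fi
```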


🙌 Who is this good for?

  • Contributors interested in DevOps / Data Engineering / Security.
  • Folks who like working with databases, automation, and privacy-aware tooling.
