🎯 Problem Statement
We want to periodically refresh our UAT environment database using production data, but in a fully anonymized form so that no real personally identifiable information (PII) or sensitive health data is exposed in non-production environments.
Right now, there is no standardized, automated way to:
- Take a production DB backup,
- Anonymize sensitive data,
- Restore the anonymized backup into UAT.
This makes it harder to:
- Test against realistic datasets,
- Reproduce production bugs,
- Share UAT access safely with contributors, testers, and external partners.
🎯 Goal / Expected Outcome
Create a repeatable, script-driven workflow to:
- Take a backup of the production database (triggered manually or via CI, but always from a secure environment).
- Anonymize all sensitive fields (name, phone, address, ABHA ID, Beneficiary ID, etc.) using deterministic or realistic fake data.
- Create a SQL dump that can be restored to UAT.
- Make this process documented, auditable, and safe to run periodically.
Outcome: We should be able to run a single command or pipeline (preferably in Jenkins) and end up with a fresh, anonymized UAT database that is safe for testing and community contributions.
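For illustration, the intended developer experience could be as simple as the invocation below; the wrapper script name is taken from the proposal further down, and the connection variables are hypothetical placeholders, not an existing interface:

```bash
# Illustrative only: one manual trigger from a secure host or a Jenkins job.
# The wrapper script and the variable names/values are placeholders for
# whatever we actually implement.
PROD_DB_URL="postgresql://readonly_user@prod-db.internal:5432/amrit" \
UAT_DB_URL="postgresql://deploy_user@uat-db.internal:5432/amrit" \
  ./anonymize_and_refresh_uat.sh
```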
🔐 Data Privacy & Security Constraints
- Anonymization must cover all PII / sensitive fields, including:
  - Patient names
  - Phone numbers
  - Addresses / location info (as required)
  - ABHA ID / other identifiers
  - Any other PII in free-text or notes fields (as feasible)
- IDs/keys should remain structurally consistent so that:
  - Foreign key relationships still work,
  - Workflows & analytics can still be tested.
- The script must not be run from untrusted machines; only from secure infra (e.g., bastion / CI job with proper permissions).
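To illustrate the "structurally consistent" requirement, here is a minimal sketch, assuming PostgreSQL and purely illustrative table/column names (`beneficiary`, `visit`, `abha_id` are not the real AMRIT-DB schema): deterministic masking maps equal source values to equal masked values, so cross-table references and joins keep lining up.

```bash
#!/usr/bin/env bash
# Sketch only: deterministic masking so identical source IDs map to identical
# masked IDs wherever they appear. Table and column names are placeholders.
set -euo pipefail

TEMP_DB_URL="${TEMP_DB_URL:?set to the temporary (non-production) database URL}"

psql --set=ON_ERROR_STOP=1 "$TEMP_DB_URL" <<'SQL'
BEGIN;

-- Replace ABHA IDs with a pseudonym derived deterministically from the
-- original value. md5 is used only because equal inputs give equal outputs;
-- it is not meant as "secure" hashing on its own.
UPDATE beneficiary
SET abha_id = 'ABHA-' || substr(md5(abha_id), 1, 14)
WHERE abha_id IS NOT NULL;

-- The same expression applied to another table that carries a copy of the
-- identifier keeps joins consistent. Numeric surrogate keys used for actual
-- foreign-key constraints are not PII and can usually be left untouched;
-- if a constraint does exist on this column, it may need to be deferred.
UPDATE visit
SET abha_id = 'ABHA-' || substr(md5(abha_id), 1, 14)
WHERE abha_id IS NOT NULL;

COMMIT;
SQL
```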
🛠️ Proposed Approach
This can be refined during implementation – this is a starting point.
- Backup Step
  - Use existing DB tooling (e.g., `mysqldump`, `pg_dump`, or equivalent for our DB) to take a production backup.
- Anonymization Step
  - Write a script (e.g., Python / shell + SQL) that:
    - Restores the dump into a temporary database.
    - Runs a series of `UPDATE` statements / functions to:
      - Replace names with fake but realistic ones.
      - Replace phone numbers with dummy patterns.
      - Mask or hash national IDs / ABHA IDs / Beneficiary IDs as required.
      - Scrub or partially mask addresses.
    - Optionally, log a summary of what was anonymized (count of rows per table).
- Export Anonymized Dump
  - Export the anonymized DB into a new dump file ready for UAT.
- Restore into UAT
  - Drop/replace UAT DB (or use a separate schema depending on our infra).
  - Restore anonymized dump.
  - Run any post-migration scripts (migrations, seeds, etc.).
- Automate
  - Wrap the above into:
    - A single script (`anonymize_and_refresh_uat.sh` / `anonymize_refresh.py`), and/or
    - A CI/CD pipeline job (GitHub Actions / Jenkins / etc.) with manual trigger.
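To make these steps concrete, here is a minimal sketch of the end-to-end flow, assuming PostgreSQL, placeholder connection variables, and a hypothetical `anonymize.sql` file that holds the reviewed `UPDATE` rules:

```bash
#!/usr/bin/env bash
# Sketch only: backup -> anonymize -> export -> restore, assuming PostgreSQL.
# Connection strings, file names, and anonymize.sql are placeholders.
set -euo pipefail

PROD_DB_URL="${PROD_DB_URL:?production connection string (read-only user)}"
TEMP_DB_URL="${TEMP_DB_URL:?temporary scratch database, never production}"
UAT_DB_URL="${UAT_DB_URL:?UAT connection string}"
STAMP="$(date +%Y%m%d_%H%M%S)"

# 1. Backup: custom-format dump of production, taken from a secure host/CI job.
pg_dump --no-owner --format=custom --file="prod_${STAMP}.dump" "$PROD_DB_URL"

# 2. Anonymization: restore into an existing, empty scratch DB, then apply the
#    reviewed SQL rules.
pg_restore --no-owner --dbname="$TEMP_DB_URL" "prod_${STAMP}.dump"
psql --set=ON_ERROR_STOP=1 "$TEMP_DB_URL" -f anonymize.sql

# 3. Export: dump the now-anonymized scratch DB for UAT.
pg_dump --no-owner --format=custom --file="uat_anon_${STAMP}.dump" "$TEMP_DB_URL"

# 4. Restore into UAT (drop/replace strategy depends on our infra); any
#    migrations/seeds would run after this step.
pg_restore --no-owner --clean --if-exists --dbname="$UAT_DB_URL" "uat_anon_${STAMP}.dump"
```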
✅ Acceptance Criteria
- A script or set of scripts exists to:
  - Take a production backup (or work from an existing backup).
  - Anonymize all defined PII fields.
  - Restore the anonymized DB into UAT.
- Clear documentation is added under `docs/` (or a README section) of the AMRIT-DB repo explaining:
  - Prerequisites & environment variables.
  - How to run the anonymization + restore flow step by step.
  - Which tables/fields are anonymized and how.
- The process is tested on a non-production copy of the DB first.
- The UAT DB after refresh:
  - Works for core application flows.
  - Contains no real PII in inspected tables.
- Production credentials or raw dumps are never exposed in logs or public repos.
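As one way to check the "no real PII in inspected tables" criterion, a small spot check could count obviously non-anonymized values before the dump is promoted to UAT. The table/column names and the dummy phone prefix below are assumptions, not the real schema or masking rule:

```bash
#!/usr/bin/env bash
# Sketch only: verify anonymization actually ran before promoting the dump.
# Table/column names and the '99999' dummy prefix are placeholders.
set -euo pipefail

TEMP_DB_URL="${TEMP_DB_URL:?anonymized scratch database URL}"

# Count phone numbers that do not follow the agreed dummy pattern;
# anything non-zero should fail the run.
LEFTOVER=$(psql -At "$TEMP_DB_URL" -c \
  "SELECT count(*) FROM beneficiary WHERE phone_no IS NOT NULL AND phone_no NOT LIKE '99999%';")

if [ "$LEFTOVER" -ne 0 ]; then
  echo "ERROR: ${LEFTOVER} rows still carry non-dummy phone numbers; aborting." >&2
  exit 1
fi
echo "Spot check passed: no unexpected phone patterns found."
```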
🧩 Breakdown of Tasks
- Inventory of Sensitive Data
  - Identify all tables and fields with PII/sensitive data.
  - Document them in the issue or a linked doc.
- Design Anonymization Rules
  - Decide on masking strategies:
    - Names → fake names
    - Phone numbers → dummy format but valid pattern
    - IDs → hashed or random but unique
    - Addresses → generalized or fake
  - Ensure referential integrity is preserved.
- Implement Anonymization Script
  - Implement SQL-based or language-based (Python, etc.) anonymization functions.
  - Make it configurable via environment variables (DB host, user, etc.).
- Implement Backup & Restore Flow
  - Script backup & restore steps around the anonymization step.
  - Add safety checks (e.g., “Refuse to run if target DB is production”); see the sketch after this list.
- Document the Process
  - Add a documentation page, e.g., `docs/db-anonymization-and-uat-refresh.md`.
  - Include examples of commands, environment configuration, and safety notes.
- Dry Runs & Validation
  - Test on a staging copy of the DB.
  - Verify application behavior and data anonymization manually.
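For the safety check mentioned above, a minimal guard could refuse to run destructive steps against anything that looks like production. The variable name, the host-naming convention, and the confirmation flag are assumptions for illustration:

```bash
#!/usr/bin/env bash
# Sketch only: refuse to run the anonymize/restore flow against production.
# TARGET_DB_URL and the "prod" naming convention are assumptions.
set -euo pipefail

TARGET_DB_URL="${TARGET_DB_URL:?database the script is about to modify}"

# Conservative guard: abort if the target connection string looks like
# production, and require an explicit acknowledgement for anything else.
case "$TARGET_DB_URL" in
  *prod*|*production*)
    echo "ERROR: refusing to run against what looks like a production database." >&2
    exit 1
    ;;
esac

if [ "${I_UNDERSTAND_THIS_WILL_DROP_DATA:-}" != "yes" ]; then
  echo "Set I_UNDERSTAND_THIS_WILL_DROP_DATA=yes to confirm the target can be replaced." >&2
  exit 1
fi

echo "Safety checks passed; proceeding with non-production target."
```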
🔗 References
- AMRIT-DB GitHub: https://github.com/PSMRI/AMRIT-DB/
🙌 Who is this good for?
- Contributors interested in DevOps / Data Engineering / Security.
- Folks who like working with databases, automation, and privacy-aware tooling.