🎯 Problem Statement
We want to periodically refresh our UAT environment database using production data, but in a fully anonymized form so that no real personally identifiable information (PII) or sensitive health data is exposed in non-production environments.
Right now, there is no standardized, automated way to:
- Take a production DB backup,
- Anonymize sensitive data,
- Restore the anonymized backup into UAT.
This makes it harder to:
- Test against realistic datasets,
- Reproduce production bugs,
- Share UAT access safely with contributors, testers, and external partners.
🎯 Goal / Expected Outcome
Create a repeatable, script-driven workflow to:
- Take a backup of the production database (triggered manually or via CI, but always from a secure environment).
- Anonymize all sensitive fields (name, phone, address, ABHA ID, Beneficiary ID, etc.) using deterministic or realistic fake data.
- Create a SQL dump that can be restored to UAT.
- Make this process documented, auditable, and safe to run periodically.
Outcome: We should be able to run a single command or pipeline (preferably in Jenkins) and end up with a fresh, anonymized UAT database that is safe for testing and community contributions.
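For illustration, the intended developer experience could be as simple as the invocation below; the wrapper script name is taken from the proposal further down, and the connection variables are hypothetical placeholders, not an existing interface:

```bash
# Illustrative only: one manual trigger from a secure host or a Jenkins job.
# The wrapper script and the variable names/values are placeholders for
# whatever we actually implement.
PROD_DB_URL="postgresql://readonly_user@prod-db.internal:5432/amrit" \
UAT_DB_URL="postgresql://deploy_user@uat-db.internal:5432/amrit" \
  ./anonymize_and_refresh_uat.sh
```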
🔐 Data Privacy & Security Constraints
- Anonymization must cover all PII / sensitive fields, including:
  - Patient names
  - Phone numbers
  - Addresses / location info (as required)
  - ABHA ID / other identifiers
  - Any other PII in free-text or notes fields (as feasible)
- IDs/keys should remain structurally consistent so that:
  - Foreign key relationships still work,
  - Workflows & analytics can still be tested.
- The script must not be run from untrusted machines; only from secure infra (e.g., bastion / CI job with proper permissions).
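To illustrate the "structurally consistent" requirement, here is a minimal sketch, assuming PostgreSQL and purely illustrative table/column names (`beneficiary`, `visit`, `abha_id` are not the real AMRIT-DB schema): deterministic masking maps equal source values to equal masked values, so cross-table references and joins keep lining up.

```bash
#!/usr/bin/env bash
# Sketch only: deterministic masking so identical source IDs map to identical
# masked IDs wherever they appear. Table and column names are placeholders.
set -euo pipefail

TEMP_DB_URL="${TEMP_DB_URL:?set to the temporary (non-production) database URL}"

psql --set=ON_ERROR_STOP=1 "$TEMP_DB_URL" <<'SQL'
BEGIN;

-- Replace ABHA IDs with a pseudonym derived deterministically from the
-- original value. md5 is used only because equal inputs give equal outputs;
-- it is not meant as "secure" hashing on its own.
UPDATE beneficiary
SET abha_id = 'ABHA-' || substr(md5(abha_id), 1, 14)
WHERE abha_id IS NOT NULL;

-- The same expression applied to another table that carries a copy of the
-- identifier keeps joins consistent. Numeric surrogate keys used for actual
-- foreign-key constraints are not PII and can usually be left untouched;
-- if a constraint does exist on this column, it may need to be deferred.
UPDATE visit
SET abha_id = 'ABHA-' || substr(md5(abha_id), 1, 14)
WHERE abha_id IS NOT NULL;

COMMIT;
SQL
```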
🛠️ Proposed Approach
This can be refined during implementation – this is a starting point.
- Backup Step
  - Use existing DB tooling (e.g., `mysqldump`, `pg_dump`, or equivalent for our DB) to take a production backup.
- Anonymization Step
  - Write a script (e.g., Python / shell + SQL) that:
    - Restores the dump into a temporary database.
    - Runs a series of `UPDATE` statements / functions to:
      - Replace names with fake but realistic ones.
      - Replace phone numbers with dummy patterns.
      - Mask or hash national IDs / ABHA IDs / Beneficiary IDs as required.
      - Scrub or partially mask addresses.
    - Optionally, log a summary of what was anonymized (count of rows per table).
- Export Anonymized Dump
  - Export the anonymized DB into a new dump file ready for UAT.
- Restore into UAT
  - Drop/replace UAT DB (or use a separate schema depending on our infra).
  - Restore anonymized dump.
  - Run any post-migration scripts (migrations, seeds, etc.).
- Automate
  - Wrap the above into:
    - A single script (`anonymize_and_refresh_uat.sh` / `anonymize_refresh.py`), and/or
    - A CI/CD pipeline job (GitHub Actions / Jenkins / etc.) with manual trigger.
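To make these steps concrete, here is a minimal sketch of the end-to-end flow, assuming PostgreSQL, placeholder connection variables, and a hypothetical `anonymize.sql` file that holds the reviewed `UPDATE` rules:

```bash
#!/usr/bin/env bash
# Sketch only: backup -> anonymize -> export -> restore, assuming PostgreSQL.
# Connection strings, file names, and anonymize.sql are placeholders.
set -euo pipefail

PROD_DB_URL="${PROD_DB_URL:?production connection string (read-only user)}"
TEMP_DB_URL="${TEMP_DB_URL:?temporary scratch database, never production}"
UAT_DB_URL="${UAT_DB_URL:?UAT connection string}"
STAMP="$(date +%Y%m%d_%H%M%S)"

# 1. Backup: custom-format dump of production, taken from a secure host/CI job.
pg_dump --no-owner --format=custom --file="prod_${STAMP}.dump" "$PROD_DB_URL"

# 2. Anonymization: restore into an existing, empty scratch DB, then apply the
#    reviewed SQL rules.
pg_restore --no-owner --dbname="$TEMP_DB_URL" "prod_${STAMP}.dump"
psql --set=ON_ERROR_STOP=1 "$TEMP_DB_URL" -f anonymize.sql

# 3. Export: dump the now-anonymized scratch DB for UAT.
pg_dump --no-owner --format=custom --file="uat_anon_${STAMP}.dump" "$TEMP_DB_URL"

# 4. Restore into UAT (drop/replace strategy depends on our infra); any
#    migrations/seeds would run after this step.
pg_restore --no-owner --clean --if-exists --dbname="$UAT_DB_URL" "uat_anon_${STAMP}.dump"
```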
✅ Acceptance Criteria
- A script or set of scripts exists to:
  - Take a production backup (or work from an existing backup).
  - Anonymize all defined PII fields.
  - Restore the anonymized DB into UAT.
- Clear documentation is added under `docs/` (or a README section) of the AMRIT-DB repo explaining:
  - Prerequisites & environment variables.
  - How to run the anonymization + restore flow step by step.
  - Which tables/fields are anonymized and how.
- The process is tested on a non-production copy of the DB first.
- The UAT DB after refresh:
  - Works for core application flows.
  - Contains no real PII in inspected tables.
- Production credentials or raw dumps are never exposed in logs or public repos.
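As one way to check the "no real PII in inspected tables" criterion, a small spot check could count obviously non-anonymized values before the dump is promoted to UAT. The table/column names and the dummy phone prefix below are assumptions, not the real schema or masking rule:

```bash
#!/usr/bin/env bash
# Sketch only: verify anonymization actually ran before promoting the dump.
# Table/column names and the '99999' dummy prefix are placeholders.
set -euo pipefail

TEMP_DB_URL="${TEMP_DB_URL:?anonymized scratch database URL}"

# Count phone numbers that do not follow the agreed dummy pattern;
# anything non-zero should fail the run.
LEFTOVER=$(psql -At "$TEMP_DB_URL" -c \
  "SELECT count(*) FROM beneficiary WHERE phone_no IS NOT NULL AND phone_no NOT LIKE '99999%';")

if [ "$LEFTOVER" -ne 0 ]; then
  echo "ERROR: ${LEFTOVER} rows still carry non-dummy phone numbers; aborting." >&2
  exit 1
fi
echo "Spot check passed: no unexpected phone patterns found."
```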
🧩 Breakdown of Tasks
- Inventory of Sensitive Data
  - Identify all tables and fields with PII/sensitive data.
  - Document them in the issue or a linked doc.
- Design Anonymization Rules
  - Decide on masking strategies:
    - Names → fake names
    - Phone numbers → dummy format but valid pattern
    - IDs → hashed or random but unique
    - Addresses → generalized or fake
  - Ensure referential integrity is preserved.
- Implement Anonymization Script
  - Implement SQL-based or language-based (Python, etc.) anonymization functions.
  - Make it configurable via environment variables (DB host, user, etc.).
- Implement Backup & Restore Flow
  - Script backup & restore steps around the anonymization step.
  - Add safety checks (e.g., “Refuse to run if target DB is production”); see the sketch after this list.
- Document the Process
  - Add a documentation page, e.g., `docs/db-anonymization-and-uat-refresh.md`.
  - Include examples of commands, environment configuration, and safety notes.
- Dry Runs & Validation
  - Test on a staging copy of the DB.
  - Verify application behavior and data anonymization manually.
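For the safety check mentioned above, a minimal guard could refuse to run destructive steps against anything that looks like production. The variable name, the host-naming convention, and the confirmation flag are assumptions for illustration:

```bash
#!/usr/bin/env bash
# Sketch only: refuse to run the anonymize/restore flow against production.
# TARGET_DB_URL and the "prod" naming convention are assumptions.
set -euo pipefail

TARGET_DB_URL="${TARGET_DB_URL:?database the script is about to modify}"

# Conservative guard: abort if the target connection string looks like
# production, and require an explicit acknowledgement for anything else.
case "$TARGET_DB_URL" in
  *prod*|*production*)
    echo "ERROR: refusing to run against what looks like a production database." >&2
    exit 1
    ;;
esac

if [ "${I_UNDERSTAND_THIS_WILL_DROP_DATA:-}" != "yes" ]; then
  echo "Set I_UNDERSTAND_THIS_WILL_DROP_DATA=yes to confirm the target can be replaced." >&2
  exit 1
fi

echo "Safety checks passed; proceeding with non-production target."
```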
🔗 References
- AMRIT-DB GitHub: https://github.com/PSMRI/AMRIT-DB/
🙌 Who is this good for?
- Contributors interested in DevOps / Data Engineering / Security.
- Folks who like working with databases, automation, and privacy-aware tooling.