CloudNinjaDev/replication-db

MongoDB Database Replication & Anonymization

Automated infrastructure for deploying MongoDB on AWS EC2 with EBS volumes, and safely cloning production data to staging environments with PII anonymization.

Workflow badges: Terraform CI | Terraform CD | Prod to Staging Sync | Cleanup Resources


📅 Last Restore Status

| Environment | Last Restored | Status | Documents | Anonymized | Duration |
| --- | --- | --- | --- | --- | --- |
| Staging | 2026-03-03 10:51 UTC | ✅ Success | 10 | ✅ Yes | 2m 56s |

💡 Tip: This table is automatically updated after each successful sync using the GitHub Actions workflow.
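
As a rough illustration of how such a status row could be generated, the sketch below formats a sync result as one pipe-delimited table row. The function names, the input field names, and the failure labels are assumptions made for this example; the workflow's actual update step may work differently.

```javascript
// Formats a duration in whole seconds as "Xm Ys", the style used in the status table.
function formatDuration(totalSeconds) {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}m ${seconds}s`;
}

// Builds one pipe-delimited row. Field names and the failure labels are
// assumptions for this sketch, not taken from the workflow.
function statusRow({ environment, restoredAt, success, documents, anonymized, seconds }) {
  // toISOString() is always UTC, e.g. "2026-03-03T10:51:00.000Z".
  const stamp = restoredAt.toISOString().slice(0, 16).replace("T", " ") + " UTC";
  return [
    environment,
    stamp,
    success ? "✅ Success" : "❌ Failed",
    String(documents),
    anonymized ? "✅ Yes" : "❌ No",
    formatDuration(seconds),
  ].join(" | ");
}

const row = statusRow({
  environment: "Staging",
  restoredAt: new Date("2026-03-03T10:51:00Z"),
  success: true,
  documents: 10,
  anonymized: true,
  seconds: 176,
});
console.log(row);
// Staging | 2026-03-03 10:51 UTC | ✅ Success | 10 | ✅ Yes | 2m 56s
```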


🎯 Overview

This project provides:

  • Terraform Infrastructure: Deploy MongoDB on EC2 with dedicated EBS data volumes
  • GitHub Actions Automation: Clone production MongoDB volumes to staging
  • Data Anonymization: Automatically anonymize PII data in staging
  • Resource Management: Track and clean up old snapshots/volumes
  • CI/CD Ready: Automated workflows with self-hosted runners

📁 Project Structure

.
├── terraform/                  # Infrastructure as Code
│   ├── modules/
│   │   └── ec2/               # EC2 + EBS module for MongoDB
│   └── stacks/
│       ├── production/        # Production environment
│       └── staging/           # Staging environment
├── mongodb/                   # Database scripts
│   ├── setup_database.js     # Create DB with mock PII data
│   ├── anonymize_data.js     # Simple anonymization
│   ├── anonymize_with_hash.js # Hash-based anonymization
│   └── restore_original_data.js
├── .cleanup/                  # Resource tracking
│   ├── resource-tracker.json # Tracked snapshots/volumes
│   └── README.md            # Cleanup documentation
└── .github/
    └── workflows/
        ├── prod-to-staging-sync.yml      # Production to staging sync
        ├── cleanup-resources.yml         # Automated resource cleanup
        ├── README-SYNC.md               # Workflow documentation
        └── QUICKSTART-SYNC.md           # Quick start guide

🚀 Quick Start

1. Deploy Infrastructure with Terraform

# Deploy staging environment
cd terraform/stacks/staging
terraform init
terraform plan
terraform apply

# Deploy production environment
cd ../production
terraform init
terraform apply

What gets created:

  • EC2 instance with Amazon Linux 2023
  • Root volume (8GB) for OS
  • Data volume (20GB) for MongoDB
  • Security groups for SSH and MongoDB access
  • MongoDB 7.0 installed and configured

2. Setup Initial Database

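# Copy the setup script to the production instance (run from the repository root)
scp mongodb/setup_database.js ec2-user@<production-ip>:~

# SSH into the production instance
ssh ec2-user@<production-ip>

# Run the setup script
mongosh < setup_database.js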

3. Clone Production to Staging

Using GitHub Actions Workflow:

  1. Add AWS credentials to GitHub Secrets (AWS_SECRET_ACCESS_ID, AWS_SECRET_ACCESS_KEY)
  2. Go to Actions → "Production to Staging DB Sync"
  3. Click "Run workflow"
  4. Select anonymization option (default: enabled)
  5. Monitor progress through 9 separate jobs

📖 Quick Start: .github/workflows/QUICKSTART-SYNC.md | Full Docs: .github/workflows/README-SYNC.md

What happens during sync:

  • ✅ Validates AWS resources
  • ✅ Stops staging MongoDB
  • ✅ Creates snapshot from production
  • ✅ Swaps EBS volumes
  • ✅ Mounts new volume
  • ✅ Starts MongoDB
  • ✅ Anonymizes PII data (optional)
  • ✅ Tracks old resources for cleanup
  • ✅ Updates README with sync status

📋 Features

Infrastructure (Terraform)

  • ✅ EC2 instances with MongoDB 7.0
  • ✅ Separate EBS volumes for data
  • ✅ Auto-mounting and configuration via user_data
  • ✅ Security groups with proper access controls
  • ✅ Support for multiple environments (staging/production)
  • ✅ XFS filesystem (MongoDB recommended)

Automation (GitHub Actions)

  • ✅ Automated EBS snapshot creation
  • ✅ Volume-based replication
  • ✅ Zero-downtime for production
  • ✅ Automatic volume swap
  • ✅ Self-hosted runner on staging
  • ✅ Built-in data anonymization
  • ✅ Resource tracking for cleanup
  • ✅ Auto-update README status

Data Management

  • ✅ Mock PII data generation
  • ✅ Simple anonymization (User 1, User 2, etc.)
  • ✅ Hash-based anonymization (non-reversible)
  • ✅ Data restoration scripts
  • ✅ 10 sample users with realistic PII

🔐 Security & Compliance

PII Data Protection

The project includes two anonymization strategies:

Simple Anonymization:

  • Names → "User [ID]"
  • Email → "user[ID]@anonymized.local"
  • SSN → "XXX-XX-[ID]"
  • Address → Redacted values
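
The mapping above could look roughly like the following sketch. The document field names (name, email, ssn, address), the four-digit padding of the ID, and the "REDACTED" placeholder are assumptions for illustration; the repository's anonymize_data.js may differ.

```javascript
// Sketch of the simple strategy. Field names are assumptions; real documents
// may be shaped differently.
function anonymizeSimple(user, id) {
  return {
    ...user, // non-PII fields are carried over unchanged
    name: `User ${id}`,
    email: `user${id}@anonymized.local`,
    // Padding the ID to four digits keeps the SSN layout (an assumption).
    ssn: `XXX-XX-${String(id).padStart(4, "0")}`,
    address: "REDACTED",
  };
}

const original = {
  _id: 7,
  name: "Jane Doe",
  email: "jane.doe@example.com",
  ssn: "123-45-6789",
  address: "1 Main St",
  plan: "pro",
};

const masked = anonymizeSimple(original, 7);
console.log(masked);
```

Because the output depends only on the ID, the original values cannot be recovered from the staging data, but distinct users remain distinguishable.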

Hash-Based Anonymization:

  • One-way hash functions
  • Non-reversible transformation
  • Maintains referential consistency
  • Suitable for production-like testing
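
A minimal sketch of a deterministic one-way transformation follows. It uses a keyed hash (HMAC-SHA-256 from Node's built-in crypto module) rather than a plain hash, because unkeyed hashes of low-entropy values such as SSNs can be brute-forced. This is a swapped-in technique for illustration, not necessarily what anonymize_with_hash.js does; the ANON_SECRET variable, the fallback key, and the 16-character truncation are assumptions.

```javascript
const { createHmac } = require("node:crypto");

// Hypothetical secret; in practice it would come from a secret store.
const SECRET = process.env.ANON_SECRET || "example-only-secret";

// Deterministic one-way pseudonym: the same input and key always give the
// same output, so references between documents stay consistent.
function pseudonymize(value, secret = SECRET) {
  return createHmac("sha256", secret)
    .update(String(value))
    .digest("hex")
    .slice(0, 16); // shortened for readability (an assumption)
}

const a = pseudonymize("jane.doe@example.com");
const b = pseudonymize("jane.doe@example.com");
const c = pseudonymize("john.roe@example.com");

console.log(a === b); // true: repeated values map to the same pseudonym
console.log(a === c); // false: different values map to different pseudonyms
```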

AWS Security

  • EC2 security groups restrict access
  • EBS volumes encrypted at rest
  • SSM for secure command execution
  • IAM roles with least privilege
  • Snapshots properly tagged

📖 Documentation

🔄 GitHub Actions Workflows

Production to Staging Sync

File: prod-to-staging-sync.yml

✨ Features:

  • 9 separate jobs for granular control
  • Self-hosted runner on staging EC2
  • GitHub-hosted runner for AWS API calls
  • Detailed per-job summaries
  • Maximum visibility and debugging
  • Fail-fast at job level
  • ~3-10 minutes duration
  • Automatic README updates
  • Resource tracking for safe cleanup

📊 Jobs:

  1. Setup & Validation - Discover volume IDs
  2. Stop MongoDB - Stop service on staging
  3. Create Snapshot - Snapshot production volume
  4. Swap Volumes - Create and attach new volume
  5. Mount Volume - Mount on staging EC2
  6. Start & Verify - Start MongoDB and verify data
  7. Anonymize Data - Optional PII anonymization
  8. Track Resources - Track snapshot/volume for cleanup
  9. Final Summary - Update README and report
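
The job ordering and the fail-fast behaviour can be pictured with the conceptual model below. It is only a sketch: the real jobs are defined in prod-to-staging-sync.yml, the `run` callback is hypothetical, and the model does not cover jobs that the workflow may run regardless of earlier failures.

```javascript
// Job names taken from the list above.
const JOBS = [
  "Setup & Validation",
  "Stop MongoDB",
  "Create Snapshot",
  "Swap Volumes",
  "Mount Volume",
  "Start & Verify",
  "Anonymize Data",
  "Track Resources",
  "Final Summary",
];

// Runs jobs in order and stops at the first failure. `run` is a hypothetical
// callback that returns true on success. Anonymization is skipped when disabled.
function runSync(run, { anonymize = true } = {}) {
  const results = [];
  for (const name of JOBS) {
    if (name === "Anonymize Data" && !anonymize) {
      results.push({ name, status: "skipped" });
      continue;
    }
    const ok = run(name);
    results.push({ name, status: ok ? "success" : "failed" });
    if (!ok) break; // fail fast: later jobs never start
  }
  return results;
}

// Simulate a failure in the volume swap.
const outcome = runSync((name) => name !== "Swap Volumes");
console.log(outcome.map((r) => `${r.name}: ${r.status}`).join("\n"));
```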

Automated Resource Cleanup

File: cleanup-resources.yml

✨ Features:

  • Scheduled daily cleanup (2 AM UTC)
  • Manual trigger with dry-run option
  • Age-based deletion (default: 1 day)
  • Tracks all snapshots and volumes
  • Safe rollback capability
  • Automatic tracker file updates

Cleanup Options:

  • Scheduled: Runs daily at 2 AM UTC, deletes resources older than 1 day
  • Manual: Run on-demand with custom age threshold and dry-run preview
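
Age-based selection can be sketched as below. The tracker entry shape (id, type, createdAt) and the resource IDs are hypothetical, and the comparison here is on timestamps; the real resource-tracker.json schema and the workflow's date handling may differ.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns tracked resources created at or before the cutoff.
// With maxAgeDays = 0 the cutoff is "now", so same-day resources qualify.
function selectExpired(resources, maxAgeDays, now = new Date()) {
  const cutoff = now.getTime() - maxAgeDays * MS_PER_DAY;
  return resources.filter((r) => new Date(r.createdAt).getTime() <= cutoff);
}

// Hypothetical tracker entries.
const tracked = [
  { id: "snap-aaa", type: "snapshot", createdAt: "2026-03-01T10:00:00Z" },
  { id: "vol-bbb", type: "volume", createdAt: "2026-03-03T09:00:00Z" },
];

const now = new Date("2026-03-03T12:00:00Z");
console.log(selectExpired(tracked, 1, now).map((r) => r.id)); // [ 'snap-aaa' ]
console.log(selectExpired(tracked, 0, now).map((r) => r.id)); // [ 'snap-aaa', 'vol-bbb' ]
```

A dry run would print this selection without deleting anything, which makes it a cheap way to check the threshold before a real cleanup.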

Workflow Benefits

| Feature | Description |
| --- | --- |
| Zero Production Impact | Works on snapshots, production untouched |
| Self-Hosted Execution | MongoDB operations run on staging EC2 |
| Resource Management | Track and clean up old snapshots/volumes |
| Safety First | Dry-run mode, rollback instructions |
| Cost Efficient | Automated cleanup prevents AWS cost buildup |
| Visibility | Per-job logs and comprehensive summaries |
| Automation | Scheduled cleanup, auto-update README |

🛠️ Prerequisites

For Terraform

  • Terraform 1.0+
  • AWS CLI configured
  • AWS credentials with EC2/VPC permissions

For GitHub Actions Workflows

  • AWS credentials (stored as GitHub Secrets: AWS_SECRET_ACCESS_ID, AWS_SECRET_ACCESS_KEY)
  • Self-hosted runner configured on staging EC2
  • IAM permissions for EC2, EBS, Snapshots
  • GitHub repository access to Actions

🔄 Typical Workflow

  1. Initial Setup: Deploy infrastructure with Terraform
  2. Populate Production: Load production data
  3. Clone to Staging: Use GitHub Actions workflow to sync and anonymize
  4. Test in Staging: Verify functionality with anonymized data
  5. Cleanup Resources: Run cleanup workflow to delete old snapshots/volumes
  6. Repeat: Run sync as needed (on-demand or scheduled)

💰 Cost Estimate

Per Environment (Monthly):

  • EC2 t3.large: ~$60/month
  • EBS Root (8GB gp3): ~$0.64/month
  • EBS Data (20GB gp3): ~$1.60/month
  • Data transfer: Free (same region)
  • Total: ~$62/month per environment

Snapshots:

  • ~$0.05/GB/month (incremental)
  • 20GB snapshot: ~$1/month
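
The arithmetic behind these figures can be reproduced with a short sketch. The per-GB gp3 rate is the one implied by the numbers above ($0.64 for 8GB); actual AWS pricing varies by region and changes over time, so treat the constants as assumptions.

```javascript
// Rates implied by the estimate above (USD).
const EC2_T3_LARGE_MONTHLY = 60;
const GP3_PER_GB_MONTH = 0.08;
const SNAPSHOT_PER_GB_MONTH = 0.05;

// Monthly cost of one environment: instance plus root and data volumes.
function monthlyCost({ rootGb, dataGb }) {
  const root = rootGb * GP3_PER_GB_MONTH;
  const data = dataGb * GP3_PER_GB_MONTH;
  return { root, data, total: EC2_T3_LARGE_MONTHLY + root + data };
}

const env = monthlyCost({ rootGb: 8, dataGb: 20 });
console.log(env.root.toFixed(2));  // 0.64
console.log(env.data.toFixed(2));  // 1.60
console.log(env.total.toFixed(2)); // 62.24

// Upper bound for one full 20GB snapshot; incremental snapshots cost less.
console.log((20 * SNAPSHOT_PER_GB_MONTH).toFixed(2)); // 1.00
```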

🎯 Use Cases

  • Development/Testing: Safe staging environment with anonymized data
  • Compliance: Support GDPR/CCPA obligations by keeping real PII out of test data
  • Disaster Recovery: Practice restoration procedures
  • Performance Testing: Use production-sized datasets
  • Training: Onboard new team members safely

🔧 Configuration

Terraform Variables

Edit terraform/stacks/staging/terraform.tfvars:

instance_type = "t3.large"
root_volume_size = 8
mongodb_data_volume_size = 20

Workflow Configuration

Edit workflow environment variables in .github/workflows/prod-to-staging-sync.yml:

env:
  AWS_REGION: us-west-2
  PROD_INSTANCE_ID: i-0e360e7615a63a796
  STAGING_INSTANCE_ID: i-05661b198eb8d9b0a

Cleanup Schedule

Edit cron schedule in .github/workflows/cleanup-resources.yml:

schedule:
  - cron: '0 2 * * *'  # Daily at 2 AM UTC

📊 Monitoring

Check MongoDB status:

sudo systemctl status mongod

View data volume:

df -h | grep mongodb

Count documents:

mongosh userdb --quiet --eval "db.users.countDocuments()"

🐛 Troubleshooting

Common Issues

Workflow fails to find volumes:

  • Verify instance IDs in workflow env variables
  • Check that volumes are attached to /dev/sdf
  • Ensure AWS credentials have EC2 describe permissions

Self-hosted runner offline:

  • SSH to staging EC2 and check runner status
  • Restart runner service if needed
  • Verify GitHub runner token hasn't expired

Cleanup finds 0 resources:

  • Check max_age_days setting (use 0 for same-day cleanup)
  • Verify .cleanup/resource-tracker.json has entries
  • Date comparison uses <= so same-day resources are included

README not updating:

  • Check final-summary job logs
  • Verify git push succeeded (check for conflicts)
  • Ensure workflow has contents: write permission

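See the detailed documentation: .github/workflows/README-SYNC.md, .github/workflows/QUICKSTART-SYNC.md, and .cleanup/README.md.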

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📄 License

This project is provided as-is for educational and internal use purposes.

🙋 Support

For issues or questions:

  1. Check the documentation in .github/workflows/README-SYNC.md and .cleanup/README.md
  2. Review troubleshooting guides
  3. Check AWS CloudWatch logs
  4. Verify IAM permissions

🎓 Learning Resources
