This repository is for building healthcare data engineering skills through practical projects and exercises. The data comes from Synthea, a synthetic clinical data simulator that generates realistic, but not real, patient data. The goal is to provide hands-on experience with Python and SQL, along with the following technologies and tools commonly used in data engineering:
- Python
- SQL
- Docker
- Terraform
- PostgreSQL
- Google Cloud Platform (GCP)
- Mage (alternative to Airflow)
- BigQuery
- dbt (data build tool)
- Apache Spark (Python & SQL)
- Kafka
- Faust
- KSQL
- ksqlDB
- Make

The material is organized into the following modules and workshops:

- Module 1: Containerization and Infrastructure as Code (IaC)
  - Docker
  - Terraform
  - GCP
- Module 2: Workflow Orchestration
  - Data Lake
  - Mage
  - Airflow
- Module 3: Data Warehouse
  - Data Warehouse
  - BigQuery
- Module 4: Analytics Engineering
  - ELT vs. ETL
  - dbt
  - Testing (unit and integration)
- Module 5: Batch Processing
  - Apache Spark (Python & SQL)
- Module 6: Streaming
  - Kafka
  - Faust
  - KSQL
  - ksqlDB
  - Exposure to examples with Java & Scala
- Workshop 1: Data Ingestion
- Workshop 2: Stream Processing with SQL