This is a case study of an ETL process built on the Postgres sample database DVD Rental.
The goal is to transform the normalized database into a star schema and load it into AWS S3 and AWS Redshift.
PySpark is used for data operations.
Airflow is used for orchestration.
Star Schema Transformation:
ETL pipeline
- The dvdrental database is loaded into Postgres during container initialization.
- Everything else is built on Airflow's docker-compose example.
├── docker-compose.yaml
├── docker_images
│ ├── airflow
│ │ ├── Dockerfile
│ │ └── requirements.txt
│ └── postgresql
│ ├── Dockerfile
│ ├── dvdrental.tar
│ └── init.sh
- SQL scripts are run with the PostgresOperator to create the tables on Redshift, and PySpark apps read data from Postgres and load it into S3 and Redshift; a minimal sketch of one such PySpark app follows the tree below.
├── dags
│ ├── dvdrental-dwh.py
│ ├── spark_apps
│ │ ├── load_dim_customer.py
│ │ ├── load_dim_movie.py
│ │ ├── load_dim_payment_date.py
│ │ ├── load_dim_rental_date.py
│ │ ├── load_dim_return_date.py
│ │ ├── load_dim_staff.py
│ │ ├── load_dim_store.py
│ │ ├── load_fact_sales.py
│ │ ├── postgresql-42.5.1.jar
│ │ └── redshift-jdbc42-2.1.0.11.jar
│ └── sql_scripts
│ ├── create_dim_customer.sql
│ ├── create_dim_movie.sql
│ ├── create_dim_payment_date.sql
│ ├── create_dim_rental_date.sql
│ ├── create_dim_return_date.sql
│ ├── create_dim_staff.sql
│ ├── create_dim_store.sql
│ └── create_fact_sales.sql
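The following is a minimal sketch of what one of the load apps (here modeled on load_dim_staff.py) might look like, assuming the bundled JDBC jars are used and plain JDBC reads/writes; the hostnames, credentials, bucket name, cluster endpoint, and column selection are placeholders, not values from this repository.

```python
# Hedged sketch of a PySpark load app; connection details below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("load_dim_staff")
    # postgresql-42.5.1.jar and redshift-jdbc42-2.1.0.11.jar ship in dags/spark_apps/
    .config("spark.jars", "postgresql-42.5.1.jar,redshift-jdbc42-2.1.0.11.jar")
    .getOrCreate()
)

# Read the normalized source table from the Postgres container (credentials are placeholders)
staff = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/dvdrental")
    .option("dbtable", "staff")
    .option("user", "airflow")
    .option("password", "airflow")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Shape the dimension (column choice is illustrative)
dim_staff = staff.selectExpr("staff_id", "first_name", "last_name", "email", "store_id")

# Stage to S3 as Parquet (bucket is a placeholder; hadoop-aws/S3 credentials assumed configured)
dim_staff.write.mode("overwrite").parquet("s3a://my-dwh-bucket/dim_staff/")

# Load into the Redshift table created by the corresponding SQL script
(
    dim_staff.write.format("jdbc")
    .option("url", "jdbc:redshift://my-cluster.xxxxx.eu-west-1.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "dim_staff")
    .option("user", "awsuser")
    .option("password", "********")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("append")
    .save()
)
```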
- Table checks in the DAG are performed by the custom operator 'RedshiftCheckTables', which lives under the plugins folder; a minimal sketch of such an operator follows the tree below.
├── plugins
│ ├── __init__.py
│ └── operators
│ ├── data_quality.py
│ └── __init__.py
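As a rough idea of what plugins/operators/data_quality.py could contain, here is a hedged sketch of a RedshiftCheckTables-style operator that fails if any listed Redshift table is empty; the real operator's parameters and checks may differ.

```python
# Hedged sketch of a RedshiftCheckTables-style data quality operator.
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class RedshiftCheckTables(BaseOperator):
    """Fail the task if any of the given Redshift tables returns no rows."""

    def __init__(self, redshift_conn_id="redshift", tables=None, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or records[0][0] == 0:
                raise ValueError(f"Data quality check failed: {table} returned no rows")
            self.log.info("Data quality check passed: %s has %s rows", table, records[0][0])
```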
- Initializing and starting the environment with the docker-compose file is straightforward and explained here.
- After the containers are ready, the credentials and the Airflow-Redshift connection must be defined.
- To define the credentials, fill in credentials.json and import it into Airflow Variables through the Airflow UI (http://localhost:8080/variable/list/).
- To define the Airflow-Redshift connection, create a Redshift connection in the Airflow UI; the connection id used in the DAG file is 'redshift', so the connection must be named 'redshift' when it is created (see the sketch below).
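The sketch below shows how a DAG like dvdrental-dwh.py might consume the imported Variables and the 'redshift' connection; the variable keys, DAG arguments, and task shown are assumptions for illustration only.

```python
# Hedged sketch: reading Airflow Variables and using the 'redshift' connection id.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Values imported from credentials.json via the Variables UI (keys are placeholders)
aws_access_key = Variable.get("aws_access_key_id")
aws_secret_key = Variable.get("aws_secret_access_key")

with DAG(
    dag_id="dvdrental_dwh_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # The connection id is hard-coded to 'redshift' in the DAG,
    # so the UI connection must be created with exactly that name.
    create_dim_staff = PostgresOperator(
        task_id="create_dim_staff",
        postgres_conn_id="redshift",
        sql="sql_scripts/create_dim_staff.sql",
    )
```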