ercan5535/Postgres-dvdrental-dwh-Airflow-PySpark-Redshift

Description

This is a case study of an ETL process built on the Postgres sample database DVD Rental.
The goal is to transform the normalized database into a star schema and load it into AWS S3 and AWS Redshift.

PySpark is used for the data operations.
Airflow is used for all orchestration.


Star Schema Transformation: (diagram)

ETL pipeline: (diagram)
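Below is a minimal sketch of how the DAG in dags/dvdrental-dwh.py might wire these stages together. The task ids, container paths, and spark-submit invocation are assumptions for illustration, not the repo's exact code:

```python
# Hypothetical sketch of the orchestration in dags/dvdrental-dwh.py;
# task ids, paths, and operator arguments are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="dvdrental-dwh",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    template_searchpath="/opt/airflow/dags/sql_scripts",  # assumed mount point
) as dag:
    # Create a star-schema table on Redshift from one of the SQL scripts.
    create_dim_customer = PostgresOperator(
        task_id="create_dim_customer",
        postgres_conn_id="redshift",
        sql="create_dim_customer.sql",
    )

    # Run the PySpark app that moves the dimension from Postgres
    # into S3 and Redshift, using the JDBC drivers bundled in the repo.
    load_dim_customer = BashOperator(
        task_id="load_dim_customer",
        bash_command=(
            "spark-submit "
            "--jars /opt/airflow/dags/spark_apps/postgresql-42.5.1.jar,"
            "/opt/airflow/dags/spark_apps/redshift-jdbc42-2.1.0.11.jar "
            "/opt/airflow/dags/spark_apps/load_dim_customer.py"
        ),
    )

    create_dim_customer >> load_dim_customer
```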

Development

├── docker-compose.yaml
├── docker_images
│   ├── airflow
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   └── postgresql
│       ├── Dockerfile
│       ├── dvdrental.tar
│       └── init.sh
  • SQL scripts create the warehouse tables on Redshift via PostgresOperator, and PySpark apps read data from Postgres and load it into S3 and Redshift (see the PySpark sketch after this tree)
├── dags
│   ├── dvdrental-dwh.py
│   ├── spark_apps
│   │   ├── load_dim_customer.py
│   │   ├── load_dim_movie.py
│   │   ├── load_dim_payment_date.py
│   │   ├── load_dim_rental_date.py
│   │   ├── load_dim_return_date.py
│   │   ├── load_dim_staff.py
│   │   ├── load_dim_store.py
│   │   ├── load_fact_sales.py
│   │   ├── postgresql-42.5.1.jar
│   │   └── redshift-jdbc42-2.1.0.11.jar
│   └── sql_scripts
│       ├── create_dim_customer.sql
│       ├── create_dim_movie.sql
│       ├── create_dim_payment_date.sql
│       ├── create_dim_rental_date.sql
│       ├── create_dim_return_date.sql
│       ├── create_dim_staff.sql
│       ├── create_dim_store.sql
│       └── create_fact_sales.sql
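A minimal sketch of what one of these load apps (for example load_dim_customer.py) might look like, using the two JDBC drivers listed above. The hostnames, table names, S3 bucket, and credentials are illustrative assumptions:

```python
# Hypothetical sketch of a load_dim_* app; actual column selection,
# paths, and credentials handling in the repo may differ.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("load_dim_customer")
    # The repo ships these drivers under dags/spark_apps/.
    .config("spark.jars", "postgresql-42.5.1.jar,redshift-jdbc42-2.1.0.11.jar")
    .getOrCreate()
)

# Read the normalized source table from Postgres over JDBC.
customer = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/dvdrental")  # assumed host
    .option("dbtable", "customer")
    .option("user", "postgres")       # assumption
    .option("password", "postgres")   # assumption
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Shape the dimension (column list is illustrative).
dim_customer = customer.selectExpr(
    "customer_id", "first_name", "last_name", "email"
)

# Land a copy in S3, then load the dimension into Redshift over JDBC.
dim_customer.write.mode("overwrite").parquet("s3a://your-bucket/dim_customer/")  # bucket is an assumption
(
    dim_customer.write.format("jdbc")
    .option("url", "jdbc:redshift://your-cluster:5439/dev")  # assumed endpoint
    .option("dbtable", "dim_customer")
    .option("user", "awsuser")        # assumption
    .option("password", "...")        # placeholder
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("append")
    .save()
)
```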
  • The table checks in the DAG are run by the custom operator 'RedshiftCheckTables', which lives under the plugins folder (see the sketch after this tree)
├── plugins
│   ├── __init__.py
│   ├── operators
│   │   ├── data_quality.py
│   │   ├── __init__.py
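For reference, a minimal sketch of what a check-tables operator like RedshiftCheckTables in data_quality.py might look like; the class internals here are assumptions, not the repo's actual code:

```python
# Hypothetical sketch of plugins/operators/data_quality.py;
# the real RedshiftCheckTables implementation may differ.
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class RedshiftCheckTables(BaseOperator):
    """Fail the task if any of the given Redshift tables is empty."""

    def __init__(self, redshift_conn_id="redshift", tables=None, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or records[0][0] == 0:
                raise ValueError(f"Data quality check failed: {table} is empty")
            self.log.info(
                "Data quality check passed: %s has %s rows", table, records[0][0]
            )
```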

Usage

  • Initializing and starting the environment with the docker-compose file is straightforward and is explained here
  • After the containers are ready, credentials and an Airflow-Redshift connection must be defined
  • To define credentials, fill in credentials.json and import it into Airflow variables via the Airflow UI (http://localhost:8080/variable/list/)
  • To define the Airflow-Redshift connection, create a Redshift connection in the Airflow UI; the DAG file refers to it as 'redshift', so it must be named 'redshift' (see the sketch below)
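As a sketch, the DAG can read the imported values back with Airflow's Variable API; the variable key and JSON field names below are assumptions about what credentials.json contains:

```python
# Hypothetical sketch of reading the imported variables inside the DAG;
# the variable key and field names are assumptions about credentials.json.
from airflow.models import Variable

creds = Variable.get("credentials", deserialize_json=True)
aws_access_key = creds["aws_access_key_id"]      # assumed field name
aws_secret_key = creds["aws_secret_access_key"]  # assumed field name

# Any task that talks to Redshift must reference the connection by the id
# the DAG expects, e.g. PostgresOperator(postgres_conn_id="redshift", ...).
```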