This is a case study of an ETL process built on the Postgres sample database DVD Rental.
The goal is to transform the normalized database into a star schema and load it into AWS S3 and AWS Redshift.
PySpark is used for data operations.
Airflow is used for orchestration.
Star Schema Transformation:
ETL pipeline
- The dvdrental database is loaded into Postgres during container initialization.
- Everything else is built on Airflow's docker-compose example.
├── docker-compose.yaml
├── docker_images
│ ├── airflow
│ │ ├── Dockerfile
│ │ └── requirements.txt
│ └── postgresql
│ ├── Dockerfile
│ ├── dvdrental.tar
│ └── init.sh
- SQL scripts are run with the PostgresOperator to create the tables on Redshift, and PySpark apps read data from Postgres and load it into S3 and Redshift; a minimal sketch of one such PySpark app follows the tree below.
├── dags
│ ├── dvdrental-dwh.py
│ ├── spark_apps
│ │ ├── load_dim_customer.py
│ │ ├── load_dim_movie.py
│ │ ├── load_dim_payment_date.py
│ │ ├── load_dim_rental_date.py
│ │ ├── load_dim_return_date.py
│ │ ├── load_dim_staff.py
│ │ ├── load_dim_store.py
│ │ ├── load_fact_sales.py
│ │ ├── postgresql-42.5.1.jar
│ │ └── redshift-jdbc42-2.1.0.11.jar
│ └── sql_scripts
│ ├── create_dim_customer.sql
│ ├── create_dim_movie.sql
│ ├── create_dim_payment_date.sql
│ ├── create_dim_rental_date.sql
│ ├── create_dim_return_date.sql
│ ├── create_dim_staff.sql
│ ├── create_dim_store.sql
│ └── create_fact_sales.sql
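The following is a minimal sketch of what one of the load apps (here modeled on load_dim_staff.py) might look like, assuming the bundled JDBC jars are used and plain JDBC reads/writes; the hostnames, credentials, bucket name, cluster endpoint, and column selection are placeholders, not values from this repository.

```python
# Hedged sketch of a PySpark load app; connection details below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("load_dim_staff")
    # postgresql-42.5.1.jar and redshift-jdbc42-2.1.0.11.jar ship in dags/spark_apps/
    .config("spark.jars", "postgresql-42.5.1.jar,redshift-jdbc42-2.1.0.11.jar")
    .getOrCreate()
)

# Read the normalized source table from the Postgres container (credentials are placeholders)
staff = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/dvdrental")
    .option("dbtable", "staff")
    .option("user", "airflow")
    .option("password", "airflow")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Shape the dimension (column choice is illustrative)
dim_staff = staff.selectExpr("staff_id", "first_name", "last_name", "email", "store_id")

# Stage to S3 as Parquet (bucket is a placeholder; hadoop-aws/S3 credentials assumed configured)
dim_staff.write.mode("overwrite").parquet("s3a://my-dwh-bucket/dim_staff/")

# Load into the Redshift table created by the corresponding SQL script
(
    dim_staff.write.format("jdbc")
    .option("url", "jdbc:redshift://my-cluster.xxxxx.eu-west-1.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "dim_staff")
    .option("user", "awsuser")
    .option("password", "********")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("append")
    .save()
)
```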
- Table checks in the DAG are performed by the custom operator 'RedshiftCheckTables', which lives under the plugins folder; a minimal sketch of such an operator follows the tree below.
├── plugins
│ ├── __init__.py
│ └── operators
│ ├── data_quality.py
│ └── __init__.py
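As a rough idea of what plugins/operators/data_quality.py could contain, here is a hedged sketch of a RedshiftCheckTables-style operator that fails if any listed Redshift table is empty; the real operator's parameters and checks may differ.

```python
# Hedged sketch of a RedshiftCheckTables-style data quality operator.
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class RedshiftCheckTables(BaseOperator):
    """Fail the task if any of the given Redshift tables returns no rows."""

    def __init__(self, redshift_conn_id="redshift", tables=None, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or records[0][0] == 0:
                raise ValueError(f"Data quality check failed: {table} returned no rows")
            self.log.info("Data quality check passed: %s has %s rows", table, records[0][0])
```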
- Initializing and starting the environment with the docker-compose file is straightforward and explained here.
- After the containers are ready, the credentials and the Airflow-Redshift connection must be defined.
- To define the credentials, fill in credentials.json and import it into Airflow Variables through the Airflow UI (http://localhost:8080/variable/list/).
- To define the Airflow-Redshift connection, create a Redshift connection in the Airflow UI; the connection id used in the DAG file is 'redshift', so the connection must be named 'redshift' when it is created (see the sketch below).
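The sketch below shows how a DAG like dvdrental-dwh.py might consume the imported Variables and the 'redshift' connection; the variable keys, DAG arguments, and task shown are assumptions for illustration only.

```python
# Hedged sketch: reading Airflow Variables and using the 'redshift' connection id.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Values imported from credentials.json via the Variables UI (keys are placeholders)
aws_access_key = Variable.get("aws_access_key_id")
aws_secret_key = Variable.get("aws_secret_access_key")

with DAG(
    dag_id="dvdrental_dwh_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # The connection id is hard-coded to 'redshift' in the DAG,
    # so the UI connection must be created with exactly that name.
    create_dim_staff = PostgresOperator(
        task_id="create_dim_staff",
        postgres_conn_id="redshift",
        sql="sql_scripts/create_dim_staff.sql",
    )
```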