This project builds a data pipeline for Sparkify using Apache Airflow to automatically and periodically pull user behavior data from Amazon S3 and load it into structured tables in Amazon Redshift for analysis.
- dags/sparkify_etl_dag.py: defines the DAG for the Sparkify data ETL (a rough sketch of what this might look like follows this list);
- plugins/helpers/sql_queries.py: stores all SQL statements used in this project;
- plugins/operators: stores all operators used in the DAG.
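For orientation, a DAG like the one in dags/sparkify_etl_dag.py might be structured roughly as below. This is only a hedged sketch: the custom operator names mentioned in the comments (StageToRedshiftOperator, LoadFactOperator, LoadDimensionOperator, DataQualityOperator) are assumptions about what lives in plugins/operators, not the project's confirmed API.

```python
# Minimal sketch of a Sparkify ETL DAG (Airflow 1.x style, matching the
# `airflow initdb` setup below). The real DAG wires in the custom
# operators from plugins/operators; their names here are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    "owner": "sparkify",
    "start_date": datetime(2019, 1, 1),
    "depends_on_past": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email_on_retry": False,
}

dag = DAG(
    "sparkify_etl_dag",
    default_args=default_args,
    description="Load user behavior data from S3 into Redshift",
    schedule_interval="@hourly",  # "periodically", per the project description
    catchup=False,
)

start = DummyOperator(task_id="begin_execution", dag=dag)
end = DummyOperator(task_id="stop_execution", dag=dag)

# In the real DAG, the custom operators (e.g. StageToRedshiftOperator,
# LoadFactOperator, LoadDimensionOperator, DataQualityOperator -- names
# assumed here) would sit between these two anchor tasks:
# start >> stage_events >> load_fact >> load_dimensions >> quality_checks >> end
start >> end
```

The SQL those operators run would come from plugins/helpers/sql_queries.py.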
- Install Airflow: `pip install apache-airflow`
- Go to your $AIRFLOW_HOME folder and replace the dags and plugins folders with those in this project.
- Run `airflow initdb`, then start `airflow webserver`, `airflow scheduler`, and `airflow worker`.
- Check and manage the DAG in the Airflow web UI (before running the DAG, you first need to configure your AWS S3 and Redshift credentials as Airflow connections; one possible way to do this is sketched after this list).
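The sketch below registers the needed credentials programmatically; the connection IDs (`aws_credentials`, `redshift`) and all field values are assumed placeholders, so adjust them to whatever the DAG's operators actually expect. The same connections can also be created by hand in the web UI under Admin > Connections.

```python
# Sketch: register AWS and Redshift credentials as Airflow connections.
# Connection IDs and field values are hypothetical placeholders.
from airflow import settings
from airflow.models import Connection

session = settings.Session()

aws_conn = Connection(
    conn_id="aws_credentials",        # assumed ID; check the operators
    conn_type="aws",
    login="YOUR_AWS_ACCESS_KEY_ID",
    password="YOUR_AWS_SECRET_ACCESS_KEY",
)

redshift_conn = Connection(
    conn_id="redshift",               # assumed ID; check the operators
    conn_type="postgres",
    host="your-cluster.xxxxxxxx.us-west-2.redshift.amazonaws.com",
    schema="dev",
    login="awsuser",
    password="YOUR_REDSHIFT_PASSWORD",
    port=5439,
)

session.add(aws_conn)
session.add(redshift_conn)
session.commit()
```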