In this project I take on the role of a Data Engineer to create a database schema and an ETL pipeline for a fictional startup called Sparkify. The schema is implemented in a Postgres database and optimized for queries on song plays. The project tasks involved designing fact and dimension tables for a star schema and developing Python and SQL scripts for an ETL pipeline that transfers data from files in two local directories into Postgres tables.
Using the song and log datasets, I created a star schema optimized for queries on song play analysis, based on the following entity relationship diagram:
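A star schema centres a single fact table on several dimension tables; here the fact is a song play. As a rough sketch of what two such table definitions might look like in the style of sql_queries.py (the table and column names below are assumptions, not taken from the diagram):

```python
# Illustrative only; the real definitions live in sql_queries.py and follow the ERD.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,   -- fact table: one row per song play event
    start_time  TIMESTAMP NOT NULL,   -- links to the time dimension
    user_id     INT NOT NULL,         -- links to the users dimension
    song_id     VARCHAR,              -- links to the songs dimension
    artist_id   VARCHAR,              -- links to the artists dimension
    level       VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  TEXT
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,       -- dimension table: one row per user
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""
```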
The project uses two datasets. The first is a subset of real data from the Million Song Dataset; each file is in JSON format and contains metadata about a song and its artist. The second consists of JSON log files generated by an event simulator based on the songs in the first dataset; these files simulate activity logs from a music streaming app according to specified configurations.
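As a small example of how one of these JSON files can be inspected, the snippet below loads a single log file with pandas; the file path is a placeholder and the "NextSong" filter is an assumption about how song-play events are labelled in the logs:

```python
import pandas as pd

# Placeholder path; actual filenames depend on the downloaded log_data directory.
df = pd.read_json("data/log_data/sample-events.json", lines=True)

# Assumed convention: rows whose "page" value is "NextSong" represent song plays.
songplays = df[df["page"] == "NextSong"]
print(songplays.head())
```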
In addition to the data files, the project workspace includes the following files:
- test.ipynb displays the first few rows of each table to let you check your database.
- create_tables.py drops and creates database tables. This file is run on the command line to reset tables prior to running the ETL scripts.
- etl.ipynb reads and processes a single file from song_data and log_data and loads the data into database tables. This notebook contains detailed instructions on the ETL process for each of the tables.
- etl.py reads and processes files from song_data and log_data and loads them into the fact and dimension tables.
- sql_queries.py contains all of the project's SQL queries and is imported into the last three files above (see the sketch after this list).
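As a rough sketch of how that import is typically used, the snippet below resets the tables with psycopg2; the query-list names and connection string are assumptions rather than the exact contents of sql_queries.py:

```python
import psycopg2
from sql_queries import create_table_queries, drop_table_queries  # assumed names

def reset_tables():
    # Connection parameters are placeholders; adjust them to your local Postgres setup.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()
    # Drop any existing tables, then recreate them from the CREATE statements.
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    reset_tables()
```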
The following Python libraries are required:
- psycopg2
- pandas
- glob
- os
Jupyter Notebook and Python 3 are needed to run the notebooks and scripts. Download the required datasets and, if necessary, modify the directory paths.
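The directory walk itself is presumably handled with glob and os from the library list above; a minimal sketch, assuming the data lives under data/song_data and data/log_data:

```python
import glob
import os

def get_files(filepath):
    """Collect the absolute paths of all JSON files under a directory tree."""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        matches = glob.glob(os.path.join(root, "*.json"))
        all_files.extend(os.path.abspath(f) for f in matches)
    return all_files

# Directory names are assumptions; adjust the paths to match your workspace.
song_files = get_files("data/song_data")
log_files = get_files("data/log_data")
print(len(song_files), "song files;", len(log_files), "log files")
```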
To build the database and load the data, run the scripts in order from the command line:

```
python create_tables.py
python etl.py
```