In this project I take on the role of a Data Engineer to create a database schema and an ETL pipeline for a fictional startup called Sparkify. The schema is implemented in a Postgres database and optimized for queries on song plays. The project tasks involved designing fact and dimension tables for a star schema and developing Python and SQL scripts for an ETL pipeline that transfers data from files in two local directories into Postgres tables.
Using the song and log datasets, I created a star schema optimized for queries on song play analysis, based on the following entity relationship diagram:
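A star schema centres a single fact table on several dimension tables; here the fact is a song play. As a rough sketch of what two such table definitions might look like in the style of sql_queries.py (the table and column names below are assumptions, not taken from the diagram):

```python
# Illustrative only; the real definitions live in sql_queries.py and follow the ERD.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,   -- fact table: one row per song play event
    start_time  TIMESTAMP NOT NULL,   -- links to the time dimension
    user_id     INT NOT NULL,         -- links to the users dimension
    song_id     VARCHAR,              -- links to the songs dimension
    artist_id   VARCHAR,              -- links to the artists dimension
    level       VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  TEXT
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,       -- dimension table: one row per user
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""
```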
The project uses two datasets. The first is a subset of real data from the Million Song Dataset; each file is in JSON format and contains metadata about a song and its artist. The second consists of JSON log files generated by an event simulator based on the songs in the first dataset; these files simulate activity logs from a music streaming app according to specified configurations.
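As a small example of how one of these JSON files can be inspected, the snippet below loads a single log file with pandas; the file path is a placeholder and the "NextSong" filter is an assumption about how song-play events are labelled in the logs:

```python
import pandas as pd

# Placeholder path; actual filenames depend on the downloaded log_data directory.
df = pd.read_json("data/log_data/sample-events.json", lines=True)

# Assumed convention: rows whose "page" value is "NextSong" represent song plays.
songplays = df[df["page"] == "NextSong"]
print(songplays.head())
```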
In addition to the data files, the project workspace includes the following files:
- test.ipynb displays the first few rows of each table to let you check your database.
- create_tables.py drops and creates database tables. This file is run on the command line to reset tables prior to running the ETL scripts.
- etl.ipynb reads and processes a single file from song_data and log_data and loads the data into database tables. This notebook contains detailed instructions on the ETL process for each of the tables.
- etl.py reads and processes files from song_data and log_data and loads them into the fact and dimension tables.
- sql_queries.py contains all of the project's SQL queries and is imported into the last three files above (see the sketch after this list).
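As a rough sketch of how that import is typically used, the snippet below resets the tables with psycopg2; the query-list names and connection string are assumptions rather than the exact contents of sql_queries.py:

```python
import psycopg2
from sql_queries import create_table_queries, drop_table_queries  # assumed names

def reset_tables():
    # Connection parameters are placeholders; adjust them to your local Postgres setup.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()
    # Drop any existing tables, then recreate them from the CREATE statements.
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    reset_tables()
```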
The following Python libraries are required:
- psycopg2
- pandas
- glob
- os
Jupyter Notebook and Python 3 are needed to run the notebooks and scripts. Download the required datasets and, if necessary, modify the directory paths.
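The directory walk itself is presumably handled with glob and os from the library list above; a minimal sketch, assuming the data lives under data/song_data and data/log_data:

```python
import glob
import os

def get_files(filepath):
    """Collect the absolute paths of all JSON files under a directory tree."""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        matches = glob.glob(os.path.join(root, "*.json"))
        all_files.extend(os.path.abspath(f) for f in matches)
    return all_files

# Directory names are assumptions; adjust the paths to match your workspace.
song_files = get_files("data/song_data")
log_files = get_files("data/log_data")
print(len(song_files), "song files;", len(log_files), "log files")
```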
To build the database and load the data, run the scripts in order from the command line:

```
python create_tables.py
python etl.py
```