Data Modelling with Postgres

In this project I take on the role of a data engineer and build a database schema and an ETL pipeline for a fictional startup called Sparkify. The schema is implemented in a Postgres database and optimized for queries on song plays. The project tasks involved designing fact and dimension tables for a star schema and developing Python and SQL scripts for an ETL pipeline that transfers data from files in two local directories into Postgres tables.

Table of contents

  • Schema for Song Play Analysis
  • Data and Code
  • Prerequisites
  • Instructions on running the application

Schema for Song Play Analysis

Using the song and log datasets, a star schema optimized for queries on song play analysis was created, following the entity relationship diagram included in the repository (see the ERD image).
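To make the schema concrete, below is a minimal sketch of the kind of DDL that sql_queries.py defines, assuming the conventional star schema for this project: a songplays fact table surrounded by users, songs, artists, and time dimension tables. The column names and types here are illustrative assumptions, not necessarily the project's exact definitions.

    # Hypothetical excerpt in the style of sql_queries.py; only the fact
    # table and one dimension table are shown. Names/types are assumptions.

    songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
    """

    user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    );
    """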

Data and Code

This project uses two datasets. The first is a subset of real data from the Million Song Dataset; each file is in JSON format and contains metadata about a song and that song's artist. The second consists of JSON log files generated by an event simulator based on the songs in the first dataset; these simulate activity logs from a music streaming app under specified configurations.
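As a rough illustration of how these files are consumed, a single song or log file can be loaded into a pandas DataFrame as below. The file paths are placeholders, and lines=True assumes one JSON record per line, which is how the simulator's log files are typically laid out.

    import pandas as pd

    # Placeholder paths; adjust to wherever the datasets were downloaded.
    song_filepath = "data/song_data/A/A/A/example_song.json"
    log_filepath = "data/log_data/2018/11/example-events.json"

    # Each song file holds a single JSON object; each log file holds one
    # JSON record per line, hence lines=True.
    song_df = pd.read_json(song_filepath, lines=True)
    log_df = pd.read_json(log_filepath, lines=True)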

In addition to the data files, the project workspace includes the following files:

  • test.ipynb displays the first few rows of each table to let you check your database.
  • create_tables.py drops and creates database tables. This file is run on the command line to reset tables prior to running the ETL scripts.
  • etl.ipynb reads and processes a single file from song_data and log_data and loads the data into database tables. This notebook contains detailed instructions on the ETL process for each of the tables.
  • etl.py reads and processes files from song_data and log_data and loads them into the fact and dimension tables; a rough sketch of its structure follows this list.
  • sql_queries.py contains all of the project's SQL queries and is imported into the last three files above.
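For orientation, the overall shape of etl.py is roughly as follows. This is a sketch under assumptions: the helper name get_files and the connection string (database sparkifydb, user/password student) are illustrative, not taken from the project's code.

    import glob
    import os

    import pandas as pd
    import psycopg2

    def get_files(filepath):
        """Collect the absolute paths of all .json files under a directory tree."""
        all_files = []
        for root, dirs, files in os.walk(filepath):
            for f in glob.glob(os.path.join(root, "*.json")):
                all_files.append(os.path.abspath(f))
        return all_files

    def main():
        # Connection parameters are assumptions for illustration.
        conn = psycopg2.connect(
            "host=127.0.0.1 dbname=sparkifydb user=student password=student"
        )
        cur = conn.cursor()

        for filepath in get_files("data/song_data"):
            df = pd.read_json(filepath, lines=True)
            # Insert rows into the songs and artists dimension tables here,
            # using the INSERT statements imported from sql_queries.py.
            conn.commit()

        # Log files would be processed similarly to fill the time and
        # users dimensions and the songplays fact table.

        conn.close()

    if __name__ == "__main__":
        main()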

Prerequisites

  • psycopg2
  • pandas
  • glob (Python standard library)
  • os (Python standard library)

Jupyter Notebook and Python 3 are needed to run the notebooks and Python scripts.
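If psycopg2 and pandas are not already installed, they can typically be installed with pip (psycopg2-binary is a common choice for local development):

pip install psycopg2-binary pandas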

Instructions on running the application

Download the required datasets and, if necessary, modify the directory paths. Then run the scripts from the command line, creating the tables first:

python create_tables.py

python etl.py
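Once both scripts finish, the load can be sanity-checked with test.ipynb, or directly with a small query like the one below (the connection parameters are the same assumptions as in the earlier sketch):

    import psycopg2

    # Connection parameters are illustrative assumptions.
    conn = psycopg2.connect(
        "host=127.0.0.1 dbname=sparkifydb user=student password=student"
    )
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM songplays;")
    print("songplays rows:", cur.fetchone()[0])
    conn.close()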
