Data Warehouse

Project Data Warehouse as part of the Udactiy Data Engineering Nanodegree.

Project Summary

An implementation of a Data Warehouse leveraging AWS RedShift. This projects contains the ETL pipeline that extracts data from S3, stages them in Redshift, and transforms data into a set of dimensional tables for their analytics team.

The data on S3 contains song and log information from a music store. This solution enables music stores to easily process loads of information efficiently.

Purpose of this project

This projects processes data from different sources (in this case multiple S3 buckets) in a way that it can be analyzed easily and efficiently. The startup Sparkify thus enjoys the eased analysis of the run-time data of their application.

Project instructions

Setup a redshift cluster on AWS and insert the connection details in dwh.cfg.
Create the needed the database structure by executing create_tables.py.
Process the data from the configured S3 data sources by executing etl.py.

Database schema

Table	Description
staging_events	stating table for event data
staging_songs	staging table for song data
songplays	information how songs were played, e.g. when by which user in which session
users	user-related information such as name, gender and level
songs	song-related information containing name, artist, year and duration
artists	artist name and location (geo-coords and textual location)
time	time-related info for timestamps

ETL pipeline

Load song and log data both from S3 buckets.
Stage the loaded data.
Transform the data into the above described data schema.

Example queries

Find all users at a certain location:

SELECT DISTINCT users.user_id FROM users JOIN songplays ON songplays.user_id = users.user_id WHERE songplays.location = <LOCATION>

Find all songs by a given artist:

SELECT songs.song_id FROM songs JOIN artists ON songs.artist_id = artists.artist_id WHERE artist.name = <ARTIST>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Data Warehouse

Project Summary

Purpose of this project

Project instructions

Database schema

ETL pipeline

Example queries

Files

README.md

Latest commit

History

README.md

File metadata and controls

Data Warehouse

Project Summary

Purpose of this project

Project instructions

Database schema

ETL pipeline

Example queries