Sparkify, a rapidly growing music streaming startup, is looking to leverage the cloud to handle its expanding user base and song database. This repository contains the necessary tools and pipelines to extract Sparkify's data from Amazon S3, stage it on Amazon Redshift, and transform it into a structured format suitable for analytics purposes.
This project aims to construct an ETL (Extract, Transform, Load) pipeline for Sparkify's data that is currently hosted in Amazon S3. The data in S3 is organized into two main directories:
- User Activity Logs: JSON logs that detail user activities on the Sparkify app.
- Song Metadata: JSON metadata that contains information about the songs available on the Sparkify platform.
The main objectives of this project are:
- Extract the data from S3.
- Stage the extracted data in Redshift.
- Transform the staged data into a set of dimensional tables, optimized for the analytics team's queries about user song plays.
- AWS account with access to S3 and Redshift services.
- Python 3.x
- Libraries:
psycopg2
,boto3
- Redshift cluster with necessary IAM roles for S3 read access.
-
Clone this repository:
git clone https://github.com/nadyavoynich/DataEngineering-ND-CloudETL-Redshift cd Sparkify-CloudETL-Pipeline
-
Create a configuration file named
dwh.cfg
in the root directory with the following format:[CLUSTER] REGION='us-west-2' [DWH] HOST=YOUR_REDSHIFT_ENDPOINT DB_NAME=YOUR_DB_NAME DB_USER=YOUR_DB_USER DB_PASSWORD=YOUR_DB_PASSWORD DB_PORT=5439 [IAM_ROLE] ARN=YOUR_ARN [S3] LOG_DATA='s3://udacity-dend/log_data' LOG_JSONPATH='s3://udacity-dend/log_json_path.json' SONG_DATA='s3://udacity-dend/song_data'
Replace placeholders with your actual AWS and Redshift details.
-
Create tables (or drop and recreate tables):
python create_tables.py
-
Run the ETL pipeline:
python etl.py
After running these commands, your Redshift database will have the dimensional tables loaded with data from S3, ready for analytics queries.
- Check data quality by simple counts:
data_quality_check.py
The database is structured into fact and four dimension tables optimized for song play analyses.
The primary fact table is songplays
, and the dimension tables are users
, songs
, artists
, and time
.
- Create a dashboard for analytic queries on the database.
This project is part of the Data Engineering Nanodegree Program provided by Udacity.