This project helps Sparkify, a music streaming startup, move its data and processes to the cloud. The goal is an ETL pipeline that extracts song metadata and user activity logs from AWS S3, loads them into staging tables in a database hosted on Amazon Redshift, and transforms them into a star-schema data model for Sparkify's analytics team. The resulting fact and dimension tables support efficient queries that answer business questions such as which songs are most popular, how users listen over time, and when activity peaks.
The ETL pipeline involves the following steps:
- Extracting song metadata and user activity logs from S3.
- Loading the data into staging tables in a Redshift cluster.
- Transforming the staging data into fact and dimension tables following a star schema.
The image below demonstrates the ETL process of moving data from S3 to Redshift:
The project implements a scalable cloud solution for Sparkify's analytics team to gain insights from their user and song data.
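To make the transform step concrete, the sketch below shows the kind of INSERT ... SELECT statement that moves deduplicated records from a staging table into a dimension table, written as a Python string in the style of sql_queries.py. The table and column names (staging_events, users, and so on) are illustrative assumptions, not the exact schema used in this repository.

```python
# A minimal sketch of one staging-to-dimension transform; table and column
# names are assumptions for illustration only.
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE page = 'NextSong' AND userId IS NOT NULL;
"""
```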
The project relies on two datasets stored in AWS S3:
- Song Data: Metadata about songs and artists, stored in JSON format under s3://udacity-dend/song_data.
- Log Data: User activity logs generated by the Sparkify app, stored in JSON format under s3://udacity-dend/log_data.

Additionally, the JSONPaths file s3://udacity-dend/log_json_path.json specifies how the log data is structured, enabling proper parsing when the logs are loaded into the staging tables.
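As a hedged illustration of how these paths and the JSONPaths file come together, COPY statements along the following lines load the raw files into staging tables. The staging table names, the IAM role placeholder, and the bucket region are assumptions for this sketch, not values taken from the repository.

```python
# Illustrative COPY statements in the style of sql_queries.py; staging table
# names, the {iam_role_arn} placeholder (filled from dwh.cfg), and the REGION
# value are assumptions.
staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '{iam_role_arn}'
    JSON 's3://udacity-dend/log_json_path.json'
    REGION 'us-west-2';
"""

staging_songs_copy = """
    COPY staging_songs
    FROM 's3://udacity-dend/song_data'
    IAM_ROLE '{iam_role_arn}'
    JSON 'auto'
    REGION 'us-west-2';
"""
```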
The song dataset consists of JSON files partitioned by the first three letters of each song’s track ID. For example, here are file paths to two files in the song dataset:
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
Below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The log dataset comprises log files in JSON format, partitioned by year and month. For example, here are file paths to two files in the dataset:
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
This image shows what the data in the log file 2018-11-12-events.json looks like:
These datasets are processed and transformed into a star-schema data model in Redshift, as shown in the Entity Relationship Diagram (ERD) below, consisting of fact and dimension tables to facilitate analysis.
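To give a flavor of the star-schema DDL that 1_create_tables.py issues, a fact table for song plays might be declared roughly as follows. The exact columns, keys, and distribution/sort settings shown here are assumptions rather than the repository's actual definitions.

```python
# Illustrative fact-table DDL in the style of sql_queries.py; columns,
# DISTKEY, and SORTKEY choices are assumptions for this sketch.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id BIGINT IDENTITY(0, 1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    )
    DISTKEY (song_id)
    SORTKEY (start_time);
"""
```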
The repository is structured as follows:
Sparkify-ETL-Pipeline/
├── 0_launch_Redshift_cluster.ipynb
├── 1_create_tables.py
├── 2_etl.py
├── 3_test_dimensional_model.ipynb
├── sql_queries.py
├── dwh.cfg
├── .gitignore
├── README.md
└── LICENSE
- 0_launch_Redshift_cluster.ipynb: Jupyter notebook that sets up and configures an Amazon Redshift cluster used in the ETL process.
- 1_create_tables.py: Python script responsible for creating the staging, fact, and dimension tables in the Redshift database.
- 2_etl.py: Python script that extracts data from S3, loads it into staging tables on Redshift, and then transforms it into the target fact and dimension tables.
- 3_test_dimensional_model.ipynb: Jupyter notebook used for testing and verifying the data loading process, validating the schema, and running analytic queries.
- sql_queries.py: This file contains all the SQL queries required for creating tables and performing the ETL operations.
- dwh.cfg: Configuration file that stores the Redshift cluster settings, database connection details, and AWS credentials used by the scripts (see the configuration sketch after this list).
- .gitignore: Specifies files and directories for Git to ignore, helping to manage sensitive data and unnecessary files.
- README.md: Provides an overview and instructions for this repository.
- LICENSE: The license file for the project.
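As a rough illustration of what dwh.cfg might contain, here is a minimal sketch. The section and key names are assumptions based on how the scripts are described above, and real endpoints and credentials should never be committed to version control.

```ini
; Illustrative layout only; section and key names are assumptions.
[CLUSTER]
HOST=<redshift-cluster-endpoint>
DB_NAME=<database-name>
DB_USER=<database-user>
DB_PASSWORD=<database-password>
DB_PORT=5439

[IAM_ROLE]
ARN=<iam-role-arn-with-read-access-to-s3>

[S3]
LOG_DATA=s3://udacity-dend/log_data
LOG_JSONPATH=s3://udacity-dend/log_json_path.json
SONG_DATA=s3://udacity-dend/song_data
```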
- Launch Redshift Cluster: First, configure and launch the Redshift cluster using the 0_launch_Redshift_cluster.ipynb notebook. This step sets up the target database on Redshift.
- Create Tables: Run 1_create_tables.py to create the staging, fact, and dimension tables in Redshift. This script can be rerun to reset the database if needed.
- Run ETL Pipeline: Execute 2_etl.py to load data from S3 into the staging tables in Redshift using the COPY command, and then insert the data from the staging tables into the fact and dimension tables (see the sketch after these steps).
- Test Dimensional Model: Use 3_test_dimensional_model.ipynb to validate the schema, check row counts, and run analytic queries to ensure the model is ready for analytical workloads.
- Tear Down Cluster: After completing the project, return to the final step in 0_launch_Redshift_cluster.ipynb to delete the Redshift cluster and clean up associated resources.
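For reference, below is a minimal sketch of how 2_etl.py might orchestrate the two phases (COPY into staging, then INSERT ... SELECT into the star schema) with psycopg2. The query lists imported from sql_queries.py and the dwh.cfg section and key names are assumptions based on the file layout described above, not the exact contents of this repository.

```python
import configparser

import psycopg2

# Assumed to be lists of SQL strings defined in sql_queries.py; the actual
# module may organize its queries differently.
from sql_queries import copy_table_queries, insert_table_queries


def load_staging_tables(cur, conn):
    # Run the COPY statements that pull the S3 files into the staging tables.
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    # Run the INSERT ... SELECT statements that populate the star schema.
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    # Connection settings come from dwh.cfg; section/key names are assumed.
    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    conn = psycopg2.connect(
        host=config.get("CLUSTER", "HOST"),
        dbname=config.get("CLUSTER", "DB_NAME"),
        user=config.get("CLUSTER", "DB_USER"),
        password=config.get("CLUSTER", "DB_PASSWORD"),
        port=config.get("CLUSTER", "DB_PORT"),
    )
    cur = conn.cursor()

    load_staging_tables(cur, conn)
    insert_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
```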
Contributions to this project are welcome. If you'd like to improve the ETL pipeline or add additional functionalities, please fork the repository, create a new branch, and submit a pull request. Ensure that your code follows best practices and is well documented.
This project is licensed under the MIT License. Feel free to use, modify, and distribute the application in accordance with the terms of the license.
Special thanks to Udacity for providing the datasets and project specifications. The song and log data used in this project come from the Million Song Dataset and event simulator logs provided by Udacity.