A music streaming startup, Sparkify, has grown its user base and song database even more and wants to move its data warehouse to a data lake. Its data resides in S3: a directory of JSON logs of user activity on the app, and a directory of JSON metadata on the songs in the app.
As data engineers, we are tasked with building an ETL pipeline that extracts the data from S3, processes it using Spark, and loads it back into S3 as a set of dimensional tables. This will allow the analytics team to continue finding insights into what songs their users are listening to.
The database and ETL pipeline can be tested by running queries provided by the Sparkify analytics team and comparing the results against their expected results.
For the purpose of this exercise our chosen AWS region is us-west-2.
Follow these steps to reproduce the results.
Provide your AWS Access Key ID and Secret Access Key in dl.cfg. Do not surround the values with quotes, as quotes can cause problems here.
[AWS]
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
(Note: in a production environment we would store these credentials in a more secure location, rather than in the same directory as the Python script / working directory.)
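These credentials are typically loaded from dl.cfg with the standard-library configparser and exported as environment variables so that Spark's S3 connector can pick them up. A minimal sketch (the function name is ours; etl.py may differ):

```python
import configparser
import os


def load_aws_credentials(path="dl.cfg"):
    """Read the [AWS] section of dl.cfg and export the keys as
    environment variables for the S3 connector."""
    config = configparser.ConfigParser()
    config.read(path)
    os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
    os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```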
This step is purely for understanding. To view the public S3 bucket udacity-dend (where the input song and log files are stored) via the S3 console, open in a browser:
https://s3.console.aws.amazon.com/s3/buckets/udacity-dend?region=us-west-2&tab=objects#
See song-data and log-data for the input JSON files.
Go to S3 console: https://s3.console.aws.amazon.com/s3/buckets?region=us-west-2
Create Bucket.
- Bucket name (any globally unique name): sparkify-dl-20220911t113300
- AWS Region: us-west-2
- Object Ownership (use default): ACLs disabled (recommended)
- Block Public Access settings for this bucket (use default): checked (block all public access)
- Bucket versioning (use default): Disable
- Tags (use default): none
- Server-side encryption (use default): Disable
- Object Lock (use default): Disable
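The same bucket can also be created programmatically. A hedged sketch with boto3 (function names are ours; outside us-east-1 a LocationConstraint must be supplied, and the console defaults above are simply left at the API defaults here):

```python
def bucket_settings(bucket_name, region="us-west-2"):
    """Arguments for create_bucket; outside us-east-1 a
    CreateBucketConfiguration with a LocationConstraint is required."""
    return {
        "Bucket": bucket_name,
        "CreateBucketConfiguration": {"LocationConstraint": region},
    }


def create_data_lake_bucket(bucket_name, region="us-west-2"):
    """Create the output bucket (equivalent to the console steps above)."""
    import boto3  # local import so the sketch can be read without boto3 installed

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**bucket_settings(bucket_name, region))
```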
Run python etl.py via one of the following two options:
Option 1: run locally via the Udacity Workspace environment. The steps that read from and write to the S3 bucket may take longer, as data has to travel further between the S3 location and the Udacity workspace server. The initial development used the Udacity workspace environment, restricted to reading only a small number of files, e.g. just one log file: s3a://udacity-dend/log_data/2018/11/2018-11-12-events.json and a small number of song files: s3a://udacity-dend/song-data/*/*/*/*.json
Option 2: run on an AWS EMR cluster. Launch an AWS EMR cluster as per the instructions in Appendix 2.
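The development-time restriction in Option 1 can be sketched as a pair of narrow input globs. The single log file below is the one mentioned above; narrowing the song-data glob to one prefix (A/A/A) is our illustrative assumption, to be widened back to */*/*/*.json for a full run:

```python
def dev_input_paths(bucket="s3a://udacity-dend"):
    """Narrow input globs for development runs so only a handful of
    files are read from S3."""
    return {
        "log_data": f"{bucket}/log_data/2018/11/2018-11-12-events.json",
        # assumption: one prefix only; use */*/*/*.json for the full dataset
        "song_data": f"{bucket}/song-data/A/A/A/*.json",
    }


def read_dev_subset(spark):
    """Read the restricted subset with an active SparkSession."""
    paths = dev_input_paths()
    return spark.read.json(paths["log_data"]), spark.read.json(paths["song_data"])
```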
Once the Python ETL script has completed, we should see the following datasets (each stored in its own folder):
- staging_songs - effectively a dataset populated from the raw song-data JSON files, to make downstream investigation and development work easier.
- staging_events - effectively a dataset populated from the raw log-data JSON files, to make downstream investigation and development work easier.
- songplays - records in the log data associated with song plays, i.e. records with page NextSong. Fields: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - users in the app. Fields: user_id, first_name, last_name, gender, level
- songs - songs in the music database. Fields: song_id, title, artist_id, year, duration
- artists - artists in the music database. Fields: artist_id, name, location, latitude, longitude
- time - timestamps of records in songplays broken down into specific units. Fields: start_time, hour, day, week, month, year, weekday
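Writing these tables out as parquet folders can be sketched as follows. The partition columns below are typical choices for this star schema, not taken from etl.py (an assumption; adjust to match the actual script):

```python
# Partition columns per table -- typical choices for this schema
# (an assumption; match whatever etl.py actually uses).
PARQUET_PARTITIONS = {
    "songs": ["year", "artist_id"],
    "time": ["year", "month"],
    "songplays": ["year", "month"],
}


def write_dimensional_tables(tables, output_path):
    """Write each DataFrame in `tables` (name -> Spark DataFrame)
    to its own parquet folder under output_path."""
    for name, df in tables.items():
        writer = df.write.mode("overwrite")
        if name in PARQUET_PARTITIONS:
            writer = writer.partitionBy(*PARQUET_PARTITIONS[name])
        writer.parquet(f"{output_path.rstrip('/')}/{name}")
```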
Due to time constraints we have not done this yet, but if we were to, we could read the parquet datasets using Spark and run some basic analytics, such as table joins and aggregations.
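Such an analytic could look like the sketch below: read two parquet datasets back from the data-lake bucket, join them, and aggregate. The function and helper names are ours:

```python
def table_path(output_path, name):
    """Path of one parquet dataset inside the data-lake bucket."""
    return f"{output_path.rstrip('/')}/{name}"


def top_songs(spark, output_path, n=10):
    """Example analytic: join songplays to songs and count plays per title."""
    songplays = spark.read.parquet(table_path(output_path, "songplays"))
    songs = spark.read.parquet(table_path(output_path, "songs"))
    return (
        songplays.join(songs, "song_id")
        .groupBy("title")
        .count()
        .orderBy("count", ascending=False)
        .limit(n)
    )
```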
To view the public S3 bucket udacity-dend via the S3 console, open in a browser:
https://s3.console.aws.amazon.com/s3/buckets/udacity-dend?region=us-west-2&tab=objects#
To list the bucket contents as XML:
https://udacity-dend.s3.us-west-2.amazonaws.com/
To view a log-data JSON file, e.g.:
https://udacity-dend.s3.us-west-2.amazonaws.com/log-data/2018/11/2018-11-01-events.json
To view a song-data JSON file, e.g.:
https://udacity-dend.s3.us-west-2.amazonaws.com/song-data/A/A/A/TRAAAAK128F9318786.json
The following steps are provided by Udacity.
Go to the Amazon EMR Console.
Select "Clusters" in the menu and click the "Create cluster" button.
- Cluster name (something meaningful; in our case): sparkify-emr-20220911t113300
- Release: emr-5.20.0 (NOTE: stick with this exact version. I tried emr-5.33 and it gives the error: "Cluster does not have Jupyter Enterprise Gateway application installed". The Udacity forum suggests that sticking with emr-5.20.0 resolves this issue.)
- Applications: Spark: Spark 2.4.0 on Hadoop 2.8.5 YARN with Ganglia 3.7.2 and Zeppelin 0.8.0
- Instance type: m3.xlarge
- Number of instances: 3
- EC2 key pair: proceed without an EC2 key pair, or feel free to use one if you'd like
Keep the remaining default settings and click "Create cluster" on the bottom right.
It may take some time for the cluster to spin up.
Once you create the cluster, you'll see a status next to your cluster name that says Starting. Wait a short time for this status to change to Waiting before moving on to the next step.
Now that you've launched your cluster successfully, let's create a notebook to run Spark on that cluster.
Select "Notebooks" in the menu on the left, and click the "Create notebook" button.
Enter a name for your notebook
Select "Choose an existing cluster" and choose the cluster you just created
Use the default setting for "AWS service role" - this should be "EMR_Notebooks_DefaultRole" or "Create default role" if you haven't done this before. Note: if you use "Create default Role" it may return an error message. Ignore that error and click create notebook - it should go through.
You can keep the remaining default settings and click "Create notebook" on the bottom right.
Once you create an EMR notebook, you'll need to wait a short time before the notebook status changes from Starting or Pending to Ready. Once your notebook status is Ready, click the "Open" button to open the notebook.
Once the notebook is open, change the kernel (Kernel -> Change Kernel -> PySpark).
Open question: do we need to update this part of the code (the .config() part)? (Answer: probably not.)
from pyspark.sql import SparkSession


def create_spark_session():
    """Create (or get) a SparkSession with the hadoop-aws package for S3 access."""
    return SparkSession \
        .builder \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
        .getOrCreate()
Create S3 Bucket:
- https://knowledge.udacity.com/questions/762997
- https://knowledge.udacity.com/questions/762290
- https://knowledge.udacity.com/questions/121606
- https://knowledge.udacity.com/questions/101153 - implies we don't need to make the S3 bucket public (as we are going to delete it at the end of the exercise anyway)
- https://knowledge.udacity.com/questions/132987 - when writing to the S3 data-lake bucket, make sure we use the correct scheme in the Spark config: use s3a instead of s3n.
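In practice, the s3a-vs-s3n advice in the last reference just means every S3 path in etl.py should use the s3a:// scheme. A small sketch (the output bucket name is ours; replace it with yours):

```python
# With hadoop-aws 2.7.x, the s3a:// connector should be used; the older
# s3n:// scheme is deprecated and can fail or perform poorly.
INPUT_DATA = "s3a://udacity-dend/"
OUTPUT_DATA = "s3a://sparkify-dl-20220911t113300/"  # replace with your bucket


def s3a_path(base, *parts):
    """Join a base s3a:// URL with sub-path components."""
    return base.rstrip("/") + "/" + "/".join(p.strip("/") for p in parts)
```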