As Sparkify continues to have more and more users and it would be very beneficial to migrate from a datawarehouse to a data lake (please check the data modeling series in my github repos so you can follow up). This work aims to design an ETL data pipeline using Spark that extracts and store output files in S3 as a data Lake. The purpose of this project is to acquire new insights manipulating the easy-to-use data residing in parquet files in S3.
The solution is to create a star schema of fact and dimension tables optimized for queries on songplays analysis. To do that, we need to extract data which resides in S3 buckets, then load data into spark dataframes on EMR clusters,transform it into dimensional tables and finally store them back into S3 parquet files. In this case we are dealing with raw data type in JSON format that are located into two different S3 buckets. here are links for each :
s3://udacity-dend/song_data
.s3://udacity-dend/log_data
.
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here is a filepath to a file in this dataset: song_data/A/B/C/TRABCEI128F424C983.json
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings.
The log files in the dataset we will be using are partitioned by year and month. For example, here is a filepath to two file in this dataset: log_data/2018/11/2018-11-12-events.json
Please refer to this repo to get the schema as it is the same data schema as before.
There are several ways to configure and launch an EMR cluster, either programmatically or simply using AWS Console. For the sake of simplicity I will use the AWS Console to set up an EMR cluster.
Here my cluster's configuration :
3 EC2 instances ( 1 Master and 2 Slaves) of type m5.xlarge, 16 GiB memory, 64 GiB EBS storage each.
You can choose whatever configuration you want. The jobs took 12 minutes to finish.
So you can understand that your application time running will depend on the EMR configuration you set and so many else factors( like your internet connectivity speed)
Important note, I set up the EMR cluster and S3 in the same region, this way It would save me time.
- Create an IAM user with a programmatic access type, attach existing policies and set permission as Administrator Access, then choose Next: Tags, skip the tag page and choose Next:Review and choose Create User. Finally save credentials into a safe place.
- From the AWS console, click on Service, type 'EC2' to go to EC2 console, Choose Key Pairs in Network & Security on the left panel => Choose Create key pair, Type the name for the key pairs, File format: pem => Choose Create key pair . After this step, a .pem file will be automatically downloaded into your computer and will be used later.
- Launch an EMR cluster with the configuration you choose (either using the AWS Console or AWS CLI) and assign the key pair you created before (pem file).
- Edit your inbound rules in the security group and allow SSH access to your computer IP adress.
- Open up a terminal and SSH to your cluster (please refer to summary page of your EMR cluster=> connect to Master Node using SSH under Master public DNS)
- Once connected to cluster via your terminal create a new file ( nano etl.py ) and copy paste the code in the etl.py I provided in this work. Or you can SCP your file from your local machine to your host machine (please refer to the txt file in the repoto be able to copy your file).
- Create a new file called dl.cfg, copy and paste the content of my dl.cfg I provided and replace the blanks with your own credentials !! DO NOT explicitily plain write your credentials into to etl.py, do it my way and let the configParser do the job.
- Submit your etl.py using the command ( spark-submit --master yarn etl.py)
- Depending on your EMR cluster configuration, the execution could take a while ( 12 minutes in my case )
- You can monitor your job with SPARK UI ( you must set up a dynamic port forwarding first -- for more info please check the AWS documentation on PORT FORWARDING).
- For data quality you can launch a jupyter notebook on the EMR cluster and start playing with the dimentional table created in this work
- etl.py : a python file you will be using in your EMR cluster !! YOU NEED TO CHANGE THE "output_file" with your own S3 outputfile
- df.cfg : the configuration file that contains the IAM user ( keep the file's name & structure I use and replace the blanks with your own credentials)
- scp-method.txt : to copy your local files into your host machine