In this tutorial, I'm going to set up a data environment with Amazon EMR, Apache Spark, and Jupyter Notebook. Apache Spark has become extremely popular for big data processing and machine learning, and EMR makes it incredibly simple to provision a Spark cluster in minutes! At Mozilla we frequently spin up Spark clusters to perform data analysis, and we have a repository of scripts for provisioning our clusters. The scripts in my repository extract the functionality that is specific to creating a simple Spark cluster and installing Jupyter Notebook on the main node of the cluster.
The major assumption I make in the following tutorial is that your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are accessible to awscli. One way to handle this is to place the following environment variables in the environment file of your respective shell. There may be other solutions to this problem, but this is the one I personally use.
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
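To confirm that awscli can actually see these credentials, a quick optional check (not part of the original scripts) is to ask AWS who you are authenticated as:
# verify that awscli picks up the credentials from the environment
aws sts get-caller-identity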
In order to access the cluster via the command line later, you need to generate a Key Pair to ssh into the main node. I haven't been able to figure out a way to create a Key Pair using awscli and have it work with the remainder of the script, so I recommend setting up a Key Pair using the EC2 User Guide. Make sure to place the private key in this directory in order to run the script.
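If the downloaded private key is not already locked down, ssh will later refuse to use it, so it is worth restricting its permissions now (MyKeyPair.pem below is a placeholder for whatever you named your key):
# restrict the private key so ssh accepts it
chmod 400 MyKeyPair.pem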
The following two variables need to be altered based on your use case. They are found in install-jupyter-notebook and script.sh.
# name of the S3 bucket to create and use
SPARK_BUCKET="bucket-to-create-on-s3"
# name of the key pair created in the first step, e.g. MyKeyPair
SPARK_KEY_PAIR="key-pair-created-in-first-step"
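If the bucket referenced by SPARK_BUCKET does not exist yet, one way to create it is with the AWS CLI; the bucket name and region below are placeholders and should match the values you chose:
# create the S3 bucket used by the provisioning scripts
aws s3 mb s3://bucket-to-create-on-s3 --region us-west-1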
jupyter_notebook_config.py is used to configure Jupyter Notebook on the main node. As an example, this file can be altered to set a password for access to notebooks. The snippet below generates the hashed password to use.
from notebook.auth import passwd
# get a hashed password
passwd('password')
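Calling passwd() returns a hash (a string starting with something like sha1:). That hash, not the plain-text password, is what goes into jupyter_notebook_config.py; the value below is a placeholder, not a real hash:
# in jupyter_notebook_config.py: require the hashed password for notebook access
c = get_config()
c.NotebookApp.password = u'sha1:<hash-generated-above>'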
After following the above steps, you can run the script to provision the cluster using bash script.sh. Each command can also be run separately in your shell if that is preferred.
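For context, the heart of script.sh is an EMR cluster creation call. The command below is only a rough sketch of what such a call can look like, not the actual contents of the script; the release label, instance type, and instance count are illustrative assumptions:
# sketch of an EMR cluster creation command (not the actual script.sh)
aws emr create-cluster \
  --name "spark-jupyter" \
  --release-label emr-5.4.0 \
  --applications Name=Spark \
  --ec2-attributes KeyName=key-pair-created-in-first-step \
  --instance-type m4.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://bucket-to-create-on-s3/install-jupyter-notebook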
To forward the notebook server and access Jupyter locally, we invoke the command below. Make sure that the private key being used has the proper permissions before running the command (if you followed the AWS guide, it should)!
ssh -L 8888:localhost:8888 hadoop@ec2-**-***-***-*.us-west-1.compute.amazonaws.com -i <key pair>.pem
Now we can open localhost:8888 in a web browser and access our Spark context as if it were running locally on our computer.
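As a quick sanity check that the notebook is really talking to the cluster, you can run a small job. This assumes the kernel exposes a SparkContext as sc, which is how the Jupyter setup installed by these scripts is intended to work:
# trivial Spark job: sum of squares computed on the cluster
rdd = sc.parallelize(range(1000))
print(rdd.map(lambda x: x * x).sum())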
- break script.sh into a user-configurable file and a template file (allow parameters?)