In this tutorial, I'm going to set up a data environment with Amazon EMR, Apache Spark, and Jupyter Notebook. Apache Spark has become extremely popular for big data processing and machine learning, and EMR makes it incredibly simple to provision a Spark cluster in minutes! At Mozilla we frequently spin up Spark clusters to perform data analysis, and we have a repository of scripts for provisioning our clusters. The scripts in my repository extract the functionality that is specific to creating a simple Spark cluster and installing Jupyter Notebook on the main node of the cluster.
The major assumption I make in the following tutorial is that your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are accessible to awscli. One way to handle this is to place the following environment variables in the environment file of your respective shell. There may be other solutions to this problem, but this is the one I personally use.
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
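To confirm that awscli can actually see these credentials, a quick optional check (not part of the original scripts) is to ask AWS who you are authenticated as:
# verify that awscli picks up the credentials from the environment
aws sts get-caller-identity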
In order to access the cluster via the command line later, you need to generate a Key Pair to ssh into the main node. I haven't been able to figure out a way to create a Key Pair using awscli and have it work with the remainder of the script, so I recommend setting up a Key Pair using the EC2 User Guide. Make sure to place the private key in this directory in order to run the script.
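If the downloaded private key is not already locked down, ssh will later refuse to use it, so it is worth restricting its permissions now (MyKeyPair.pem below is a placeholder for whatever you named your key):
# restrict the private key so ssh accepts it
chmod 400 MyKeyPair.pem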
The following two variables need to be altered based on your use case. They are found in install-jupyter-notebook and script.sh.
# name of the S3 bucket to create and use
SPARK_BUCKET="bucket-to-create-on-s3"
# name of the key pair created in the first step, e.g. MyKeyPair
SPARK_KEY_PAIR="key-pair-created-in-first-step"
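If the bucket referenced by SPARK_BUCKET does not exist yet, one way to create it is with the AWS CLI; the bucket name and region below are placeholders and should match the values you chose:
# create the S3 bucket used by the provisioning scripts
aws s3 mb s3://bucket-to-create-on-s3 --region us-west-1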
jupyter_notebook_config.py is used to configure Jupyter Notebook on the main node. As an example, this file can be altered to set a password for access to notebooks. The snippet below generates the hashed password to use.
from notebook.auth import passwd
# get a hashed password
passwd('password')
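Calling passwd() returns a hash (a string starting with something like sha1:). That hash, not the plain-text password, is what goes into jupyter_notebook_config.py; the value below is a placeholder, not a real hash:
# in jupyter_notebook_config.py: require the hashed password for notebook access
c = get_config()
c.NotebookApp.password = u'sha1:<hash-generated-above>'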
After following the above steps, you can run the script to provision the cluster using bash script.sh. Each command can also be run separately in your shell if that is preferred.
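For context, the heart of script.sh is an EMR cluster creation call. The command below is only a rough sketch of what such a call can look like, not the actual contents of the script; the release label, instance type, and instance count are illustrative assumptions:
# sketch of an EMR cluster creation command (not the actual script.sh)
aws emr create-cluster \
  --name "spark-jupyter" \
  --release-label emr-5.4.0 \
  --applications Name=Spark \
  --ec2-attributes KeyName=key-pair-created-in-first-step \
  --instance-type m4.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://bucket-to-create-on-s3/install-jupyter-notebook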
To forward the notebook server and access Jupyter locally, we invoke the command below. Make sure that the private key being used has the proper permissions before running the command (if you followed the AWS guide, it should)!
ssh -L 8888:localhost:8888 hadoop@ec2-**-***-***-*.us-west-1.compute.amazonaws.com -i <key pair>.pem
Now we can open localhost:8888 in a web browser and access our Spark context as if it were running locally on our computer.
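As a quick sanity check that the notebook is really talking to the cluster, you can run a small job. This assumes the kernel exposes a SparkContext as sc, which is how the Jupyter setup installed by these scripts is intended to work:
# trivial Spark job: sum of squares computed on the cluster
rdd = sc.parallelize(range(1000))
print(rdd.map(lambda x: x * x).sum())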
- break script.sh into a user-configurable file and a template file (allow parameters?)