Running Spark on Amazon EC2

This guide describes how to get Spark running on an EC2 cluster. It assumes you have already signed up for Amazon EC2 account on the Amazon Web Services site.

Basic Process

Download Mesos using the instructions on the Mesos wiki.
Launch a Mesos EC2 cluster following the EC2 guide on the Mesos wiki. (Essentially, this involves setting some environment variables and running a Python script.)
Log into your EC2 cluster's master node using mesos-ec2 -k <keypair> -i <key-file> login cluster-name.
Go into the spark directory in root's home directory.
Optionally edit conf/spark-env.sh to add extra classes to SPARK_CLASSPATH. If you do this, you will also need to copy your classes to the same location to all nodes. You can use the script ~/mesos-ec2/copy-dir to do this (give it the name of a directory to copy). After you've copied the classes, make sure you also copy conf/spark-env.sh to all the nodes by running ~/mesos-ec2/copy-dir /root/spark/conf.
Run either spark-shell or another Spark program, setting the Mesos master to use to 1@<ec2-master-node>:5050. You can also find this master URL in the file ~/mesos-ec2/cluster-name in newer versions of Mesos.
Use the Mesos web UI at http://<ec2-master-node>:8080 to view the status of your job.

Using a Newer Spark

The Mesos EC2 machines may not come with the latest version of Spark. To use a newer version, you can run git pull to pull in /root/spark to pull in the latest version of Spark from git, and build it using make. You will also need to copy it to all the other nodes in the cluster using ~/mesos-ec2/copy-dir /root/spark.

Accessing Data in S3

Spark's file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form s3n://<id>:<secret>@<bucket>/path, where <id> is your Amazon access key ID and <secret> is your Amazon secret access key. Note that you should escape any / characters in the secret key as %2F. Full instructions can be found on the Hadoop S3 page.

In addition to using a single input file, you can also use a directory of files as input by simply giving the path to the directory.

https://github.com/apache/spark/blob/master/.asf.yaml/#6090

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Running Spark on Amazon EC2

Basic Process

Using a Newer Spark

Accessing Data in S3

Clone this wiki locally