This Docker image runs Apache Spark in cluster mode with one master and a configurable number of slave (worker) nodes.
- Set up Docker and docker-compose first
- Build the image using included Dockerfile
docker-compose build
- Spin up a Spark cluster with 1 master and 2 slaves (as an example)
docker-compose up --scale master=1 --scale slave=2
- Verify that the cluster is running by going to http://localhost:8080. Note: if you are running Docker on OS X or Windows, replace localhost with the docker host VM IP address, which you can get by running
docker-machine ip
- Verify that the Jupyter notebook server is running by going to http://localhost:8888 (substituting the docker host VM IP address in the same way if needed)
- Destroy the cluster
docker-compose down
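To check that jobs actually run on the cluster, execute the following PySpark snippet (for example from a notebook on the Jupyter server above), replacing the placeholder with the docker machine / host IP: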
import pyspark

# Point the driver at the standalone Spark master running in Docker;
# replace the placeholder with the docker machine / host IP.
conf = pyspark.SparkConf()
conf.setMaster("spark://<docker machine IP>:7077")
conf.setAppName('test')

# Create a context and run a trivial job: sum the integers 0-99.
sc = pyspark.SparkContext(conf=conf)
rdd = sc.parallelize(range(100))
print(rdd.reduce(lambda x, y: x + y))  # prints 4950
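The job should print 4950 and appear as an application in the Spark master UI on port 8080.

If the image ships Spark 2.x or later, the same check can also be written with the DataFrame API. The snippet below is a minimal sketch under that assumption; SparkSession and the test-dataframe app name are illustrative, not part of the original instructions.

from pyspark.sql import SparkSession

# Build a session against the standalone master; use the same
# docker machine / host IP as in the SparkConf example above.
spark = (SparkSession.builder
         .master("spark://<docker machine IP>:7077")
         .appName("test-dataframe")
         .getOrCreate())

# 100-row DataFrame with a single 'id' column (0-99); the sum is again 4950.
df = spark.range(100)
print(df.groupBy().sum("id").first()[0])

spark.stop()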
Need to add support for the following components and improvements:
- Scala
- PySpark
- HDFS
- Zeppelin
- Jupyter
- Instructions on setting up in Azure/AWS with Docker Swarm
- Run containers in some kind of process manager