A simple Spark standalone cluster for testing purposes: your Spark development environment is just a docker-compose up away.
The Docker compose file will create the following containers:

Container | Exposed Ports |
---|---|
spark-master | 9090, 7077 |
spark-worker-1 | 9091 |
spark-worker-2 | 9092 |
demo-database | 5432 |
The following steps will get your Spark cluster's containers up and running. Prerequisites:

- Docker installed
- Docker compose installed
First, build the Spark image used by the master and worker containers:

```sh
docker build -t cluster-apache-spark:3.0.2 .
```
The final step to create your test cluster is to run the compose file:

```sh
docker-compose up -d
```
To validate your cluster, just access the Spark UI of the master and of each worker through their exposed ports (the master UI on port 9090 and the worker UIs on 9091 and 9092, per the port table above).
This cluster ships with two workers and one Spark master, each with a particular resource allocation (basically RAM and CPU core allocation).
- The default CPU core allocation for each Spark worker is 1 core.
- The default RAM allocation for each Spark worker is 1024 MB.
- The default RAM allocation for Spark executors is 256 MB.
- The default RAM allocation for the Spark driver is 128 MB.
- If you wish to modify these allocations, just edit the env/spark-worker.sh file (see the sketch below).
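These defaults correspond to standard Spark standalone environment variables. A minimal sketch of what env/spark-worker.sh might contain is shown below; the exact variable names and values in the repo may differ, so treat this as an assumption and check the file itself:

```sh
# Sketch only: verify against the actual env/spark-worker.sh in the repo
SPARK_WORKER_CORES=1        # CPU cores per worker
SPARK_WORKER_MEMORY=1G      # RAM per worker (1024 MB)
SPARK_DRIVER_MEMORY=128m    # default driver memory
SPARK_EXECUTOR_MEMORY=256m  # default executor memory
```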
To make running apps easier, I've shipped two volume mounts, described in the following table:
Host Mount | Container Mount | Purpose |
---|---|---|
apps | /opt/spark-apps | Used to make your app's jars available on all workers and the master |
data | /opt/spark-data | Used to make your app's data available on all workers and the master |
This is basically a dummy DFS created from Docker volumes... (maybe not...)
This program just loads archived data from MTA Bus Time and applies basic filters using Spark SQL; the results are persisted into a PostgreSQL table (a rough PySpark sketch of this kind of job follows the table below).
The loaded table will contain the following structure:
latitude | longitude | time_received | vehicle_id | distance_along_trip | inferred_direction_id | inferred_phase | inferred_route_id | inferred_trip_id | next_scheduled_stop_distance | next_scheduled_stop_id | report_hour | report_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|
40.668602 | -73.986697 | 2014-08-01 04:00:01 | 469 | 4135.34710710144 | 1 | IN_PROGRESS | MTA NYCT_B63 | MTA NYCT_JG_C4-Weekday-141500_B63_123 | 2.63183804205619 | MTA_305423 | 2014-08-01 04:00:00 | 2014-08-01 |
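A minimal PySpark sketch of this kind of job is shown below. This is an illustration only, not the repo's actual main.py: the input file name, the specific filter, and the database name, table, and credentials are all assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mta-filter-demo").getOrCreate()

# Load the archived MTA Bus Time data shared through the /opt/spark-data mount
# (the file name here is hypothetical)
df = spark.read.csv("/opt/spark-data/mta_bus_time.csv", header=True, inferSchema=True)

# Apply a basic filter with Spark SQL
df.createOrReplaceTempView("bus_reports")
result = spark.sql("SELECT * FROM bus_reports WHERE inferred_phase = 'IN_PROGRESS'")

# Persist the result into a PostgreSQL table on the demo-database container;
# this JDBC write is why postgresql-42.2.22.jar is passed with --jars
(result.write.format("jdbc")
    .option("url", "jdbc:postgresql://demo-database:5432/mta_data")  # database name is an assumption
    .option("dbtable", "mta_reports")                                # table name is an assumption
    .option("user", "postgres")                                      # credentials are assumptions
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")
    .save())

spark.stop()
```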
To submit the app, connect to one of the workers or the master and execute:

```sh
/opt/spark/bin/spark-submit --master spark://spark-master:7077 \
--jars /opt/spark-apps/postgresql-42.2.22.jar \
--driver-memory 1G \
--executor-memory 1G \
/opt/spark-apps/main.py
```
This program takes the archived data from MTA Bus Time and makes some aggregations on it; the calculated results are persisted in PostgreSQL tables.
Each persisted table corresponds to a particular aggregation (a rough PySpark sketch of these aggregations follows the table below):
Table | Aggregation |
---|---|
day_summary | A summary of vehicles reporting, stops visited, average speed, and distance traveled (all vehicles) |
speed_excesses | Speed excesses calculated in a 5-minute window |
average_speed | Average speed by vehicle |
distance_traveled | Total distance traveled by vehicle |
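The actual job is a Scala application (mta.processing.MTAStatisticsApp, submitted below), but a rough PySpark sketch of a couple of these aggregations could look like the following. Column names beyond those listed in the schema above (for example a speed column), the input location, and the exact logic are assumptions, not the real implementation:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mta-aggregations-demo").getOrCreate()

# Assume the bus reports are available as a DataFrame (input location is hypothetical)
reports = spark.read.parquet("/opt/spark-data/bus_reports.parquet")

# average_speed: average speed by vehicle (assumes a `speed` column exists)
average_speed = reports.groupBy("vehicle_id").agg(F.avg("speed").alias("average_speed"))

# speed_excesses: speed excesses calculated in a 5-minute window
# (the threshold of 50 is an arbitrary example value)
speed_excesses = (reports
    .groupBy("vehicle_id", F.window("time_received", "5 minutes"))
    .agg(F.max("speed").alias("max_speed"))
    .filter(F.col("max_speed") > 50))

# distance_traveled: total distance traveled by vehicle
# (a simplification; the real app's logic may differ)
distance_traveled = (reports
    .groupBy("vehicle_id")
    .agg(F.sum("distance_along_trip").alias("distance_traveled")))

# Each result would then be written to its own PostgreSQL table
# (average_speed, speed_excesses, distance_traveled) using the same
# JDBC options as in the previous example.
```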
To submit the app, connect to one of the workers or the master and execute:

```sh
/opt/spark/bin/spark-submit --deploy-mode cluster \
--master spark://spark-master:7077 \
--total-executor-cores 1 \
--class mta.processing.MTAStatisticsApp \
--driver-memory 1G \
--executor-memory 1G \
--jars /opt/spark-apps/postgresql-42.2.22.jar \
--conf spark.driver.extraJavaOptions='-Dconfig-path=/opt/spark-apps/mta.conf' \
--conf spark.executor.extraJavaOptions='-Dconfig-path=/opt/spark-apps/mta.conf' \
/opt/spark-apps/mta-processing.jar
```
You will notice in the Spark UI a driver program and an executor program running (in Scala we can use deploy-mode cluster).
- We compiled the necessary Docker image to run Spark master and worker containers.
- We created a Spark standalone cluster with 2 worker nodes and 1 master node using docker && docker-compose.
- We copied the resources necessary to run demo applications.
- We ran a distributed application at home (you just need enough CPU cores and RAM to do so).
- This is intended to be used for test purposes; basically a way of running distributed Spark apps on your laptop or desktop.
- This will be useful for CI/CD pipelines for your Spark apps (a really difficult and hot topic).
- Follow the steps above to run the docker-compose file. You can scale this down to 1 worker if needed:
```sh
docker-compose up --scale spark-worker=1
docker exec -it docker-spark-cluster_spark-worker_1 bash
apt update
apt install python3-pip
pip3 install pyspark
pyspark
```
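Note that running pyspark with no extra arguments starts a local-mode session inside that container. To exercise the standalone cluster instead, start the shell as pyspark --master spark://spark-master:7077, or run a short script like the following (a minimal sketch, assuming the master URL from the setup above):

```python
from pyspark.sql import SparkSession

# Connect the session to the standalone master instead of running in local mode
spark = (SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("interactive-smoke-test")
    .getOrCreate())

# Quick smoke test: this sum should run as tasks on the worker(s)
print(spark.range(1000).selectExpr("sum(id) AS total").collect())
```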
- Right now, to run applications in deploy-mode cluster it is necessary to specify an arbitrary driver port (see the example after this list).
- The spark-submit entry in start-spark.sh is unimplemented; the submits used in the demos can be triggered from any worker.
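For example, the driver port can be pinned by adding a line like the following to the cluster-mode spark-submit shown earlier (a sketch; the port value 7078 is an arbitrary example, and the exact ports used by the repo are an assumption):

```sh
# Added to the cluster-mode spark-submit command above (port value is an example)
--conf spark.driver.port=7078
```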