A simple Spark standalone cluster for testing purposes: your Spark development environment is just a docker-compose up away.
The Docker compose file will create the following containers:

Container | Exposed Ports |
---|---|
spark-master | 9090, 7077 |
spark-worker-1 | 9091 |
spark-worker-2 | 9092 |
demo-database | 5432 |
The following steps will get your Spark cluster's containers up and running. Prerequisites:

- Docker installed
- Docker compose installed
First, build the Spark image used by the master and worker containers:

```sh
docker build -t cluster-apache-spark:3.0.2 .
```
The final step to create your test cluster is to run the compose file:

```sh
docker-compose up -d
```
To validate your cluster, just access the Spark UI of the master and of each worker through their exposed ports (the master UI on port 9090 and the worker UIs on 9091 and 9092, per the port table above).
This cluster ships with two workers and one Spark master, each with a particular resource allocation (basically RAM and CPU core allocation).
- The default CPU core allocation for each Spark worker is 1 core.
- The default RAM allocation for each Spark worker is 1024 MB.
- The default RAM allocation for Spark executors is 256 MB.
- The default RAM allocation for the Spark driver is 128 MB.
- If you wish to modify these allocations, just edit the env/spark-worker.sh file (see the sketch below).
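These defaults correspond to standard Spark standalone environment variables. A minimal sketch of what env/spark-worker.sh might contain is shown below; the exact variable names and values in the repo may differ, so treat this as an assumption and check the file itself:

```sh
# Sketch only: verify against the actual env/spark-worker.sh in the repo
SPARK_WORKER_CORES=1        # CPU cores per worker
SPARK_WORKER_MEMORY=1G      # RAM per worker (1024 MB)
SPARK_DRIVER_MEMORY=128m    # default driver memory
SPARK_EXECUTOR_MEMORY=256m  # default executor memory
```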
To make running apps easier, I've shipped two volume mounts, described in the following table:
Host Mount | Container Mount | Purpose |
---|---|---|
apps | /opt/spark-apps | Used to make your app's jars available on all workers and the master |
data | /opt/spark-data | Used to make your app's data available on all workers and the master |
This is basically a dummy DFS created from Docker volumes... (maybe not...)
This program just loads archived data from MTA Bus Time and applies basic filters using Spark SQL; the results are persisted into a PostgreSQL table (a rough PySpark sketch of this kind of job follows the table below).
The loaded table will contain the following structure:
latitude | longitude | time_received | vehicle_id | distance_along_trip | inferred_direction_id | inferred_phase | inferred_route_id | inferred_trip_id | next_scheduled_stop_distance | next_scheduled_stop_id | report_hour | report_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|
40.668602 | -73.986697 | 2014-08-01 04:00:01 | 469 | 4135.34710710144 | 1 | IN_PROGRESS | MTA NYCT_B63 | MTA NYCT_JG_C4-Weekday-141500_B63_123 | 2.63183804205619 | MTA_305423 | 2014-08-01 04:00:00 | 2014-08-01 |
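A minimal PySpark sketch of this kind of job is shown below. This is an illustration only, not the repo's actual main.py: the input file name, the specific filter, and the database name, table, and credentials are all assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mta-filter-demo").getOrCreate()

# Load the archived MTA Bus Time data shared through the /opt/spark-data mount
# (the file name here is hypothetical)
df = spark.read.csv("/opt/spark-data/mta_bus_time.csv", header=True, inferSchema=True)

# Apply a basic filter with Spark SQL
df.createOrReplaceTempView("bus_reports")
result = spark.sql("SELECT * FROM bus_reports WHERE inferred_phase = 'IN_PROGRESS'")

# Persist the result into a PostgreSQL table on the demo-database container;
# this JDBC write is why postgresql-42.2.22.jar is passed with --jars
(result.write.format("jdbc")
    .option("url", "jdbc:postgresql://demo-database:5432/mta_data")  # database name is an assumption
    .option("dbtable", "mta_reports")                                # table name is an assumption
    .option("user", "postgres")                                      # credentials are assumptions
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")
    .save())

spark.stop()
```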
To submit the app, connect to one of the workers or the master and execute:

```sh
/opt/spark/bin/spark-submit --master spark://spark-master:7077 \
--jars /opt/spark-apps/postgresql-42.2.22.jar \
--driver-memory 1G \
--executor-memory 1G \
/opt/spark-apps/main.py
```
This program takes the archived data from MTA Bus Time and makes some aggregations on it; the calculated results are persisted in PostgreSQL tables.
Each persisted table corresponds to a particular aggregation (a rough PySpark sketch of these aggregations follows the table below):
Table | Aggregation |
---|---|
day_summary | A summary of vehicles reporting, stops visited, average speed, and distance traveled (all vehicles) |
speed_excesses | Speed excesses calculated in a 5-minute window |
average_speed | Average speed by vehicle |
distance_traveled | Total distance traveled by vehicle |
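The actual job is a Scala application (mta.processing.MTAStatisticsApp, submitted below), but a rough PySpark sketch of a couple of these aggregations could look like the following. Column names beyond those listed in the schema above (for example a speed column), the input location, and the exact logic are assumptions, not the real implementation:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mta-aggregations-demo").getOrCreate()

# Assume the bus reports are available as a DataFrame (input location is hypothetical)
reports = spark.read.parquet("/opt/spark-data/bus_reports.parquet")

# average_speed: average speed by vehicle (assumes a `speed` column exists)
average_speed = reports.groupBy("vehicle_id").agg(F.avg("speed").alias("average_speed"))

# speed_excesses: speed excesses calculated in a 5-minute window
# (the threshold of 50 is an arbitrary example value)
speed_excesses = (reports
    .groupBy("vehicle_id", F.window("time_received", "5 minutes"))
    .agg(F.max("speed").alias("max_speed"))
    .filter(F.col("max_speed") > 50))

# distance_traveled: total distance traveled by vehicle
# (a simplification; the real app's logic may differ)
distance_traveled = (reports
    .groupBy("vehicle_id")
    .agg(F.sum("distance_along_trip").alias("distance_traveled")))

# Each result would then be written to its own PostgreSQL table
# (average_speed, speed_excesses, distance_traveled) using the same
# JDBC options as in the previous example.
```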
To submit the app, connect to one of the workers or the master and execute:

```sh
/opt/spark/bin/spark-submit --deploy-mode cluster \
--master spark://spark-master:7077 \
--total-executor-cores 1 \
--class mta.processing.MTAStatisticsApp \
--driver-memory 1G \
--executor-memory 1G \
--jars /opt/spark-apps/postgresql-42.2.22.jar \
--conf spark.driver.extraJavaOptions='-Dconfig-path=/opt/spark-apps/mta.conf' \
--conf spark.executor.extraJavaOptions='-Dconfig-path=/opt/spark-apps/mta.conf' \
/opt/spark-apps/mta-processing.jar
```
You will notice in the Spark UI a driver program and an executor program running (in Scala we can use deploy-mode cluster).
- We compiled the necessary Docker image to run Spark master and worker containers.
- We created a Spark standalone cluster with 2 worker nodes and 1 master node using docker && docker-compose.
- We copied the resources necessary to run demo applications.
- We ran a distributed application at home (you just need enough CPU cores and RAM to do so).
- This is intended to be used for test purposes; basically a way of running distributed Spark apps on your laptop or desktop.
- This will be useful for CI/CD pipelines for your Spark apps (a really difficult and hot topic).
- Follow the steps above to run the docker-compose file. You can scale this down to 1 worker if needed:
```sh
docker-compose up --scale spark-worker=1
docker exec -it docker-spark-cluster_spark-worker_1 bash
apt update
apt install python3-pip
pip3 install pyspark
pyspark
```
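Note that running pyspark with no extra arguments starts a local-mode session inside that container. To exercise the standalone cluster instead, start the shell as pyspark --master spark://spark-master:7077, or run a short script like the following (a minimal sketch, assuming the master URL from the setup above):

```python
from pyspark.sql import SparkSession

# Connect the session to the standalone master instead of running in local mode
spark = (SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("interactive-smoke-test")
    .getOrCreate())

# Quick smoke test: this sum should run as tasks on the worker(s)
print(spark.range(1000).selectExpr("sum(id) AS total").collect())
```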
- Right now, to run applications in deploy-mode cluster it is necessary to specify an arbitrary driver port (see the example after this list).
- The spark-submit entry in start-spark.sh is unimplemented; the submits used in the demos can be triggered from any worker.
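For example, the driver port can be pinned by adding a line like the following to the cluster-mode spark-submit shown earlier (a sketch; the port value 7078 is an arbitrary example, and the exact ports used by the repo are an assumption):

```sh
# Added to the cluster-mode spark-submit command above (port value is an example)
--conf spark.driver.port=7078
```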