This project implements stochastic gradient descent for support vector machines (SVM) on Scala Spark. It was done for the course Systems for Data Science @ EPFL, 2019.
You need a running bash
, sbt
and kubectl
.
Download those with your favorite installer like apt-get
on Ubuntu or brew
on macOS.
The script will itself download and install Spark 2.4 in the folder deploy
.
To run the code locally, the data set needs to be downloaded. You can use this script to download the data.
./download-data.sh
In case you want to run the code on a Kubernetes cluster, it's not necessary to download the data, since the complete data set already needs to be present in a PersistentVolume.
The ./run.sh
exposes 5 different parameters:
-w
where to run the program:local
or in thecluster
. [default local]-n
the number of executor instances that will be allocated [default 5]-s
the subset size that is sampled by SGD [default 1000]-e
the number of epochs the algorithm is maximally run [default 1000]-p
the name of the executed pod on Kubernetes [default pod]
For example:
./run.sh -w cluster -n 5 -s 50 -e 1000 -p mypod
runs the code on the cluster, with 5 workers, subset size 50, 1000 epochs and 'mypod' as pod name.
This project is based on the excellent work of the hogwild-python
implementation by liabifano
that we forked here.