
GetStarted_Standalone


Running TensorFlowOnSpark on a Spark Standalone cluster (Single Host)

We illustrate how to use TensorFlowOnSpark on a Spark Standalone cluster running on a single machine. Note that TensorFlowOnSpark does not work in Spark Local (single-process) mode, since it expects the executors to be running in separate processes.

Clone TensorFlowOnSpark code

git clone https://github.com/yahoo/TensorFlowOnSpark.git
cd TensorFlowOnSpark
export TFoS_HOME=$(pwd)

Install Spark

Download Apache Spark per the instructions on the Spark downloads page. When done, make sure to set the following environment variables:

export SPARK_HOME=<path to Spark>
export PATH=${SPARK_HOME}/bin:${PATH}
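To confirm that Spark is installed and on your PATH, a quick sanity check (the output will vary with your Spark version):

spark-submit --version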

Install TensorFlow and TensorFlowOnSpark

Please build and install TensorFlow per its installation instructions.

For example, using the pip install method, you should be able to install TensorFlow as follows:

sudo pip install tensorflow
sudo pip install tensorflowonspark

To view the installed packages:

pip list
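As an additional sanity check, you can verify that both packages import cleanly from the same Python interpreter (a minimal check; the printed version and path will vary with your installation):

python -c "import tensorflow as tf; print(tf.__version__)"
python -c "import tensorflowonspark; print(tensorflowonspark.__file__)"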

Download MNIST data

mkdir ${TFoS_HOME}/mnist
pushd ${TFoS_HOME}/mnist
curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"
popd
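The four gzipped MNIST files should now be present locally, e.g.:

ls -l ${TFoS_HOME}/mnist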

Start Spark Standalone Cluster

export MASTER=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=1 
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES})) 
${SPARK_HOME}/sbin/start-master.sh; ${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 3G ${MASTER}

Go to the Spark Master Web UI and confirm that the expected number of workers (${SPARK_WORKER_INSTANCES}) has been launched.
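By default, the Master Web UI is served on port 8080 of the master host (the Spark standalone default; it may differ if that port was already taken). You can also check it quickly from the command line, e.g.:

curl -s http://localhost:8080 | grep -i worker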

Test PySpark, TensorFlow, and TensorFlowOnSpark

Start a pyspark shell and import tensorflow and tensorflowonspark. If everything is set up correctly, you shouldn't see any errors.

pyspark
>>> import tensorflow as tf
>>> from tensorflowonspark import TFCluster
>>> exit()

Convert the MNIST zip files using Spark

cd ${TFoS_HOME}
# rm -rf examples/mnist/csv
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
${TFoS_HOME}/examples/mnist/mnist_data_setup.py \
--output examples/mnist/csv \
--format csv
ls -lR examples/mnist/csv
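Each record is written as a line of comma-separated values in Spark part files under the images and labels directories. To peek at a single record (the part file names shown here are simply what Spark typically produces):

head -n 1 $(ls ${TFoS_HOME}/examples/mnist/csv/train/labels/part-* | head -1)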

Run distributed MNIST training (using feed_dict)

Note: In some environments, you may need to add the paths to libcuda*.so, libjvm.so, and libhdfs.so libraries to spark.executorEnv.LD_LIBRARY_PATH and set spark.executorEnv.CLASSPATH=$(hadoop classpath --glob).
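For example, in such an environment the extra settings added to the spark-submit command might look like the following; the library directories shown are placeholders and should be adjusted to wherever libcuda*.so, libjvm.so, and libhdfs.so are installed on your machine:

--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64:${JAVA_HOME}/jre/lib/amd64/server:${HADOOP_HOME}/lib/native" \
--conf spark.executorEnv.CLASSPATH="$(hadoop classpath --glob)" \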

# rm -rf mnist_model
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model

ls -l mnist_model
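Under the hood, mnist_spark.py uses the TFCluster API to reserve the Spark executors as TensorFlow nodes and then feeds RDD partitions into the graph via feed_dict. A rough driver-side sketch of that pattern (simplified, not the exact example code; map_fun, args, and dataRDD stand in for the example's own training function, parsed arguments, and input RDD):

from tensorflowonspark import TFCluster

# reserve executors: cluster_size nodes total, one of them acting as a parameter server
cluster = TFCluster.run(sc, map_fun, args, args.cluster_size, num_ps=1,
                        tensorboard=False, input_mode=TFCluster.InputMode.SPARK)
cluster.train(dataRDD, args.epochs)   # push RDD partitions to the workers (feed_dict)
cluster.shutdown()                    # wait for training to finish and release executors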

Run distributed MNIST inference (using feed_dict)

# rm -rf predictions
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/test/images \
--labels examples/mnist/csv/test/labels \
--mode inference \
--format csv \
--model mnist_model \
--output predictions

less predictions/part-00000

The prediction result should look like:

2017-02-10T23:29:17.009563 Label: 7, Prediction: 7
2017-02-10T23:29:17.009677 Label: 2, Prediction: 2
2017-02-10T23:29:17.009721 Label: 1, Prediction: 1
2017-02-10T23:29:17.009761 Label: 0, Prediction: 0
2017-02-10T23:29:17.009799 Label: 4, Prediction: 4
2017-02-10T23:29:17.009838 Label: 1, Prediction: 1
2017-02-10T23:29:17.009876 Label: 4, Prediction: 4
2017-02-10T23:29:17.009914 Label: 9, Prediction: 9
2017-02-10T23:29:17.009951 Label: 5, Prediction: 6
2017-02-10T23:29:17.009989 Label: 9, Prediction: 9
2017-02-10T23:29:17.010026 Label: 0, Prediction: 0

Interactive Learning with Jupyter Notebook

Install additional software required by Jupyter Notebooks.

sudo pip install jupyter

Launch a Jupyter notebook on the master node.

pushd ${TFoS_HOME}/examples/mnist
PYSPARK_DRIVER_PYTHON="jupyter" \
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
pyspark  --master ${MASTER} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME"

This should launch Jupyter in a browser. Open any of the example notebooks and follow the instructions within.

Shutdown Spark cluster

${SPARK_HOME}/sbin/stop-slave.sh; ${SPARK_HOME}/sbin/stop-master.sh

Running TensorFlowOnSpark on a Spark Standalone cluster (Multiple Hosts)

For multi-host Spark Standalone clusters, you will still need some form of distributed filesystem that spans the hosts; in many cases, this is HDFS. If your setup uses HDFS, TensorFlow requires the directory containing libhdfs.so to be on your LD_LIBRARY_PATH in order to read and write files on HDFS. This can be done by adding the following config to your spark-submit commands:

--conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_HDFS

where LIB_HDFS is set to the directory containing libhdfs.so in your setup.
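For example, you might locate the library and export the variable like this (the paths shown are only illustrative; the actual location varies by Hadoop distribution):

find / -name "libhdfs.so" 2>/dev/null        # locate the library on your system
export LIB_HDFS=/usr/lib/hadoop/lib/native   # e.g., a common location; use the directory found above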
