Crossbow is a multi-GPU system for training deep learning models that allows users to freely choose their preferred batch size, however small, while still scaling training to multiple GPUs.
Crossbow utilises modern GPUs better than other systems by training multiple model replicas on the same GPU. When the batch size is sufficiently small to leave GPU resources unused, Crossbow trains a second model replica, a third, etc., as long as training throughput increases.
To synchronise many model replicas, Crossbow uses synchronous model averaging to adjust the trajectory of each individual replica based on the average of all. With model averaging, the batch size does not increase linearly with the number of model replicas, as it would with synchronous SGD. This yields better statistical efficiency without cumbersome hyper-parameter tuning when trying to scale training to a larger number of GPUs.
See our VLDB 2019 paper for more details.
The system supports a variety of training algorithms, including synchronous SGD. We are also working on seamlessly porting existing TensorFlow models to Crossbow.
Crossbow has been primarily tested on Ubuntu Linux 16.04. It requires the following Linux packages:
$ sudo apt-get install build-essential git openjdk-8-jdk maven libboost-all-dev graphviz wget
Crossbow requires NVIDIA's CUDA toolkit, the cuDNN library, and the NCCL library (currently using versions 8.0, 6.0, and 2.1.15, respectively). After successful installation, make sure that:
- CUDA_HOME is set (the default location is /usr/local/cuda)
- NCCL_HOME is set

and that:

- PATH includes $CUDA_HOME/bin
- LD_LIBRARY_PATH includes $CUDA_HOME/lib64 and $NCCL_HOME/lib
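For example, with CUDA in its default location, you could add something like the following to your shell start-up file (the NCCL path below is only an example; point it to wherever NCCL is installed on your machine):

$ export CUDA_HOME=/usr/local/cuda
$ export NCCL_HOME=/usr/local/nccl
$ export PATH=$CUDA_HOME/bin:$PATH
$ export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$NCCL_HOME/lib:$LD_LIBRARY_PATH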
Crossbow also requires the OpenBLAS and libjpeg-turbo libraries. After successful installation, make sure that:
- BLAS_HOME is set (the default location is /opt/OpenBLAS)
- JPEG_HOME is set

and that:

- LD_LIBRARY_PATH includes $BLAS_HOME/lib and $JPEG_HOME/lib
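Similarly, assuming the default OpenBLAS location (the libjpeg-turbo path is only an example; adjust it to your installation):

$ export BLAS_HOME=/opt/OpenBLAS
$ export JPEG_HOME=/opt/libjpeg-turbo
$ export LD_LIBRARY_PATH=$BLAS_HOME/lib:$JPEG_HOME/lib:$LD_LIBRARY_PATH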
Crossbow uses page-locked memory regions to speed up data transfers from CPU to GPU and vice versa. The amount of memory locked by the system usually exceeds the default OS limit. Edit /etc/security/limits.conf
and append the following lines to the end of the file:
* hard memlock unlimited
* soft memlock unlimited
Save changes and reboot the machine.
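After rebooting, you can check that the new limit is in effect with ulimit; it should report unlimited:

$ ulimit -l
unlimited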
Assuming all environment variables have been set, build Crossbow's Java and C/C++ library:
$ git clone http://github.com/lsds/Crossbow.git
$ cd Crossbow
$ export CROSSBOW_HOME=`pwd`
$ ./scripts/build.sh
Note: We will shortly add an installation script as well as a Docker image to simplify the installation process and avoid library conflicts.
Crossbow serialises ImageNet images and their labels into a binary format similar to TensorFlow's TFRecord. Follow TensorFlow's instructions to download and convert the dataset to TFRecord format. You will end up with 1,024 training and 128 validation record files in a directory of your choice (say, /data/imagenet/tfrecords). Then, run:
$ cd $CROSSBOW_HOME
$ ./scripts/datasets/imagenet/prepare-imagenet.sh /data/imagenet/tfrecords /data/imagenet/crossbow
The script will convert TensorFlow's record files to Crossbow's own binary format and store them in /data/imagenet/crossbow. You are now ready to train ResNet-50 with the ImageNet data set:
$ ./scripts/benchmarks/resnet-50.sh
To train LeNet with the MNIST data set, run the two scripts below. The first downloads MNIST and converts it to Crossbow's binary record format; output files are written to $CROSSBOW_HOME/data/mnist/b-001 and are tailored to a specific batch size (in this case, 1). The second trains LeNet with the MNIST data set:
$ cd $CROSSBOW_HOME
$ ./scripts/datasets/mnist/prepare-mnist.sh
$ ./scripts/benchmarks/lenet.sh
Crossbow supports the entire ResNet family of neural networks. It also supports VGG-16 based on the implementation here. It supports the convnet-benchmarks suite of micro-benchmarks too.
Note: We will shortly add a page describing how to configure Crossbow's system parameters.
Crossbow represents a deep learning application as a data flow graph: nodes represent operations and edges represent the data (multi-dimensional arrays, also known as tensors) that flows between them. The most notable operators are inner-product, pooling, and convolution layers, and activation functions. Some of these operators have learnable parameters (also multi-dimensional arrays) that form part of the model being trained. An inner-product operator, for example, has two learnable parameters, weights and bias:
InnerProductConf conf = new InnerProductConf ();
/* Let's assume that there are 10 possible output labels, as in MNIST */
conf.setNumberOfOutputs (10);
/* Initialise weights with values drawn from a Gaussian distribution;
 * initialise all bias elements with the same constant value */
conf.setWeightInitialiser (new InitialiserConf ().setType (InitialiserType.GAUSSIAN).setStd(0.1F));
conf.setBiasInitialiser (new InitialiserConf ().setType (InitialiserType.CONSTANT).setValue(1F));
/* Create inner-product operator and wrap it in a graph node */
Operator op = new Operator ("InnerProduct", new InnerProduct (conf));
DataflowNode innerproduct = new DataflowNode (op);
Connect data flow nodes together to form a neural network. For example, we can connect the forward layers of a logistic regression model:
innerproduct.connectTo(softmax).connectTo(loss);
Finally, we can construct our model and train it for 1 epoch:
SubGraph subgraph = new SubGraph (innerproduct);
Dataflow dataflow = new Dataflow (subgraph).setPhase(Phase.TRAIN);
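/* Only a training data flow is provided here; the second array slot (e.g. for a test data flow) is left null */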
ExecutionContext context = new ExecutionContext (new Dataflow [] { dataflow, null });
context.init();
context.train(1, TrainingUnit.EPOCHS);
The full source code is available here.