Assignment Distributed Training

In this assignment, we will explore how to implement the communication protocols for Data Parallel and Tensor Model Parallel training from scratch using Message Passing Interface (MPI) and NumPy.

Since the focus is not on the forward computation or backward propagation themselves, we provide template code that covers the forward and backward logic and leaves only the communication parts for you to implement.

Setup Guide

We will use the GHC clusters with multi-core machines for this assignment. To start, log into a GHC cluster machine ghc[X].ghc.andrew.cmu.edu, where X is between 47 and 86, with your andrew_id and password:

ssh [andrew_id]@ghc[X].ghc.andrew.cmu.edu

Then clone this repo and set up your virtual environment:

git clone https://github.com/mlsyscourse/assignment-distributed-training.git
cd assignment-distributed-training
pip install virtualenv
python3 -m venv workspace
source workspace/bin/activate
pip install -r requirements.txt

Once you have set up your virtual environment, you can resume your workspace next time with:

ssh [andrew_id]@ghc[X].ghc.andrew.cmu.edu
cd assignment-distributed-training
source workspace/bin/activate

You can exit the current virtual environment by running

deactivate

You are not required to use the same [X] every time. If some nodes are down for maintenance, switch to another node.

You are not required to use the GHC machines, as long as your preferred platform has MPI support and at least 8 cores.

Part 0. Warm-up

The goal of this assignment is to walk you through a 2D parallel training pipeline step by step, combining tensor model parallelism and data parallelism. For tensor model parallelism we further consider both the naive and the Megatron-style variants.

To get familiar with the communication primitives, we will start by playing around with the MPI package we just installed.

MPI Test

To verify that mpi4py has been set up correctly for distributed workloads, run:

mpirun -n 8 python mpi-test.py

On the GHC cluster machines, you can launch at most 8 processes with the -n argument. We also provide toy examples of the MPI functions in mpi-test.py, including Allreduce(), Allgather(), Reduce_scatter(), and Split(). Note that these four MPI functions are the only ones required and allowed in this assignment.

  • All-Reduce

[figure: all-reduce]

You can see an all-reduce (op=min) example by running:

mpirun -n 8 python mpi-test.py --test_case allreduce

  • All-Gather

[figure: all-gather]

You can see an all-gather example by running:

mpirun -n 8 python mpi-test.py --test_case allgather

  • Reduce-Scatter

[figure: reduce-scatter]

You can see a reduce-scatter example by running:

mpirun -n 8 python mpi-test.py --test_case reduce_scatter

  • Split

[figure: split]

Split is especially helpful when you want to apply MPI functions on a per-group basis. You can see a Split-enabled group-wise reduction example by running:

mpirun -n 8 python mpi-test.py --test_case split

When playing with the different test cases, get familiar with the underlying MPI functions and check whether the output matches your expectations.
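For quick reference, here is a condensed, self-contained sketch of how the four allowed collectives are typically invoked through mpi4py's buffer-based (uppercase) interface. It is separate from mpi-test.py, and the array shapes and values are illustrative only:

```python
# Condensed mpi4py sketch (not part of the starter code); run with:
#   mpirun -n 8 python collectives_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# All-Reduce: every rank contributes a vector, every rank receives the elementwise sum.
x = np.full(4, rank, dtype=np.float64)
reduced = np.empty_like(x)
comm.Allreduce(x, reduced, op=MPI.SUM)

# All-Gather: every rank receives the concatenation of all ranks' vectors.
gathered = np.empty(4 * size, dtype=np.float64)
comm.Allgather(x, gathered)

# Reduce-Scatter: elementwise sum across ranks, then each rank keeps one equal chunk.
big = np.arange(4 * size, dtype=np.float64)
chunk = np.empty(4, dtype=np.float64)
comm.Reduce_scatter(big, chunk, recvcounts=[4] * size, op=MPI.SUM)

# Split: ranks sharing the same color form their own sub-communicator,
# so later collectives can be applied group-wise.
sub_comm = comm.Split(color=rank // 2, key=rank)
group_sum = np.empty_like(x)
sub_comm.Allreduce(x, group_sum, op=MPI.SUM)

print(f"rank {rank}: reduced={reduced[0]}, chunk[0]={chunk[0]}, group_sum={group_sum[0]}")
```

Each rank prints the same global sum but a different reduce-scatter chunk, and ranks in the same Split group print the same group sum.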

Node Indexing Specifications

Given a data and model parallel size, we assign nodes in model-parallel-major order for this assignment. For instance, for mp_size=2, dp_size=4 on 8 nodes, we group the nodes as shown below:

[figure: node grouping for mp_size=2, dp_size=4]
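If it helps to see the ordering concretely, the loop below spells out one reading of the model-parallel-major layout; the index names mp_idx and dp_idx are illustrative, not necessarily those used by the starter code:

```python
# Illustrative only: model-parallel-major rank layout for mp_size=2, dp_size=4.
mp_size, dp_size = 2, 4
for rank in range(mp_size * dp_size):
    mp_idx = rank % mp_size   # position within the rank's model parallel group
    dp_idx = rank // mp_size  # which data parallel group the rank belongs to
    print(f"rank {rank}: mp_idx={mp_idx}, dp_idx={dp_idx}")
# Under this reading, ranks {0, 1} form one model parallel group, {2, 3} the next, and so on.
```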

Part 1. Data Split for Data Parallel Training (10 pts)

For this part, your task is to implement the split_train function in data/data_parallel_preprocess.py.

The function takes in the training data and returns the split determined by the given mp_size, dp_size, and rank. Split the data uniformly across data parallel groups; all model parallel ranks within the same data parallel group share the same data split. The data length is guaranteed to be evenly divisible by dp_size in all our test cases.

Hint: for mp_size=2, dp_size=4, you should split the data this way:

[figure: data split for mp_size=2, dp_size=4]
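As a hedged starting point, the sketch below shows the contiguous-slice idea under the assumption that only the data parallel index determines the slice; the real split_train signature and return values are defined in data/data_parallel_preprocess.py and should be followed instead:

```python
import numpy as np

# Illustrative sketch of contiguous splitting across dp groups; the real
# split_train signature and return values live in the starter code.
def split_by_dp_group(x_train: np.ndarray, dp_size: int, dp_idx: int) -> np.ndarray:
    per_group = x_train.shape[0] // dp_size      # guaranteed to divide evenly
    start = dp_idx * per_group
    return x_train[start : start + per_group]    # mp ranks in the same dp group reuse this slice
```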

To test your implementation, please run

python3 -m pytest -l -v tests/test_data_split.py

Part 2. Layer Initialization (20 pts)

In this part, your task is to gather the information needed for model and data parallel training, which is then used to initialize the corresponding layers in your model.

For this assignment we will work with a simple two-layer perceptron model, as shown below:

[figure: two-layer perceptron model]

You are only required to implement the forward and backward communication around the two fully connected layers. We have already taken care of everything else, i.e., the forward/backward computations and the training pipeline, as these are not relevant to the goal of this assignment.

For data parallelism, we simply split the batch of data equally across the data parallel groups:

[figure: data parallel batch split]

For naive tensor model parallel training, we split the weight matrices of both fully connected layers (fc1, fc2) along the output dimension (partition output) and shard them across different nodes. (Note that we do not place different layers on different nodes, since we do not consider pipeline parallelism here.)

[figure: naive tensor model parallel sharding]

For Megatron-style tensor model parallel training, we split the weight matrix of FC1 along the output dimension (partition output) and the weight matrix of FC2 along the input dimension (reduce output), and shard them across different nodes.

[figure: Megatron-style tensor model parallel sharding]

Given the above information, you need to implement the get_info function in model/func_impl.py. The function collects essential information for later parts, including the model/data parallel indices, the model/data parallel communication groups, and the in/out dimensions of the two FC layers. Please refer to the function for more information and hints.
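To make the two sharding schemes concrete, here is a small hedged sketch of how the per-rank FC dimensions could be derived; the variable names are illustrative, the input dimension 784 is just a placeholder, and get_info's docstring defines the exact quantities you must return:

```python
# Illustrative only: per-rank FC dimensions for a 2-layer MLP with full sizes
# in_dim -> hidden_dim -> out_dim, sharded over mp_size model parallel ranks.
def shard_dims_sketch(in_dim, hidden_dim, out_dim, mp_size, megatron: bool):
    if not megatron:
        # Naive MP: both FC1 and FC2 are split along their output dimension.
        fc1 = (in_dim, hidden_dim // mp_size)
        fc2 = (hidden_dim, out_dim // mp_size)
    else:
        # Megatron-style: FC1 split along its output dim, FC2 split along its input dim.
        fc1 = (in_dim, hidden_dim // mp_size)
        fc2 = (hidden_dim // mp_size, out_dim)
    return fc1, fc2  # (per-rank in_dim, per-rank out_dim) for each layer

print(shard_dims_sketch(784, 256, 10, mp_size=2, megatron=False))  # ((784, 128), (256, 5))
print(shard_dims_sketch(784, 256, 10, mp_size=2, megatron=True))   # ((784, 128), (128, 10))
```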

To test your implementation, please run

mpirun -n 8 python3 -m pytest -l -v --with-mpi tests/test_get_info.py

Part 3. Naive Model Parallel Forward Communication (15 pts)

[figure: naive model parallel forward communication layers A and B]

Since communication only happens around the FC2 layer in the model we defined, your task in this part is to implement the forward communication in FC2 for naive model parallelism. You need to implement the naive_collect_forward_input and naive_collect_forward_output functions in model/func_impl.py, which correspond to the two communication layers (A, B) shown above.
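Conceptually, both communication layers follow the same gather-and-concatenate pattern; the sketch below illustrates it, assuming the per-rank shards are concatenated along the feature axis in rank order (the starter functions define the exact argument names and expected shapes):

```python
from mpi4py import MPI
import numpy as np

# Conceptual sketch only: gather per-rank activation shards of shape (batch, local_dim)
# from every model parallel rank and stitch them back together along the feature axis.
def allgather_along_features(local: np.ndarray, mp_comm: MPI.Comm) -> np.ndarray:
    mp_size = mp_comm.Get_size()
    gathered = np.empty((mp_size, *local.shape), dtype=local.dtype)
    mp_comm.Allgather(np.ascontiguousarray(local), gathered)   # (mp_size, batch, local_dim)
    return np.concatenate([gathered[i] for i in range(mp_size)], axis=1)  # (batch, mp_size * local_dim)
```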

To test your implementations, please run

mpirun -n 4 python3 -m pytest -l -v --with-mpi tests/test_naive_mp_forward.py

Part 4. Megatron-Style Model Parallel Forward Communication (15 pts)

[figure: Megatron-style model parallel forward communication layers A and B]

Similarly, in this part your task is to implement the forward communication in FC2 for Megatron-style model parallelism. You need to implement the megatron_collect_forward_input and megatron_collect_forward_output functions in model/func_impl.py, which correspond to the two communication layers (A, B) shown above.
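Conceptually this mirrors Megatron's f/g operators: the FC2 input is already the local shard produced by FC1, so layer A needs no communication, while layer B sums the per-rank partial outputs. A hedged sketch of that pattern, not the required implementation:

```python
from mpi4py import MPI
import numpy as np

# Conceptual sketch only. Layer A: the sharded hidden activation is used as-is.
def megatron_forward_input_sketch(local_h: np.ndarray) -> np.ndarray:
    return local_h  # identity: no communication needed

# Layer B: each rank holds a partial sum over its input shard; all-reduce to get the full output.
def megatron_forward_output_sketch(partial_out: np.ndarray, mp_comm: MPI.Comm) -> np.ndarray:
    full_out = np.empty_like(partial_out)
    mp_comm.Allreduce(np.ascontiguousarray(partial_out), full_out, op=MPI.SUM)
    return full_out
```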

To test your implementations, please run

mpirun -n 4 python3 -m pytest -l -v --with-mpi tests/test_megatron_mp_forward.py

Part 5. Naive Model Parallel Backward Communication (20 pts)

In this part your task is to implement the backward communication in FC2 for naive model parallelism. You need to implement the naive_collect_backward_output and naive_collect_backward_x functions in model/func_impl.py, which are the communication functions for collecting output_grad (at communication layer B) and grad_x (at communication layer A), respectively.

To get a sense of how these functions are used in the training pipeline, refer to model/Layers.py, but note that you do not need to modify this file.
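Conceptually, the backward path undoes the forward gathers: the full output gradient is sliced back to each rank's output shard (layer B), and the per-rank gradients with respect to the gathered FC2 input are summed and re-sharded along the hidden dimension with a reduce-scatter (layer A). The sketch below illustrates this pattern only; the actual function signatures and expected shapes are defined in the starter code:

```python
from mpi4py import MPI
import numpy as np

# Conceptual sketch only: layer B slices the full output gradient down to the local shard.
def slice_output_grad_sketch(full_grad: np.ndarray, mp_rank: int, mp_size: int) -> np.ndarray:
    local_dim = full_grad.shape[1] // mp_size
    return full_grad[:, mp_rank * local_dim : (mp_rank + 1) * local_dim]

# Layer A: sum the per-rank gradients of the gathered input and keep only this rank's
# slice of the hidden dimension, implemented as a reduce-scatter.
def reduce_scatter_grad_x_sketch(grad_full_h: np.ndarray, mp_comm: MPI.Comm) -> np.ndarray:
    mp_size = mp_comm.Get_size()
    batch, hidden = grad_full_h.shape
    local_dim = hidden // mp_size
    # Reorder so rank i's hidden slice occupies the i-th contiguous block of the send buffer.
    send = np.ascontiguousarray(
        grad_full_h.reshape(batch, mp_size, local_dim).transpose(1, 0, 2)
    )
    local = np.empty((batch, local_dim), dtype=grad_full_h.dtype)
    mp_comm.Reduce_scatter(send, local, recvcounts=[batch * local_dim] * mp_size, op=MPI.SUM)
    return local
```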

To test your implementations, please run

mpirun -n 4 python3 -m pytest -l -v --with-mpi tests/test_naive_mp_backward.py

Part 6. Megatron-Style Model Parallel Backward Communication (10 pts)

Similarly, in this part your task is to implement the backward communication in FC2 for Megatron-style model parallelism. You need to implement the megatron_collect_backward_output and megatron_collect_backward_x functions in model/func_impl.py.

Hint: this part might be much simpler than you would have thought ^-^. But do walk through the process and understand why this works and what the benefits of Megatron-style model parallelism are.

To test your implementations, please run

mpirun -n 4 python3 -m pytest -l -v --with-mpi tests/test_megatron_mp_backward.py

Part 7. Gradients Communication for Data Parallel Training (10 pts)

For data parallel training, each data parallel group runs its forward and backward passes independently and then aggregates the gradients for the weight update.

In this part, your task is to implement the communication function used by all FC layers to obtain the aggregated weight gradients. You need to implement the collect_weight_grad function in model/func_impl.py.
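The pattern here is the classic data parallel gradient all-reduce over the data parallel communicator; a hedged sketch follows (whether the tests expect the sum or the mean of the gradients is something to check against the starter code):

```python
from mpi4py import MPI
import numpy as np

# Conceptual sketch only: sum a weight gradient across the data parallel group.
def allreduce_weight_grad_sketch(local_grad: np.ndarray, dp_comm: MPI.Comm) -> np.ndarray:
    total = np.empty_like(local_grad)
    dp_comm.Allreduce(np.ascontiguousarray(local_grad), total, op=MPI.SUM)
    return total  # divide by dp_comm.Get_size() here if a mean is expected instead
```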

To test your implementations, please run

mpirun -n 4 python3 -m pytest -l -v --with-mpi tests/test_collect_weight_grad.py

ZeRO-DP Background

So far, we have used classic data parallel training where parameters, gradients, and optimizer states are all replicated on every DP rank. This is simple, but memory usage grows quickly as model size increases.

In this extension, we assume the system is already in ZeRO Stage 3, where parameters, gradients, and optimizer states are all sharded across data-parallel ranks.

Instead of adding helper functions in model/func_impl.py, we create a standalone module model/zero_dp_stage3.py with:

  • a custom FC layer (ZeroDPStage3FCLayer) that stores only local parameter shards
  • a simple Adam optimizer (ZeroDPAdam) that maintains local optimizer-state shards

Your implementation should focus on the communication and tensor-sharding logic of Stage 3 (see the sketch after this list):

  • Allgather for reconstructing full parameters during forward/backward
  • Reduce_scatter for collecting reduced gradient shards
  • shard-local Adam state updates for optimizer moments
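A hedged sketch of that round trip with mpi4py and NumPy (the function names are illustrative and not the starter API): each rank keeps only a flat shard, temporarily all-gathers the full parameter when it is needed, and reduce-scatters the full gradient back into shard form.

```python
from mpi4py import MPI
import numpy as np

# Conceptual Stage-3 round trip; the full weight only exists transiently on each rank.
def gather_full_param_sketch(local_shard: np.ndarray, shape, comm: MPI.Comm) -> np.ndarray:
    flat = np.empty(local_shard.size * comm.Get_size(), dtype=local_shard.dtype)
    comm.Allgather(np.ascontiguousarray(local_shard), flat)  # rebuild the padded flat tensor
    return flat[: int(np.prod(shape))].reshape(shape)        # drop padding, restore shape

def scatter_grad_shard_sketch(full_grad: np.ndarray, shard_size: int, comm: MPI.Comm) -> np.ndarray:
    size = comm.Get_size()
    padded = np.zeros(shard_size * size, dtype=full_grad.dtype)
    padded[: full_grad.size] = full_grad.ravel()
    local = np.empty(shard_size, dtype=full_grad.dtype)
    comm.Reduce_scatter(padded, local, recvcounts=[shard_size] * size, op=MPI.SUM)
    return local  # summed gradient for this rank's parameter shard
```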

For conceptual background, read the ZeRO paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (Rajbhandari et al., SC'20), or review the lecture material.

Part 8. ZeRO Stage 3 FC Layer Forward/Backward (20 pts/20 pts)

In this part, your task is to implement _partition_flat_tensor and forward (used in the forward pass) and backward (used in the backward pass) for the ZeroDPStage3FCLayer class in model/zero_dp_stage3.py.

Key requirements:

  • _partition_flat_tensor should support non-divisible tensor sizes via zero-padding (see the sketch after this list). In real-world scenarios, AI architecture designers rarely create tensors that cannot be evenly divided across devices; here, because we use very simple test cases, we intentionally use non-divisible tensors to encourage you to think about the memory layout of the partitioned parameters.
  • Each rank stores only local shards of weight and bias.
  • Forward pass should all-gather shards to reconstruct full parameters and compute FC output.
  • Backward pass should compute local full gradients, reduce-scatter them into local gradient shards, and return grad_x.
  • Do not keep the full weights permanently materialized on every device; doing so will fail the autograder!
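Here is a minimal sketch of the zero-padded partitioning idea referenced in the first requirement; it is an illustration under the assumption of contiguous, equal-length shards, not the required _partition_flat_tensor signature:

```python
import numpy as np

# Illustrative sketch of equal-size partitioning with zero padding; the real
# _partition_flat_tensor signature lives in model/zero_dp_stage3.py.
def partition_flat_tensor_sketch(flat: np.ndarray, rank: int, world_size: int) -> np.ndarray:
    shard_len = -(-flat.size // world_size)           # ceiling division
    padded = np.zeros(shard_len * world_size, dtype=flat.dtype)
    padded[: flat.size] = flat                        # trailing entries stay zero
    return padded[rank * shard_len : (rank + 1) * shard_len].copy()
```

For example, a flat tensor of length 10 split over 4 ranks yields shards of length 3, with the last rank holding one real element followed by two zeros of padding.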

To test your implementation, run:

python3 -m pytest -l -v tests/test_zero_dp_stage3_partition.py
mpirun -n 4 python3 -m pytest -l -v --with-mpi tests/test_zero_dp_stage3_fc_forward.py
mpirun -n 4 python3 -m pytest -l -v --with-mpi tests/test_zero_dp_stage3_fc_backward.py

Notes:

  • Each of the above test files contains at least two hardcoded test cases.
  • The test data uses small integer values so you can manually verify communication and sharding logic.

Part 9. ZeRO Stage 3 Adam Optimizer State (10 pts)

In this part, your task is to implement the step method of ZeroDPAdam in model/zero_dp_stage3.py. Specific instructions are given in the comments.

Key requirements:

  • Optimizer states (m, v) should be maintained for the local parameter shards only (see the sketch after this list).
  • Keeping a higher-precision copy of the weights is optional, since we are on CPU and using float32 already.
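A minimal sketch of one shard-local Adam update (the hyperparameter names and in-place style are illustrative; follow ZeroDPAdam's comments for the exact conventions):

```python
import numpy as np

# Conceptual sketch of one Adam step applied purely to this rank's parameter shard.
# All arrays are local shards of the same shape; no communication is needed here.
def adam_shard_step_sketch(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m[:] = beta1 * m + (1.0 - beta1) * grad
    v[:] = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction, with t starting at 1
    v_hat = v / (1.0 - beta2 ** t)
    param -= lr * m_hat / (np.sqrt(v_hat) + eps)   # update only this rank's shard
```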

We validate this part with end-to-end training based on zero_dp_train.py:

  • train for 1 epoch
  • require final test accuracy > 85%

To run the training-based check, run:

python3 -m pytest -l -v tests/test_zero_dp_stage3_adam.py

Part 10. Unified Training

Now that you have implemented all the required communication functions for model/data parallel training, the actual training can be tested by running:

mpirun -n [num_nodes] python unified_train.py --mp_size [mp_size] --dp_size [dp_size] [--megatron-mp]

where you can set mp_size and dp_size arbitrarily. Note that you need to set num_nodes = mp_size * dp_size. You can add the --megatron-mp flag to enable Megatron-style model parallelism; otherwise, the default is naive model parallelism.

For naive model parallelism, you can only set mp_size=1 or 2, since mp_size must be a common divisor of both layers' out_dim, which in our case are 256 and 10.

For Megatron-style model parallelism, you can set mp_size=1, 2, 4, or 8.

The maximum num_nodes you can set is 8. Now you can try different combinations and check out the logged training information.

Note that we will not grade this part, but we highly recommend trying different configurations to better understand the communication/peak-memory trade-offs of the different approaches to distributed training.

Assignment Feedback (0 pt)

This is the second time we are offering this course, and we appreciate any assignment feedback. You can leave your feedback (if any) in feedback.txt and submit it together with the source code. Possible topics include:

  • How difficult do you think this assignment is?
  • How much time does the assignment take? Which task takes the most time?
  • Which part of the assignment did you find hard to understand?
  • And any other things you would like to share.

Your feedback will be very useful in helping us improve the assignment in future years.

How to Submit Your Homework (Important!)

We will be using the auto-grading feature in Autolab to score your submission for this assignment, so please follow the instructions carefully to meet the auto-grader hand-in requirements.

Now, in your assignment root directory, run

tar cvf handin.tar model/func_impl.py data/data_parallel_preprocess.py feedback.txt

This will create a tar archive containing func_impl.py, data_parallel_preprocess.py, and feedback.txt. You can check the contents of handin.tar with

tar tvf handin.tar

It is expected to list the three files:

-rw-rw-r-- ... model/func_impl.py
-rw-rw-r-- ... data/data_parallel_preprocess.py
-rw-rw-r-- ... feedback.txt

Then, please go to Autolab at https://autolab.andrew.cmu.edu/courses/15442-s26/assessments/Distributed-Training and submit the file handin.tar.

You can submit multiple times, and the timestamp of your latest submission will be used in determining any late penalties. Please make sure that your submitted tarball keeps func_impl.py and data_parallel_preprocess.py at the paths shown above (model/ and data/), or otherwise the autograder may not process your submission properly.

Any attempt to manipulate or compromise the integrity of the autograder will result in severe penalties.

If you are enrolled in the course (on SIO), but not registered on Autolab, please let the course staff know in a private post on Piazza.
