This repository contains the source code implementation of the NSDI '23 paper Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning.
We built our implementation atop Gavel, the open-sourced codebase of the OSDI '20 paper Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads. We would like to thank the Gavel authors for open-sourcing their implementation!
Sep 2022: We have released the first version of Shockwave! Please see the documentation below to get started. In the upcoming months, we will gradually make the following updates:
- Add shell scripts and documentation for deploying Shockwave on a physical cluster
- Add shell scripts and documentation for more simulation experiments
- Add bibtex information and hyperlinks for the arXiv release
- Clean up the Shockwave codebase for better readability
- Add plotting scripts
Code for the scheduler, including the scheduling mechanism and simulator (scheduler.py), implementations of scheduling policies (policies/), GavelIterator as a Python module, and a communication stack between the scheduler and workers that uses gRPC (runtime/).
Implementations of target workloads in PyTorch, including changes needed to integrate with the GavelIterator.
Workload scripts built on top of those in workloads, each implementing its respective dynamic adaptation optimization: Accordion or Gradient Noise Scale (GNS).
Shockwave/Gavel is implemented in Python. We have tested Shockwave/Gavel on Ubuntu 18.04 with Python 3.6.9. Python can be installed using Miniconda.
Required software dependencies can be installed using:
apt-get -y install cmake g++ gcc libnuma-dev make numactl zlib1g-dev
pip install -r scheduler/requirements.txt
cd scheduler; make
In addition to the software dependencies required to run Gavel, Shockwave requires the Gurobi Optimizer. An academic license can be requested from Gurobi. Note that you might need to connect to your university's network or use a VPN to download Gurobi. Please see the Gurobi website for more details.
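After installation, a quick sanity check of the environment can save a failed run later. The following is a minimal sketch, not part of the Shockwave codebase: the helper names `has_module` and `environment_report` are hypothetical, and it assumes Gurobi's Python bindings are installed under their usual package name, `gurobipy`. It only reports whether the package can be imported; it does not verify that a valid Gurobi license is present.

```python
import importlib.util
import sys

# Python version this README reports testing against.
# Other versions are untested here, not necessarily broken.
TESTED_PYTHON = (3, 6, 9)


def has_module(name):
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


def environment_report():
    """Collect a few facts about the current interpreter (hypothetical helper)."""
    version = tuple(sys.version_info[:3])
    return {
        "python_version": version,
        "matches_tested_python": version == TESTED_PYTHON,
        # Assumption: Gurobi's Python bindings are importable as `gurobipy`.
        "gurobipy_importable": has_module("gurobipy"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If `gurobipy_importable` is reported as `False`, the Gurobi Python bindings are likely not installed in the active environment.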
Gavel's policies (including Shockwave) and scheduling mechanism can be evaluated either in simulation or on a physical cluster.
To reproduce our canonical results in simulation in ~10 minutes, run scheduler/reproduce/tacc_32gpus.sh. For detailed instructions on how to reproduce more results from the NSDI paper, see EXPERIMENTS.md.
@misc{zheng2022shockwave,
title={Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning},
author={Pengfei Zheng and Rui Pan and Tarannum Khan and Shivaram Venkataraman and Aditya Akella},
year={2022},
eprint={2210.00093},
archivePrefix={arXiv},
primaryClass={cs.DC}
}