Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning

This repository contains the source code for the NSDI '23 paper Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning.

We built our implementation atop Gavel, the open-sourced codebase of the OSDI '20 paper Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads. We would like to thank the Gavel authors for open-sourcing their implementation!

Release notes

Sep 2022: We have released the first version of Shockwave! Please see the documentation below to get started. In the upcoming months, we will gradually make the following updates:

Add shell scripts and documentation for deploying Shockwave on a physical cluster
Add shell scripts and documentation for more simulation experiments
Add bibtex information and hyperlinks for the arXiv release
Clean up the Shockwave codebase for better readability
Add plotting scripts

Directory Structure

`scheduler`

Code for the scheduler, including the scheduling mechanism and simulator (scheduler.py), implementations of scheduling policies (policies/), GavelIterator as a Python module, and a communication stack between the scheduler and workers that uses gRPC (runtime/).

`workloads`

Implementations of target workloads in PyTorch, including changes needed to integrate with the GavelIterator.

`accordion_workloads` and `gns_workloads`

Workload scripts built on top of those in workloads, each implementing its respective dynamic adaptation technique: Accordion or Gradient Noise Scale (GNS).

Setting up the Software Dependencies

Shockwave/Gavel is implemented in Python. We have tested Shockwave/Gavel on Ubuntu 18.04 with Python 3.6.9. Python can be installed using Miniconda.

Required software dependencies can be installed using:

apt-get -y install cmake g++ gcc libnuma-dev make numactl zlib1g-dev
pip install -r scheduler/requirements.txt
cd scheduler; make

In addition to the software dependencies required to run Gavel, Shockwave requires the Gurobi Optimizer. Gurobi offers academic licenses. Note that you might need to connect to your university's network or use a VPN to download Gurobi. Please see the Gurobi website for more details.
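As a quick sanity check after installation, the following minimal sketch reports which Python modules cannot be found on the import path. It is not part of the Shockwave codebase. The helper name `missing_modules` is hypothetical, and the sketch assumes the Gurobi Python bindings are installed under their usual module name, `gurobipy`. It only checks that the module can be located; it does not validate the Gurobi license.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be located."""
    # find_spec returns None when a top-level module is not on the import path.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: Gurobi's Python bindings are importable as `gurobipy`.
required = ["gurobipy"]
missing = missing_modules(required)
if missing:
    print("Missing Python modules: " + ", ".join(missing))
else:
    print("All checked modules were found.")
```

If `gurobipy` is reported missing, install the Gurobi Python bindings by following the instructions on the Gurobi website before running the Shockwave policy.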

Getting Started

Gavel's policies (including Shockwave) and scheduling mechanism can be evaluated either in simulation or on a physical cluster.

To reproduce our canonical results in simulation in ~10 minutes, run scheduler/reproduce/tacc_32gpus.sh. For detailed instructions on how to reproduce more results from the NSDI paper, see EXPERIMENTS.md.

References

@misc{zheng2022shockwave,
      title={Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning},
      author={Pengfei Zheng and Rui Pan and Tarannum Khan and Shivaram Venkataraman and Aditya Akella},
      year={2022},
      eprint={2210.00093},
      archivePrefix={arXiv},
      primaryClass={cs.DC}
}
