
feat: all reduce bench slurm pyxis #101

Merged · 3 commits · Mar 9, 2025

Conversation

OrenLeung (Contributor) commented Mar 8, 2025

The SLURM Pyxis Container Plugin allows for easily reproducible scripts where the environment is containerized, so the benchmark harness script does not rely on any host-machine dependencies besides standard SLURM, the pyxis SLURM plugin, and the NVIDIA drivers.

The SLURM Pyxis Container Plugin is widely used across many companies and is increasingly being adopted.

On CSPs that have enabled the SLURM Pyxis Container Plugin, such as CoreWeave, Crusoe, Oracle, Azure, etc., all_reduce_bench.py can easily be run and reproduced via the following command:

sbatch -n <num_of_nodes> ./all_reduce_bench_pyxis.sbatch

Note that this launcher will also work on AWS and GCP once you swap nvcr.io#nvidia/pytorch:25.02-py3 for an AWS- or GCP-specific container image that includes all of the required env vars and the matching NCCL net plugin: AWS EFA, or GCP GPUDirect-TCPX / GPUDirect-TCPXO (aka FasTrak) or gIB.
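For readers unfamiliar with pyxis, a sketch of what such a launcher can look like is below. This is a hypothetical illustration, not the actual all_reduce_bench_pyxis.sbatch from this PR: the resource counts, the mount path, and the assumption that the benchmark reads torchrun-style RANK/LOCAL_RANK/WORLD_SIZE env vars are all assumptions for the sake of the example.

```shell
#!/bin/bash
#SBATCH --job-name=all_reduce_bench
#SBATCH --ntasks-per-node=8      # one task per GPU (hypothetical HGX 8-GPU node)
#SBATCH --gpus-per-node=8
#SBATCH --exclusive

# Rendezvous info derived from SLURM: first node in the allocation hosts rank 0
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_ADDR
export MASTER_PORT=6000          # arbitrary free port, an assumption

# pyxis adds --container-image / --container-mounts to srun; the image is
# pulled and the command runs inside it, with no host-side pip installs needed
srun --container-image=nvcr.io#nvidia/pytorch:25.02-py3 \
     --container-mounts="$PWD":/workspace \
     bash -c 'RANK=$SLURM_PROCID LOCAL_RANK=$SLURM_LOCALID WORLD_SIZE=$SLURM_NTASKS \
              python -u /workspace/all_reduce_bench.py'
```

The only host dependencies are SLURM itself, the pyxis plugin, and the NVIDIA drivers, which is what makes the run reproducible across clusters.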

Testing

I ran some quick smoke tests to verify that this script works properly on two HGX H100 700W SXM nodes (16 GPUs total) connected via 400G InfiniBand NDR at one hop distance.

(screenshots of the benchmark output attached)

@stas00 stas00 merged commit 9dcec87 into stas00:master Mar 9, 2025
stas00 (Owner) commented Mar 9, 2025

Thanks a lot, Oren!

OrenLeung (Contributor, Author) commented

Thanks for the PR review, @stas00!
