
feat: all reduce bench slurm pyxis #101

Merged · 3 commits · Mar 9, 2025

Conversation

OrenLeung (Contributor) commented Mar 8, 2025

The SLURM Pyxis Container Plugin allows for easily reproducible scripts where the environment is containerized, so the benchmark harness script does not rely on any host-machine dependencies besides standard SLURM, the pyxis SLURM plugin, and the NVIDIA drivers.

The SLURM Pyxis Container Plugin is widely used across many companies and is increasingly being adopted.

On CSPs that have enabled the SLURM Pyxis Container Plugin, such as CoreWeave, Crusoe, Oracle, Azure, etc., all_reduce_bench.py can easily be run and reproduced via the following command:

sbatch -n <num_of_nodes> ./all_reduce_bench_pyxis.sbatch

Note that this launcher will also work on AWS and GCP once you swap nvcr.io#nvidia/pytorch:25.02-py3 for an AWS- or GCP-specific container image that includes all of the required env vars and the matching NCCL net plugin: AWS EFA, or GCP GPUDirect-TCPX / GPUDirect-TCPXO (aka FasTrak) or gIB.
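For readers unfamiliar with pyxis, a sketch of what such a launcher can look like is below. This is a hypothetical illustration, not the actual all_reduce_bench_pyxis.sbatch from this PR: the resource counts, the mount path, and the assumption that the benchmark reads torchrun-style RANK/LOCAL_RANK/WORLD_SIZE env vars are all assumptions for the sake of the example.

```shell
#!/bin/bash
#SBATCH --job-name=all_reduce_bench
#SBATCH --ntasks-per-node=8      # one task per GPU (hypothetical HGX 8-GPU node)
#SBATCH --gpus-per-node=8
#SBATCH --exclusive

# Rendezvous info derived from SLURM: first node in the allocation hosts rank 0
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_ADDR
export MASTER_PORT=6000          # arbitrary free port, an assumption

# pyxis adds --container-image / --container-mounts to srun; the image is
# pulled and the command runs inside it, with no host-side pip installs needed
srun --container-image=nvcr.io#nvidia/pytorch:25.02-py3 \
     --container-mounts="$PWD":/workspace \
     bash -c 'RANK=$SLURM_PROCID LOCAL_RANK=$SLURM_LOCALID WORLD_SIZE=$SLURM_NTASKS \
              python -u /workspace/all_reduce_bench.py'
```

The only host dependencies are SLURM itself, the pyxis plugin, and the NVIDIA drivers, which is what makes the run reproducible across clusters.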

Testing

I ran some quick smoke tests to verify that this script works properly on two HGX H100 700W SXM nodes (16 GPUs total) connected via 400G InfiniBand NDR at one hop distance.

(screenshots of the benchmark output attached)

@stas00 stas00 merged commit 9dcec87 into stas00:master Mar 9, 2025
stas00 (Owner) commented Mar 9, 2025

Thanks a lot, Oren!

OrenLeung (Contributor, Author) commented

Thanks for the PR review, @stas00!
