Commit 9dcec87

OrenLeung and stas00 authored
feat: all reduce bench slurm pyxis (#101)
* feat: all reduce bench slurm pyxis
* Update network/benchmarks/README.md
* fix num nodes

Co-authored-by: =oren <=oren.leung@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
1 parent 6524f46 commit 9dcec87

File tree

2 files changed, +31 −1 lines changed


network/benchmarks/README.md

+7-1
@@ -79,9 +79,15 @@ Here is a simple all-reduce benchmark that you can use to quickly measure the th
 
 [all_reduce_bench.py](all_reduce_bench.py)
 
+On CSPs that have enabled the [SLURM Pyxis container plugin](https://github.com/NVIDIA/pyxis), such as CoreWeave, Crusoe, AWS, Oracle, Azure, and GCP, `all_reduce_bench.py` can easily be run and reproduced via the following command:
+```bash
+sbatch -n <num_of_nodes> ./all_reduce_bench_pyxis.sbatch
+```
+
 Usually benchmarking at least 4 nodes is recommended, but, of course, if you already have access to all the nodes you will be using during the training, benchmark using all of the nodes.
 
-To run it on 4 nodes:
+
+If you do not have access to a Pyxis SLURM environment, to run it on 4 nodes:
 
 ```
 GPUS_PER_NODE=8
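
To interpret what an all-reduce benchmark reports, NCCL-style benchmarks distinguish algorithmic bandwidth (payload size / time) from bus bandwidth, which applies the ring all-reduce correction `2*(n-1)/n` so that results are comparable across different rank counts. A minimal sketch of that math, with entirely hypothetical numbers (this is not output from the actual benchmark):

```shell
# Ring all-reduce bandwidth math as used by NCCL-style benchmarks.
# All numbers below are hypothetical, for illustration only.
SIZE_BYTES=$((4 * 1024 * 1024 * 1024))   # 4 GiB payload
NRANKS=32                                # e.g. 4 nodes x 8 GPUs
TIME_S=0.05                              # measured all-reduce time in seconds

# algbw = size / time; busbw = algbw * 2*(n-1)/n, accounting for each byte
# traversing the ring roughly twice (reduce-scatter + all-gather phases)
awk -v s="$SIZE_BYTES" -v n="$NRANKS" -v t="$TIME_S" 'BEGIN {
    algbw = s / t / 1e9
    busbw = algbw * 2 * (n - 1) / n
    printf "algbw %.2f GB/s, busbw %.2f GB/s\n", algbw, busbw
}'
# → algbw 85.90 GB/s, busbw 166.43 GB/s
```

Bus bandwidth is the number to compare against the advertised per-link interconnect speed.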
network/benchmarks/all_reduce_bench_pyxis.sbatch

+24 −0
@@ -0,0 +1,24 @@
+#!/bin/bash
+#SBATCH --job-name=all_reduce_bench_pyxis
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=1
+#SBATCH --gres=gpu:8
+#SBATCH --time=01:00:00
+
+# Set up environment variables for torchrun
+GPUS_PER_NODE=8
+NNODES=$SLURM_NNODES
+MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
+MASTER_PORT=6000
+
+srun --container-image=nvcr.io#nvidia/pytorch:25.02-py3 \
+    --container-mounts=$PWD:/workspace \
+    python -u -m torch.distributed.run \
+    --nproc_per_node $GPUS_PER_NODE \
+    --nnodes $NNODES \
+    --rdzv_endpoint ${MASTER_ADDR}:${MASTER_PORT} \
+    --rdzv_backend c10d \
+    --max_restarts 0 \
+    --role `hostname -s`':' \
+    --tee 3 \
+    all_reduce_bench.py
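
A note on sizing (not part of the commit): with `--ntasks-per-node=1`, `sbatch -n <num_of_nodes>` requests one task, and hence one node, per unit of `-n`, which is what `$SLURM_NNODES` then reflects inside the job. torchrun starts `--nproc_per_node` workers on each of those nodes, so the benchmark's world size is the product of the two. A minimal sketch with hypothetical values:

```shell
# Hypothetical values mirroring the sbatch script; in the real job
# NNODES comes from $SLURM_NNODES, set by `sbatch -n <num_of_nodes>`
GPUS_PER_NODE=8
NNODES=4

# torchrun launches GPUS_PER_NODE workers per node, so the all-reduce
# spans NNODES * GPUS_PER_NODE ranks in total
WORLD_SIZE=$((NNODES * GPUS_PER_NODE))
echo "world size: $WORLD_SIZE"   # → world size: 32
```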
