feat: all reduce bench slurm pyxis #101

Merged
merged 3 commits on Mar 9, 2025
Changes from 2 commits
8 changes: 7 additions & 1 deletion network/benchmarks/README.md
@@ -79,9 +79,15 @@ Here is a simple all-reduce benchmark that you can use to quickly measure the th

[all_reduce_bench.py](all_reduce_bench.py)

On CSPs that have enabled the [SLURM Pyxis Container Plugin](https://github.com/NVIDIA/pyxis), such as CoreWeave, Crusoe, AWS, Oracle, Azure, GCP, etc., `all_reduce_bench.py` can easily be run and reproduced via the following command:
```bash
sbatch -n <num_of_nodes> ./all_reduce_bench_pyxis.sbatch
```
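For example, since the sbatch script launches one task per node (`--ntasks-per-node=1`), submitting the benchmark to 4 nodes looks like:

```bash
sbatch -n 4 ./all_reduce_bench_pyxis.sbatch
```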

Benchmarking at least 4 nodes is usually recommended, but, of course, if you already have access to all the nodes you will be using during training, benchmark using all of them.

To run it on 4 nodes:

If you do not have access to a pyxis SLURM environment, to run it on 4 nodes:

```
GPUS_PER_NODE=8
...
```
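The rest of that snippet is collapsed in this diff view. As a minimal sketch of what such a direct (non-Pyxis) launch on 4 nodes might look like, mirroring the torchrun arguments from the sbatch script added below (the collapsed portion of the README may differ):

```bash
GPUS_PER_NODE=8
NNODES=4
MASTER_ADDR=<first-node-hostname>   # placeholder: hostname of the rendezvous node
MASTER_PORT=6000

# run this same command on every node
python -u -m torch.distributed.run \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d \
    all_reduce_bench.py
```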
24 changes: 24 additions & 0 deletions network/benchmarks/all_reduce_bench_pyxis.sbatch
@@ -0,0 +1,24 @@
#!/bin/bash
#SBATCH --job-name=all_reduce_bench_pyxis
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=01:00:00

# Set up environment variables for torchrun
GPUS_PER_NODE=8
# derive the node count from the actual SLURM allocation so that
# `sbatch -n <num_of_nodes>` works as documented in the README
NNODES=$SLURM_NNODES
# rendezvous endpoint: the first node in the allocation
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000

srun --container-image=nvcr.io#nvidia/pytorch:25.02-py3 \
--container-mounts=$PWD:/workspace \
python -u -m torch.distributed.run \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--rdzv_endpoint ${MASTER_ADDR}:${MASTER_PORT} \
--rdzv_backend c10d \
--max_restarts 0 \
--role `hostname -s`':' \
--tee 3 \
all_reduce_bench.py
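
A hedged usage sketch (assuming the job is submitted from the `network/benchmarks` directory, so that `all_reduce_bench.py` is inside the mounted `$PWD`, and using a hypothetical job id for the output file):

```bash
cd network/benchmarks
sbatch -n 4 ./all_reduce_bench_pyxis.sbatch   # submit on 4 nodes
squeue -u $USER                               # check that the job is running
tail -f slurm-<jobid>.out                     # follow the benchmark output
```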