This repo presents a simple tutorial for distributed training on an HPC cluster.
The `single_node.sh` script trains the network with two GPUs on a single node,
while `multiple_nodes.sh` trains the same network with two GPUs on each node.
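Both scripts launch training through `torchrun`. As a minimal sketch of the single-node case (assuming the training entry point is a file called `train.py`; the actual filename in this repo may differ):

```bash
# One node, two GPUs: torchrun starts one worker process per GPU.
# train.py is a placeholder for the repo's actual training script.
torchrun --nnodes=1 --nproc_per_node=2 train.py
```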
- Crucial parameters in `torchrun`:
  - `--nnodes`: the number of nodes
  - `--nproc_per_node`: the number of GPUs per node
- Crucial parameters in the `multiple_nodes.sh` script (see the sketch after this list):
  - `#BSUB -n 16`: the total number of CPU cores. Request at least 4 cores for each GPU.
  - `#BSUB -R "span[ptile=8]"`: the number of cores on each node.
  - Note that we don't explicitly ask for a number of nodes; the number of nodes is the total number of cores divided by the number of cores per node (here 16 / 8 = 2 nodes).
- Unlike `single_node.sh`, the directive `#BSUB -R "span[hosts=1]"` is omitted, since we want the job to span more than one host.
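As a rough illustration of how these directives fit together, here is a hypothetical header for the two-node job; the exact contents of `multiple_nodes.sh` in this repo (queue name, GPU request syntax, and how `torchrun` is started on each node) may differ:

```bash
#!/bin/bash
# Hypothetical LSF job header for the two-node case discussed above.
#BSUB -n 16                  # 16 CPU cores in total (at least 4 per GPU)
#BSUB -R "span[ptile=8]"     # 8 cores per node, so 16/8 = 2 nodes
#BSUB -gpu "num=2"           # 2 GPUs per node (illustrative; GPU request syntax varies by cluster)
# No 'span[hosts=1]' here: the job is allowed to spread over several hosts.
# Each node then runs torchrun with --nnodes=2 --nproc_per_node=2,
# so 2 nodes x 2 GPUs = 4 worker processes take part in training.
```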
To set up the environment, load the Python and CUDA modules, create a virtual environment, and install the dependencies:

```bash
module load python3/3.10.12
module load cuda/11.8
python3 -m venv ~/dist-training
source ~/dist-training/bin/activate
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
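Once the environment is ready, the job scripts can be submitted to the scheduler. Assuming the cluster uses LSF (implied by the `#BSUB` directives above):

```bash
# Submit either script to LSF; the scheduler reads the #BSUB directives from the file.
bsub < single_node.sh
bsub < multiple_nodes.sh
```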