PyTorch DDP Distributed Training

Scalable image classification training using PyTorch Distributed Data Parallel (DDP). Works on Windows (CPU/GPU) and Linux/macOS. Includes TensorBoard logs and a simple results CSV for speed/accuracy tracking.

Features

DDP launcher for multi-GPU (and CPU fallback with gloo backend)
Minimal CNN model + CIFAR-10 loader
Mixed precision (if CUDA available)
TensorBoard + CSV logging under benchmarks/
Windows launch_local.bat and Linux/Mac launch_local.sh

Quickstart

1) Env

Windows (CMD):

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Linux/Mac:

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

2) Single-process debug (works anywhere)

python scripts/train_ddp.py --epochs 2 --batch-size 128

3) Multi-GPU / multi-process

Windows (CMD):

scripts\launch_local.bat

Linux/Mac:

bash scripts/launch_local.sh

By default, the launch scripts use 2 processes. Change NUM_PROCS in the script to match your GPUs.

4) View logs

tensorboard --logdir benchmarks/logs

5) Results

Final model: benchmarks/model_final.pt
CSV metrics: benchmarks/results.csv

Re-creating from Scratch (for README tutorial)

Create a repo, set up venv, and install deps:

python -m venv .venv && source .venv/bin/activate
pip install torch torchvision tensorboard numpy

Add a model (src/model.py), dataset (src/dataset.py), and DDP script (scripts/train_ddp.py).

Launch DDP with:

python -m torch.distributed.run --nproc_per_node=2 scripts/train_ddp.py --epochs 2

Log metrics (TensorBoard SummaryWriter + CSV).
Save artifacts under benchmarks/ and push.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
docker		docker
docs		docs
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

PyTorch DDP Distributed Training

Features

Quickstart

1) Env

2) Single-process debug (works anywhere)

3) Multi-GPU / multi-process

4) View logs

5) Results

Re-creating from Scratch (for README tutorial)

About

Uh oh!

Releases

Packages

Languages

License

ASHIKalip/pytorch-ddp-distributed-training

Folders and files

Latest commit

History

Repository files navigation

PyTorch DDP Distributed Training

Features

Quickstart

1) Env

2) Single-process debug (works anywhere)

3) Multi-GPU / multi-process

4) View logs

5) Results

Re-creating from Scratch (for README tutorial)

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages