PyTorch DDP Distributed Training

Scalable image classification training using PyTorch Distributed Data Parallel (DDP). Works on Windows (CPU/GPU) and Linux/macOS. Includes TensorBoard logging and a simple results CSV for tracking speed and accuracy.

Features

  • DDP launcher for multi-GPU training, with a CPU fallback using the gloo backend (sketched below)
  • Minimal CNN model + CIFAR-10 loader
  • Mixed precision (when CUDA is available)
  • TensorBoard + CSV logging under benchmarks/
  • Windows launch_local.bat and Linux/macOS launch_local.sh
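
The setup behind these features fits in a few lines. Below is a minimal sketch, not the repo's exact code: it assumes the backend is chosen from CUDA availability (nccl on GPU, gloo on CPU) and that mixed precision is gated the same way; the function names are illustrative.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(rank, world_size):
    # nccl for GPUs; gloo also works on CPU-only machines, including Windows.
    # MASTER_ADDR / MASTER_PORT are expected in the environment (the launcher sets them).
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    if torch.cuda.is_available():
        torch.cuda.set_device(rank)
        return torch.device("cuda", rank)
    return torch.device("cpu")

def wrap_model(model, device):
    model = model.to(device)
    # device_ids only applies to CUDA; the gloo/CPU path must omit it.
    if device.type == "cuda":
        return DDP(model, device_ids=[device.index])
    return DDP(model)

# Mixed precision only helps with CUDA, so the GradScaler is disabled otherwise.
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())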

Quickstart

1) Environment

Windows (CMD):

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Linux/Mac:

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

2) Single-process debug (works anywhere)

python scripts/train_ddp.py --epochs 2 --batch-size 128
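
A minimal sketch of how scripts/train_ddp.py might parse these flags; only --epochs and --batch-size are taken from the command above, and the defaults shown are assumptions:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="DDP image-classification training")
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--batch-size", type=int, default=128)
    return parser.parse_args()

Run this way, with no launcher involved, the script sees a single process and can skip the process-group setup entirely, which is why this mode works on any machine.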

3) Multi-GPU / multi-process

Windows (CMD):

scripts\launch_local.bat

Linux/Mac:

bash scripts/launch_local.sh

By default, the launch scripts start 2 processes. Change NUM_PROCS in the script to match the number of GPUs on your machine.
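
NUM_PROCS corresponds to the launcher's --nproc_per_node, and the launcher hands each worker process its identity through environment variables. A minimal sketch of how the training script might read them (these variable names are the standard ones torch.distributed.run exports; the helper name is an assumption):

import os

def read_launcher_env():
    # Exported by torch.distributed.run / torchrun for every worker process.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # device index on this machine
    rank = int(os.environ.get("RANK", 0))              # global process index
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes
    return local_rank, rank, world_size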

4) View logs

tensorboard --logdir benchmarks/logs

5) Results

  • Final model: benchmarks/model_final.pt
  • CSV metrics: benchmarks/results.csv (how both artifacts are written is sketched below)
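
A sketch of how both artifacts might be written at the end of a run. Two DDP details matter here: only rank 0 should touch the filesystem, and the state dict to save is the one on model.module, not on the DDP wrapper. The CSV column names below are illustrative, not necessarily the ones used in results.csv.

import csv
import torch

def save_artifacts(ddp_model, rank, metrics,
                   csv_path="benchmarks/results.csv",
                   model_path="benchmarks/model_final.pt"):
    if rank != 0:
        return  # avoid every process writing the same files
    # ddp_model.module is the underlying nn.Module inside the DDP wrapper.
    torch.save(ddp_model.module.state_dict(), model_path)
    with open(csv_path, "a", newline="") as f:
        # Hypothetical columns: epoch, train_loss, val_acc, imgs_per_sec.
        csv.writer(f).writerow([metrics["epoch"], metrics["train_loss"],
                                metrics["val_acc"], metrics["imgs_per_sec"]])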

Re-creating the Project from Scratch

  1. Create a repo, set up venv, and install deps:

    python -m venv .venv && source .venv/bin/activate
    pip install torch torchvision tensorboard numpy
  2. Add a model (src/model.py), a dataset loader (src/dataset.py), and the DDP training script (scripts/train_ddp.py); a compact skeleton is sketched after this list.

  3. Launch DDP with:

    python -m torch.distributed.run --nproc_per_node=2 scripts/train_ddp.py --epochs 2
  4. Log metrics (TensorBoard SummaryWriter + CSV).

  5. Save artifacts under benchmarks/ and push.
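
Putting steps 2–4 together, a compact version of the training script could look like the following. This is a sketch under assumptions: the module paths come from step 2, but the names SimpleCNN and get_cifar10, the optimizer, and the loop details are illustrative, and it expects to be started by torch.distributed.run so the rendezvous environment variables are set.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torch.utils.tensorboard import SummaryWriter

from src.model import SimpleCNN        # hypothetical class name
from src.dataset import get_cifar10    # hypothetical helper returning a torch Dataset

def main(epochs=2, batch_size=128):
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")

    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cpu")

    train_set = get_cifar10(train=True)
    sampler = DistributedSampler(train_set, num_replicas=world_size, rank=rank)
    loader = DataLoader(train_set, batch_size=batch_size, sampler=sampler, num_workers=2)

    model = DDP(SimpleCNN().to(device),
                device_ids=[local_rank] if device.type == "cuda" else None)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    writer = SummaryWriter("benchmarks/logs") if rank == 0 else None

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # ensure a different shuffle each epoch
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()       # gradients are all-reduced across processes here
            optimizer.step()
        if writer is not None:
            writer.add_scalar("train/loss", loss.item(), epoch)

    if rank == 0:
        torch.save(model.module.state_dict(), "benchmarks/model_final.pt")
        if writer is not None:
            writer.close()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Started with the torch.distributed.run command from step 3, each worker runs main() with its own RANK and LOCAL_RANK.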
