[cifar ds training]: Set cuda device during initialization of distributed backend. #931

jagadish-amd · 2024-10-15T19:16:31Z

This commit is needed so that torch.cuda.current_stream() does not resolve to GPU 0's compute stream across all GPUs during initialization.
The RunningAvgSamplesPerSec performance metric improves on a multi-GPU node; this was tested on AMD GPUs with the ROCm stack.
Without this commit, GPU 0 takes on more load than the other GPUs as the number of GPUs increases.
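
To illustrate the idea, here is a minimal sketch of binding each process to the device that matches its local rank. It is not the code from this PR: the helper names `resolve_local_rank`, `bind_device`, and `init_distributed_with_device` are hypothetical, the sketch assumes the launcher exports `LOCAL_RANK`, and the order of the calls in the last helper is an assumption as well.

```python
import os


def resolve_local_rank(env=None, default=0):
    """Return this process's local rank.

    Assumes the launcher exports LOCAL_RANK; falls back to `default` otherwise.
    """
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", default))


def bind_device(set_device, env=None):
    """Select the device whose index equals the local rank and return that index.

    `set_device` is any callable that takes a device index, for example
    torch.cuda.set_device or DeepSpeed's get_accelerator().set_device.
    """
    local_rank = resolve_local_rank(env)
    set_device(local_rank)
    return local_rank


def init_distributed_with_device():
    """Hypothetical helper: initialize the distributed backend, then pin the device.

    The imports are local so that this module loads without DeepSpeed installed.
    """
    import deepspeed
    from deepspeed.accelerator import get_accelerator

    deepspeed.init_distributed()
    return bind_device(get_accelerator().set_device)
```

Once a process has selected its own device, the current stream it queries belongs to that device rather than to device 0, which is the imbalance described above. Passing the setter in as a callable keeps the sketch independent of any one backend.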

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>

jagadish-amd · 2024-10-15T19:21:52Z

ping @jeffdaily

training/cifar/cifar10_deepspeed.py

jagadish-amd · 2024-10-17T19:46:07Z

@tjruwase can you please review / merge ?

tjruwase · 2024-10-29T13:47:39Z

> @tjruwase can you please review / merge ?

@jagadish-amd, apologies for the delay. Done.

Set cuda device during initialization of distributed backend.

3d679a5

The commit is needed to avoid GPU 0 being set as the compute stream via torch.cuda.current_stream() during initialization across all GPUs.
Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>

jagadish-amd requested review from tjruwase and awan-10 as code owners October 15, 2024 19:16

jeffdaily approved these changes Oct 15, 2024

tjruwase reviewed Oct 15, 2024

training/cifar/cifar10_deepspeed.py (outdated, resolved)

Use device-agnostic accelerator API.

7f91988

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>

tjruwase approved these changes Oct 29, 2024

tjruwase merged commit 130fb58 into microsoft:master Oct 29, 2024
2 checks passed

jagadish-amd deleted the cifar-set_device branch October 29, 2024 18:20
