[cifar ds training]: Set cuda device during initialization of distributed backend. (#931)

* Set cuda device during initialization of distributed backend.

This change is needed so that GPU 0 is not selected as the compute device on
every rank; without it, torch.cuda.current_stream() during initialization
resolves to GPU 0's stream across all GPUs.

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>

* Use device-agnostic accelerator API.

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>

---------

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
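
For context, a minimal sketch of the pattern this commit applies, assuming the
standard DeepSpeed import path for the accelerator API and that LOCAL_RANK is
exported per process by the launcher:

import os

import deepspeed
from deepspeed.accelerator import get_accelerator

# Initialize the distributed backend first.
deepspeed.init_distributed()

# Bind this process to its own device. Without this call, every rank's
# default device is 0, so torch.cuda.current_stream() would hand back
# GPU 0's stream on all ranks.
local_rank = int(os.environ["LOCAL_RANK"])
get_accelerator().set_device(local_rank)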
jagadish-amd authored Oct 29, 2024
1 parent f73a6ed commit 130fb58
Showing 1 changed file with 3 additions and 0 deletions: training/cifar/cifar10_deepspeed.py
@@ -1,4 +1,5 @@
 import argparse
+import os
 
 import deepspeed
 import torch
@@ -279,6 +280,8 @@ def test(model_engine, testset, local_device, target_dtype, test_batch_size=4):
 def main(args):
     # Initialize DeepSpeed distributed backend.
     deepspeed.init_distributed()
+    _local_rank = int(os.environ.get("LOCAL_RANK"))
+    get_accelerator().set_device(_local_rank)
 
     ########################################################################
     # Step1. Data Preparation.
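One note on the new line, as a sketch (the defensive default below is
illustrative and not part of the commit): os.environ.get("LOCAL_RANK") returns
None when the variable is unset, and int(None) raises TypeError, so the script
as committed assumes it runs under a launcher (e.g. deepspeed or torchrun)
that exports LOCAL_RANK for each process.

import os

# Hypothetical defensive variant: fall back to rank 0 when launched
# as a single process without a distributed launcher.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))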
