[Issue]: tried to use nn.DataParallel but it crashed #1421

Open
jdgh000 opened this issue Nov 13, 2024 · 11 comments

jdgh000 commented Nov 13, 2024

Problem Description

Ran the following example with little modification, but it failed during the run:
https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
The crash occurs when I apply nn.DataParallel to the model; without it, the code works:

model = nn.DataParallel(model)

Code:

import sys
sys.path.append('..')
from classes import *  # provides the Model class (sketched below)

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    DEBUG = 0
    DEBUGL2 = 0

    def __init__(self, size, length):

        if self.DEBUG:
            print("GG: RandomDataset.__init__(size=", size, "length: ", length, ")")

        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):

        if self.DEBUGL2:
            print("GG: RandomDataset.__getitem__(index=", index, ")")

        return self.data[index]

    def __len__(self):

        if self.DEBUG:
            print("GG: RandomDataset.__len__() returning self.len: ", self.len)

        return len(self.data)

# Parameters and DataLoaders
input_size = 1000
output_size = 10

batch_size = 1000
data_size = 60000

if not torch.cuda.is_available():
    print("GPU is not detected.")
    sys.exit(1)

device = torch.device("cuda:0")

# Create random data set: input size = 1k, data_size = 60k, batch_size: 1k.

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

model = Model(input_size, output_size)

if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
  model = nn.DataParallel(model)

model.to(device)

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", input.size(), "output_size", output.size())

[root@u488 dataparallellism]$ sudo python3 ex1.py
Let's use 8 GPUs!
Traceback (most recent call last):
  File "/root/pytorch/dataparallellism/ex1.py", line 41, in <module>
    output = model(input)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 192, in forward
    replicas = self.replicate(self.module, self.device_ids[: len(inputs)])
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 199, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 134, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 103, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "/usr/local/lib64/python3.9/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/_functions.py", line 22, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/comm.py", line 67, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
[root@u488 dataparallellism]$ nano -w "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py"
[root@u488 dataparallellism]$ cat /opt/rocm/.info/version
6.2.0-66

Operating System

rhel9

CPU

9500hx ryzen

GPU

mi250

ROCm Version

ROCm 6.2.0

ROCm Component

rccl

Steps to Reproduce

Run the example code with nn.DataParallel (actual code pasted in the problem description):

https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@harkgill-amd

Hi @jdgh000, I was able to reproduce your issue and have opened an internal ticket for further investigation.


zichguan-amd commented Nov 13, 2024

Hi @jdgh000, it looks like you are running on a laptop with integrated graphics; you can check whether rocminfo shows two graphics devices. Since integrated graphics are not supported, you can bypass this by setting the environment variable HIP_VISIBLE_DEVICES to use only the discrete GPU, as documented here: https://rocmdocs.amd.com/projects/HIP/en/develop/how-to/debugging.html#making-device-visible
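
For example, assuming the discrete GPU enumerates as device 0:

HIP_VISIBLE_DEVICES=0 python3 ex1.py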


jdgh000 commented Nov 13, 2024

Hi @jdgh000, I was able to reproduce your issue and have opened an internal ticket for further investigation.

Thanks, let me know.

@harkgill-amd

As @zichguan-amd mentioned, this has to do with the example being run on your APU rather than a dedicated graphics card. Correct me if I'm wrong, but I believe you're running on a 5900HX. Could you try running directly on your dGPU by adding these lines at the top of your Python script?

import os
os.environ['HIP_VISIBLE_DEVICES'] = '0'  # set before torch initializes the GPU


jdgh000 commented Nov 14, 2024

This is not an APU for sure; the CPU model I put in is wrong. The GPU is an MI250. Since the CPU model is not that important, I just typed the suggestion.


jdgh000 commented Nov 14, 2024

Name: AMD EPYC 7763 64-Core Processor
Name: AMD EPYC 7763 64-Core Processor
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a
Name: gfx90a

@zichguan-amd

In that case, can you run with NCCL_DEBUG=INFO or NCCL_DEBUG=TRACE for details, as suggested by the error message?


jdgh000 commented Nov 14, 2024

I saw the prompt and tried a few times, but it does not seem to output anything more than without it, with either TRACE or INFO:

sudo mkdir log ; NCCL_DEBUG=INFO sudo python3 ex1.py 2>&1 | sudo tee log/ex1-NCCL_DEBUG.INFO.log
mkdir: cannot create directory ‘log’: File exists
Let's use 8 GPUs!
Traceback (most recent call last):
  File "/root/pytorch/dataparallellism/1-dataparallellism/ex1.py", line 41, in <module>
    output = model(input)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 192, in forward
    replicas = self.replicate(self.module, self.device_ids[: len(inputs)])
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 199, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 134, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/replicate.py", line 103, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "/usr/local/lib64/python3.9/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/_functions.py", line 22, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/parallel/comm.py", line 67, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)




jdgh000 commented Nov 14, 2024

It seems to be failing in one of these:
/usr/local/lib64/python3.9/site-packages/torch/_C/__init__.pyi:10823:def _broadcast_coalesced(
/usr/local/lib64/python3.9/site-packages/torch/_C/_distributed_c10d.pyi:619:def _broadcast_coalesced(
but I can only see the function prototypes, not the bodies, so I cannot see what is going on in these calls.

@zichguan-amd

With sudo you need to use -E to preserve the environment variables. Also, can you upgrade to the latest ROCm 6.2.4 and PyTorch 2.5.1 and see if that fixes it?
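
For example, something like this should carry the variable through (assuming the sudoers policy allows -E):

NCCL_DEBUG=INFO sudo -E python3 ex1.py 2>&1 | tee log/ex1-NCCL_DEBUG.INFO.log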


jdgh000 commented Nov 15, 2024

It is already torch 2.5.1 and ROCm 6.2.4:
torch 2.5.1+rocm6.2
torchaudio 2.5.1+rocm6.2
torchvision 0.20.1+rocm6.2
