Problem Description
Using tritonBLAS's matmul in a multi-GPU system results in a GPU memory access fault whenever the operation runs on any GPU other than the one torch identifies as device 0 (torch.device('cuda:0')).
Inspecting rocm-smi --showpidgpus while the program is running shows that the process opens a handle to GPU0 even when no part of the program targets that GPU.
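For reference, a minimal monitoring sketch of that inspection: it polls rocm-smi --showpidgpus while the reprex runs in another shell. The one-second polling interval is an arbitrary choice for illustration, not taken from the original workflow.

import subprocess
import time

# Poll rocm-smi once per second to see which GPUs the running process
# has opened handles on.
while True:
    result = subprocess.run(['rocm-smi', '--showpidgpus'],
                            capture_output=True, text=True)
    print(result.stdout)
    time.sleep(1.0)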
The device indices above follow torch's view of the GPUs in the system: artificially restricting the program's visible GPUs, for example via HIP_VISIBLE_DEVICES, rescales torch's device indices to the device list given by that environment variable. Consequently, the issue does not occur when using 'cuda:0' from torch's perspective, even if that index does not point at physical GPU0 in the node; it only occurs when any device other than torch's 'cuda:0' is used.
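As a sketch of that remapping, assuming HIP_VISIBLE_DEVICES takes effect before torch initializes the HIP runtime (physical GPU 1 is chosen only for illustration):

import os

# Must be set before torch initializes HIP so that only physical GPU 1 is enumerated.
os.environ['HIP_VISIBLE_DEVICES'] = '1'

import torch

# torch now sees exactly one device, and its 'cuda:0' maps to physical GPU 1,
# which is the case described above where the fault does not occur.
print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))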
Operating System
Ubuntu 22.04.3 LTS (Jammy Jellyfish)
CPU
Intel(R) Xeon(R) Platinum 8480C
GPU
Multi-GPU AMD Instinct MI300X
ROCm Version
ROCm 7.0.0
ROCm Component
No response
Steps to Reproduce
Assume the README's install instructions were followed so that both tritonBLAS and torch are available in the chosen Python environment or Docker container.
Reprex script:
import time

import torch
import tritonblas

TARGET_DEVICE = torch.device('cuda:1')
HOST_DEVICE = torch.device('cpu')


def tblas_matmul_op(a: torch.Tensor,
                    b: torch.Tensor,
                    c: torch.Tensor) -> torch.Tensor:
    a_gpu = a.to(TARGET_DEVICE)
    b_gpu = b.to(TARGET_DEVICE)
    c_gpu = c.to(TARGET_DEVICE)
    # It's arg0 @ arg1 + arg2 in tritonblas's matmul.
    result_gpu = tritonblas.matmul(a_gpu, b_gpu, c_gpu)
    return result_gpu.to(HOST_DEVICE)


torch.manual_seed(42)

a = torch.rand(8192, 8192, dtype=torch.half)
b = torch.rand(8192, 8192, dtype=torch.half)
c = torch.rand(8192, 8192, dtype=torch.half)

tblas_start = time.perf_counter()
rslt_tblas = tblas_matmul_op(a, b, c)
tblas_end = time.perf_counter()

print('~~~ Results ~~~')
print(f'Tblas: {tblas_end - tblas_start} sec')

This version of the script will cause the following error:
Memory access fault by GPU node-2 (Agent handle: 0xXXXXXXXXXXXX) on address 0xXXXXXXXXXXXX. Reason: Unknown.
Aborted (core dumped)
Changing the device index on the TARGET_DEVICE = torch.device('cuda:1') line to 'cuda:0' and rerunning lets the script run without issue.
Note that removing the device index from that line and selecting just 'cuda' will usually resolve to 'cuda:0', but if the environment last ran the program against a different, specific GPU index it will sometimes reuse that GPU. To reproduce the issue reliably, the exact GPU index should always be specified.
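For anyone triaging this, a small diagnostic sketch (not part of the reprex above) that prints which device torch currently treats as active; the torch.cuda.set_device call is included only as a variation to experiment with, not as a confirmed workaround:

import torch

TARGET_DEVICE = torch.device('cuda:1')

# Device torch considers current before any explicit selection.
print('current device index:', torch.cuda.current_device())
print('target device name:', torch.cuda.get_device_name(TARGET_DEVICE))

# Variation to try: make the target device current explicitly before
# calling tritonblas.matmul; whether this changes the behaviour is untested here.
torch.cuda.set_device(TARGET_DEVICE)
print('current device index:', torch.cuda.current_device())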
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response