
[Issue]: GPU Memory Access Fault when working with tensors on non-zero index torch GPU #27

@asunderwood

Description

Problem Description

Using tritonBLAS's matmul on a multi-GPU system results in a GPU Memory Access Fault if the operation runs on any GPU other than torch's GPU0 (torch.device('cuda:0')).

Inspecting rocm-smi --showpidgpus while the program is running shows that it opens a handle to GPU0 even when no part of the program targets that GPU.

This is based on torch's view of the GPUs in the system: artificially restricting the program's access to GPUs via, say, HIP_VISIBLE_DEVICES remaps torch's device indices onto the device list provided by the environment variable. The issue therefore does not occur when using 'cuda:0' from torch's perspective, even if that index does not point to physical GPU0 in the node; it only occurs when using any device other than 'cuda:0' in torch.
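
A minimal sketch of that remapping, assuming a multi-GPU ROCm node (the check_devices.py name is just for illustration): torch always numbers whatever HIP_VISIBLE_DEVICES exposes starting from 0.

# Run as e.g. HIP_VISIBLE_DEVICES=1 python check_devices.py
import os
import torch

print('HIP_VISIBLE_DEVICES:', os.environ.get('HIP_VISIBLE_DEVICES', '<unset>'))
print('torch sees', torch.cuda.device_count(), 'device(s)')
for idx in range(torch.cuda.device_count()):
    # torch numbers the visible devices from 0 regardless of which
    # physical GPUs the environment variable actually exposes.
    print(f'cuda:{idx} ->', torch.cuda.get_device_name(idx))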

Operating System

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

CPU

Intel(R) Xeon(R) Platinum 8480C

GPU

Multi-GPU AMD Instinct MI300X

ROCm Version

ROCm 7.0.0

ROCm Component

No response

Steps to Reproduce

These steps assume the README's install instructions were followed to get a working copy of both tritonBLAS and torch in the chosen Python environment/Docker container.

Reprex script:

import time
import torch
import tritonblas


TARGET_DEVICE = torch.device('cuda:1')
HOST_DEVICE = torch.device('cpu')


def tblas_matmul_op(a: torch.Tensor,
                    b: torch.Tensor,
                    c: torch.Tensor) -> torch.Tensor:
    a_gpu = a.to(TARGET_DEVICE)
    b_gpu = b.to(TARGET_DEVICE)
    c_gpu = c.to(TARGET_DEVICE)

    # tritonblas.matmul computes arg0 @ arg1 + arg2.
    result_gpu = tritonblas.matmul(a_gpu, b_gpu, c_gpu)

    return result_gpu.to(HOST_DEVICE)


torch.manual_seed(42)

a = torch.rand(8192, 8192, dtype=torch.half)
b = torch.rand(8192, 8192, dtype=torch.half)
c = torch.rand(8192, 8192, dtype=torch.half)

tblas_start = time.perf_counter()
rslt_tblas = tblas_matmul_op(a, b, c)
tblas_end = time.perf_counter()

print('~~~ Results ~~~')
print(f'Tblas: {tblas_end - tblas_start} sec')

This version of the script will cause the following error:

Memory access fault by GPU node-2 (Agent handle: 0xXXXXXXXXXXXX) on address 0xXXXXXXXXXXXX. Reason: Unknown.
Aborted (core dumped)

Changing the device index in the TARGET_DEVICE = torch.device('cuda:1') line to 'cuda:0' and rerunning allows the script to complete without issue.

Note that removing the device index from that line entirely and selecting just 'cuda' will usually pick 'cuda:0', but if the environment last ran the program against a different, specific GPU index it will sometimes reuse that GPU. For reproducing the issue, the exact GPU index should therefore always be specified.
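
A minimal per-device sweep, assuming the same tritonblas/torch setup as the reprex above (smaller matrices to keep it quick), makes the cuda:0-only behaviour explicit:

import torch
import tritonblas

for idx in range(torch.cuda.device_count()):
    dev = torch.device(f'cuda:{idx}')
    a = torch.rand(1024, 1024, dtype=torch.half, device=dev)
    b = torch.rand(1024, 1024, dtype=torch.half, device=dev)
    c = torch.rand(1024, 1024, dtype=torch.half, device=dev)
    print(f'Running tritonblas.matmul on {dev} ...')
    tritonblas.matmul(a, b, c)
    torch.cuda.synchronize(dev)
    # On the affected setup only cuda:0 reaches this line; any other
    # index aborts with the memory access fault shown above.
    print(f'{dev}: OK')

On the reported system this should print cuda:0: OK and then abort with the memory access fault as soon as it reaches cuda:1.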

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
