
[Issue]: GPU Memory Access Fault when working with tensors on non-zero index torch GPU #27

@asunderwood

Description

Problem Description

Using tritonBLAS's matmul on a multi-GPU system results in a GPU Memory Access Fault if the operation runs on any GPU other than torch's GPU0 (torch.device('cuda:0')).

Inspecting rocm-smi --showpidgpus while the program is running shows that it opens a handle to GPU0 even when no part of the program targets that GPU.

This is based on torch's view of the GPUs in the system: artificially restricting the program's access to GPUs via, say, HIP_VISIBLE_DEVICES remaps torch's device indices onto the device list provided by the environment variable. The issue therefore does not occur when using 'cuda:0' from torch's perspective, even if that index does not point to physical GPU0 in the node; it only occurs when using any device other than 'cuda:0' in torch.
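
A minimal sketch of that remapping, assuming a multi-GPU ROCm node (the check_devices.py name is just for illustration): torch always numbers whatever HIP_VISIBLE_DEVICES exposes starting from 0.

# Run as e.g. HIP_VISIBLE_DEVICES=1 python check_devices.py
import os
import torch

print('HIP_VISIBLE_DEVICES:', os.environ.get('HIP_VISIBLE_DEVICES', '<unset>'))
print('torch sees', torch.cuda.device_count(), 'device(s)')
for idx in range(torch.cuda.device_count()):
    # torch numbers the visible devices from 0 regardless of which
    # physical GPUs the environment variable actually exposes.
    print(f'cuda:{idx} ->', torch.cuda.get_device_name(idx))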

Operating System

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

CPU

Intel(R) Xeon(R) Platinum 8480C

GPU

Multi-GPU AMD Instinct MI300X

ROCm Version

ROCm 7.0.0

ROCm Component

No response

Steps to Reproduce

These steps assume the README's install instructions were followed to get a working copy of both tritonBLAS and torch in the chosen Python environment/Docker container.

Reprex script:

import time
import torch
import tritonblas


TARGET_DEVICE = torch.device('cuda:1')
HOST_DEVICE = torch.device('cpu')


def tblas_matmul_op(a: torch.Tensor,
                    b: torch.Tensor,
                    c: torch.Tensor) -> torch.Tensor:
    a_gpu = a.to(TARGET_DEVICE)
    b_gpu = b.to(TARGET_DEVICE)
    c_gpu = c.to(TARGET_DEVICE)

    # tritonblas.matmul computes arg0 @ arg1 + arg2.
    result_gpu = tritonblas.matmul(a_gpu, b_gpu, c_gpu)

    return result_gpu.to(HOST_DEVICE)


torch.manual_seed(42)

a = torch.rand(8192, 8192, dtype=torch.half)
b = torch.rand(8192, 8192, dtype=torch.half)
c = torch.rand(8192, 8192, dtype=torch.half)

tblas_start = time.perf_counter()
rslt_tblas = tblas_matmul_op(a, b, c)
tblas_end = time.perf_counter()

print('~~~ Results ~~~')
print(f'Tblas: {tblas_end - tblas_start} sec')

This version of the script will cause the following error:

Memory access fault by GPU node-2 (Agent handle: 0xXXXXXXXXXXXX) on address 0xXXXXXXXXXXXX. Reason: Unknown.
Aborted (core dumped)

Changing the device index in the TARGET_DEVICE = torch.device('cuda:1') line to 'cuda:0' and rerunning allows the script to complete without issue.

Note that removing the device index from that line entirely and selecting just 'cuda' will usually pick 'cuda:0', but if the environment last ran the program against a different, specific GPU index it will sometimes reuse that GPU. For reproducing the issue, the exact GPU index should therefore always be specified.
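
A minimal per-device sweep, assuming the same tritonblas/torch setup as the reprex above (smaller matrices to keep it quick), makes the cuda:0-only behaviour explicit:

import torch
import tritonblas

for idx in range(torch.cuda.device_count()):
    dev = torch.device(f'cuda:{idx}')
    a = torch.rand(1024, 1024, dtype=torch.half, device=dev)
    b = torch.rand(1024, 1024, dtype=torch.half, device=dev)
    c = torch.rand(1024, 1024, dtype=torch.half, device=dev)
    print(f'Running tritonblas.matmul on {dev} ...')
    tritonblas.matmul(a, b, c)
    torch.cuda.synchronize(dev)
    # On the affected setup only cuda:0 reaches this line; any other
    # index aborts with the memory access fault shown above.
    print(f'{dev}: OK')

On the reported system this should print cuda:0: OK and then abort with the memory access fault as soon as it reaches cuda:1.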

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
