
@asunderwood

Motivation

There are two bugs in #63:

  • Small-K operations cause Triton errors, which surface as HIP runtime errors.
  • Programs that use tritonBLAS together with torch.compile could hit GPU RuntimeErrors caused by concurrent GPU initialization.

This PR fixes both.

Technical Details

Small-K Bug

This bug is fixed by changing the tl.assume in gemm_context.py to accept 0 as a valid value for num_k_tiles. The situation arises when K is smaller than the BLK_K size used in the kernel, so num_k_tiles evaluates to zero. The code handles this case correctly, but the tl.assume raised an error.
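
As a rough sketch of the kind of change involved (the names num_k_tiles and BLK_K follow the description above, while the tile-count formula and the kernel body are illustrative placeholders, not the actual gemm_context.py code), the hint only needs to be relaxed from strictly positive to non-negative:

import triton
import triton.language as tl

@triton.jit
def k_tile_sum(x_ptr, out_ptr, K, BLK_K: tl.constexpr):
    num_k_tiles = K // BLK_K  # evaluates to 0 whenever K < BLK_K
    # Before the fix the hint excluded zero (tl.assume(num_k_tiles > 0)),
    # which raised an error for small K even though the loop below simply
    # runs zero iterations in that case.
    tl.assume(num_k_tiles >= 0)
    acc = tl.zeros((BLK_K,), dtype=tl.float32)
    for kt in range(num_k_tiles):
        offs = kt * BLK_K + tl.arange(0, BLK_K)
        acc += tl.load(x_ptr + offs)
    tl.store(out_ptr + tl.arange(0, BLK_K), acc)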

GPU RuntimeErrors

Some tensors used as global buffers for the StreamK kernels were initialized at module scope in matmul.py, which meant they were created as soon as the parent program imported the module. This could cause problems in several situations, but most notably it caused intermittent failures with torch.compile when Inductor uses multiple threads for its graph passes and each thread tried to initialize GPU resources at the same time.

This is fixed by moving to a lazy-initialization approach: the tensors are wrapped in a function that is called when a StreamK kernel is invoked, so the GPU is no longer initialized at module import.

Additionally, the locks are now initialized on the GPU where the active StreamK op's tensors reside, whereas previously they were created on the default GPU returned by torch. This may fix #27, but further testing is required; regardless, it is a useful improvement for multi-GPU and non-default-GPU use cases.
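
A minimal sketch of the combined fix, using illustrative buffer names, sizes, and a helper that do not necessarily match matmul.py: the locks/buffers are created on first use, keyed by the device of the active op's inputs, so nothing touches the GPU at import time and each device gets its own set of locks.

import torch

# Per-device cache of StreamK state, filled lazily on the first StreamK call.
_streamk_state = {}

def _get_streamk_state(device):
    """Return the StreamK locks/buffers for this device, allocating on demand."""
    key = torch.device(device)
    if key not in _streamk_state:
        num_cus = torch.cuda.get_device_properties(key).multi_processor_count
        locks = torch.zeros(num_cus, dtype=torch.int32, device=key)
        # Partial-result buffer; the size here is a placeholder.
        partials = torch.zeros(num_cus * 256, dtype=torch.float32, device=key)
        _streamk_state[key] = (locks, partials)
    return _streamk_state[key]

def streamk_matmul(a, b):
    # The buffers follow the inputs' device instead of torch's default GPU.
    locks, partials = _get_streamk_state(a.device)
    # ... launch the StreamK kernel with a, b, locks, and partials here ...
    return torch.matmul(a, b)  # placeholder result so the sketch runs end to end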

Test Plan

New edge-case matrix sizes that exercise the K dimension where the tl.assume issue occurred were added to both test_matmul_correctness.py and test_addmm_correctness.py, both to confirm the problem is fixed and to catch future regressions.
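
For reference, the added cases look roughly like the parametrization below; the exact shapes, dtypes, and tolerances in the PR may differ, and tritonblas.matmul is assumed here as the entry point.

import pytest
import torch
import tritonblas  # package name assumed for this sketch

# Shapes whose K is smaller than a typical BLK_K, which previously tripped
# the tl.assume error.
@pytest.mark.parametrize("m,n,k", [(128, 128, 8), (256, 64, 4), (64, 64, 1)])
def test_matmul_small_k(m, n, k):
    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    out = tritonblas.matmul(a, b)
    torch.testing.assert_close(out, a @ b, rtol=1e-2, atol=1e-2)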

The tests were also updated to remove the previous workaround for the GPU runtime issues (forcing Inductor to use a single thread), now that the issue is solved in a more robust way.
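
That workaround presumably amounted to something like the snippet below; compile_threads is Inductor's knob for its parallel compile workers, though the exact form used in the tests may have differed.

import torch._inductor.config as inductor_config

# Force Inductor onto a single compile worker so concurrent graph passes
# cannot race on GPU initialization; no longer needed after this PR.
inductor_config.compile_threads = 1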

Test Result

$ pytest tests/test_matmul_correctness.py tests/test_addmm_correctness.py

~ snip ~

===== 3 failed, 313 passed, 15 warnings in 159.07s (0:02:39) =====

The failing tests are the same StreamK issues carried over from the previous PR and still need a separate fix.

Submission Checklist

This commit moves the StreamK locks/buffers from single-default-device
initialization at module import to per-device, tensor-dependent lazy
initialization. This fixes the error caused by conflicting GPU runtime
initializations in the tests and should also better support multi-GPU
setups, because the locks no longer live on a single device (which may
not have been the target device in the first place).
