
@asunderwood

Motivation

There are two bugs in #63:

  • Small-K operations cause Triton errors, which surface as HIP runtime errors.
  • Programs that use tritonBLAS together with torch.compile could hit GPU RuntimeErrors caused by concurrent GPU initialization.

This PR fixes both.

Technical Details

Small-K Bug

This bug is fixed by changing the tl.assume in gemm_context.py to accept 0 as a valid value for num_k_tiles. The situation arises when K is smaller than the BLK_K size used in the kernel, so num_k_tiles evaluates to zero. The code handles this case correctly, but the tl.assume raised an error.
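
As a rough sketch of the kind of change involved (the names num_k_tiles and BLK_K follow the description above, while the tile-count formula and the kernel body are illustrative placeholders, not the actual gemm_context.py code), the hint only needs to be relaxed from strictly positive to non-negative:

import triton
import triton.language as tl

@triton.jit
def k_tile_sum(x_ptr, out_ptr, K, BLK_K: tl.constexpr):
    num_k_tiles = K // BLK_K  # evaluates to 0 whenever K < BLK_K
    # Before the fix the hint excluded zero (tl.assume(num_k_tiles > 0)),
    # which raised an error for small K even though the loop below simply
    # runs zero iterations in that case.
    tl.assume(num_k_tiles >= 0)
    acc = tl.zeros((BLK_K,), dtype=tl.float32)
    for kt in range(num_k_tiles):
        offs = kt * BLK_K + tl.arange(0, BLK_K)
        acc += tl.load(x_ptr + offs)
    tl.store(out_ptr + tl.arange(0, BLK_K), acc)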

GPU RuntimeErrors

Some tensors used as global buffers for the StreamK kernels were initialized at module scope in matmul.py, which meant they were created as soon as the parent program imported the module. This could cause problems in several situations, but most notably it caused intermittent failures with torch.compile when Inductor uses multiple threads for its graph passes and each thread tried to initialize GPU resources at the same time.

This is fixed by moving to a lazy-initialization approach: the tensors are wrapped in a function that is called when a StreamK kernel is invoked, so the GPU is no longer initialized at module import.

Additionally, the locks are now initialized on the GPU where the active StreamK op's tensors reside, whereas previously they were created on the default GPU returned by torch. This may fix #27, but further testing is required; regardless, it is a useful improvement for multi-GPU and non-default-GPU use cases.
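
A minimal sketch of the combined fix, using illustrative buffer names, sizes, and a helper that do not necessarily match matmul.py: the locks/buffers are created on first use, keyed by the device of the active op's inputs, so nothing touches the GPU at import time and each device gets its own set of locks.

import torch

# Per-device cache of StreamK state, filled lazily on the first StreamK call.
_streamk_state = {}

def _get_streamk_state(device):
    """Return the StreamK locks/buffers for this device, allocating on demand."""
    key = torch.device(device)
    if key not in _streamk_state:
        num_cus = torch.cuda.get_device_properties(key).multi_processor_count
        locks = torch.zeros(num_cus, dtype=torch.int32, device=key)
        # Partial-result buffer; the size here is a placeholder.
        partials = torch.zeros(num_cus * 256, dtype=torch.float32, device=key)
        _streamk_state[key] = (locks, partials)
    return _streamk_state[key]

def streamk_matmul(a, b):
    # The buffers follow the inputs' device instead of torch's default GPU.
    locks, partials = _get_streamk_state(a.device)
    # ... launch the StreamK kernel with a, b, locks, and partials here ...
    return torch.matmul(a, b)  # placeholder result so the sketch runs end to end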

Test Plan

New edge-case matrix sizes that exercise the K dimension where the tl.assume issue occurred were added to both test_matmul_correctness.py and test_addmm_correctness.py, both to confirm the problem is fixed and to catch future regressions.
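
For reference, the added cases look roughly like the parametrization below; the exact shapes, dtypes, and tolerances in the PR may differ, and tritonblas.matmul is assumed here as the entry point.

import pytest
import torch
import tritonblas  # package name assumed for this sketch

# Shapes whose K is smaller than a typical BLK_K, which previously tripped
# the tl.assume error.
@pytest.mark.parametrize("m,n,k", [(128, 128, 8), (256, 64, 4), (64, 64, 1)])
def test_matmul_small_k(m, n, k):
    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    out = tritonblas.matmul(a, b)
    torch.testing.assert_close(out, a @ b, rtol=1e-2, atol=1e-2)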

The tests were also updated to remove the previous workaround for the GPU runtime issues (forcing Inductor to use a single thread), now that the issue is solved in a more robust way.
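
That workaround presumably amounted to something like the snippet below; compile_threads is Inductor's knob for its parallel compile workers, though the exact form used in the tests may have differed.

import torch._inductor.config as inductor_config

# Force Inductor onto a single compile worker so concurrent graph passes
# cannot race on GPU initialization; no longer needed after this PR.
inductor_config.compile_threads = 1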

Test Result

$ pytest tests/test_matmul_correctness.py tests/test_addmm_correctness.py

~ snip ~

===== 3 failed, 313 passed, 15 warnings in 159.07s (0:02:39) =====

The failing tests are the same StreamK issues carried over from the previous PR and still need a separate fix.

Submission Checklist

This commit moves the StreamK locks/buffers from single-default-device
initialization at module import to per-device, tensor-dependent lazy
initialization. This fixes the error caused by conflicting GPU runtime
initializations in the tests and should also better support multi-GPU
setups, because the locks no longer live on a single device (which may
not have been the target device in the first place).
