Fix GPU runtime initialization and Small-K bugs #66
Motivation
There are two bugs in #63: a Small-K bug in a `tl.assume` guard, and GPU RuntimeErrors caused by eager buffer initialization at module import. This PR fixes both.
Technical Details
Small-K Bug
This bug is fixed by changing the `tl.assume` in `gemm_context.py` to accept 0 as a valid value for `num_k_tiles`. The situation arises when K is smaller than the `BLK_K` size used in the kernel, so `num_k_tiles` comes out as zero. The code handles this case correctly, but the `tl.assume` raised an error.
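A minimal sketch of the change, with hypothetical names and partitioning arithmetic (the real guard and its surrounding code in `gemm_context.py` differ):

```python
import triton
import triton.language as tl

@triton.jit
def _streamk_k_loop_sketch(K, BLK_K: tl.constexpr):
    # Hypothetical reduction of the real setup: the tile count comes out
    # as 0 when K < BLK_K. The loop below handles zero iterations fine,
    # so the optimizer hint must allow 0 as well.
    num_k_tiles = K // BLK_K
    # Before (hypothetical): tl.assume(num_k_tiles > 0), which raised
    # an error for the K < BLK_K edge case.
    tl.assume(num_k_tiles >= 0)
    acc = 0.0
    for _ in range(num_k_tiles):
        acc += 1.0  # K-tile body elided
```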
GPU RuntimeErrors
Some tensors used as global buffers for the StreamK kernels were initialized outside of any function in `matmul.py`, which meant they were created at module import in the parent program. This could cause issues in several situations; most notably, it caused intermittent failures with `torch.compile` when Inductor uses multiple threads for its graph passes and each thread attempted to initialize GPU resources at the same time.

This is fixed by moving to a lazy-initialization approach: the tensors are wrapped in a function that is called when a StreamK kernel is invoked, so the GPU is no longer touched at module import.
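A minimal sketch of the pattern, with hypothetical names (`_get_streamk_locks`, `num_sms`); the actual buffers and signatures in `matmul.py` differ:

```python
import functools
import torch

@functools.lru_cache(maxsize=None)
def _get_streamk_locks(device: torch.device, num_sms: int) -> torch.Tensor:
    # Allocated on the first StreamK kernel call rather than at module
    # import, so importing the module never initializes the GPU runtime.
    # One cached buffer per (device, num_sms) pair.
    return torch.zeros(num_sms, dtype=torch.int32, device=device)
```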
Additionally, we now initialize the locks on the GPU where the active StreamK op's tensors reside, whereas before they were created on the default GPU returned by torch. This may fix #27, but further testing is required; regardless, it is a useful improvement for the multi-GPU and non-default-GPU use cases.
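At the call site, the device then comes from the operands rather than torch's current default; a hypothetical sketch continuing the example above:

```python
import torch

def _launch_streamk(a: torch.Tensor, b: torch.Tensor):
    # Hypothetical launch path: the locks live on the same GPU as the
    # inputs, so non-default-device and multi-GPU callers are handled.
    props = torch.cuda.get_device_properties(a.device)
    locks = _get_streamk_locks(a.device, props.multi_processor_count)
    # ... StreamK kernel launch using `locks` elided ...
```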
Test Plan
New edge-case matrix sizes that exercise the K dimension where the `tl.assume` issue occurred were added to both `test_matmul_correctness.py` and `test_addmm_correctness.py`, both to verify the fix and to catch future regressions (sketched below).

The tests were also updated to remove the previous workaround for the GPU runtime issues (which forced Inductor to use a single thread), now that the issue is solved in a more robust way.
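A hedged sketch of the kind of case added, with hypothetical shapes and a stand-in for the op under test (the real tests import the StreamK matmul and use the project's own shape lists):

```python
import pytest
import torch

matmul = torch.matmul  # stand-in; the real tests import the StreamK matmul op

# Hypothetical shapes: K below any BLK_K the kernel uses, so num_k_tiles
# comes out as 0 and the tl.assume edge case is exercised.
@pytest.mark.parametrize("m,n,k", [(128, 128, 1), (64, 256, 8), (256, 64, 16)])
def test_matmul_small_k(m, n, k):
    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    torch.testing.assert_close(matmul(a, b), a @ b, rtol=1e-2, atol=1e-2)
```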
Test Result
The failing tests are the same StreamK issues from the previous PR, which will still need a separate fix.
Submission Checklist