Add Fused Multi-Head Attention example #16
Draft
See also #15
Seems to fall slightly short of my NNop / ONIONop baseline (no WMMA), although I haven't compared it to the Python version. On my GPU, it compiles and runs fastest with tile_n=32 and tile_m=32.
Notably, cutile-python has a `latency` argument for `ct.load`, as well as `num_ctas` and `occupancy` arguments for the kernel, which might affect performance. The Python version also does a kernel-config autotune by searching a space of hand-picked configurations. Currently only tested on Float32.
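For reference, a minimal sketch of what a similar config autotune could look like on the Julia side. This assumes a hypothetical `fused_attention!` entry point taking `tile_m`/`tile_n` keywords; the names and config space here are illustrative, not the actual API of this PR or of cutile-python.

```julia
using CUDA

# Hypothetical autotune over a hand-picked space of tile configurations,
# keeping whichever one times fastest on the current GPU.
function autotune_config(o, q, k, v)
    configs = [(32, 32), (64, 32), (64, 64), (128, 32)]  # (tile_m, tile_n)
    best_t, best_cfg = Inf, first(configs)
    for (tile_m, tile_n) in configs
        fused_attention!(o, q, k, v; tile_m, tile_n)         # warm-up / compile
        t = CUDA.@elapsed fused_attention!(o, q, k, v; tile_m, tile_n)
        if t < best_t
            best_t, best_cfg = t, (tile_m, tile_n)
        end
    end
    return best_cfg
end
```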
Another thing that might be important for correctness or covering edge cases: should `flush_to_zero` be exposed? It's used in e.g. `exp2`.
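For context, a rough plain-Julia illustration of what flush-to-zero semantics mean around `exp2` (mirroring PTX `ex2.approx.ftz.f32`, where subnormal inputs and results are flushed to sign-preserving zero). The helper names below are made up for exposition and are not part of any kernel API.

```julia
# Model of flush-to-zero behaviour around exp2, for exposition only.
ftz(x::Float32) = issubnormal(x) ? copysign(0.0f0, x) : x

exp2_ftz(x::Float32) = ftz(exp2(ftz(x)))

exp2_ftz(1.0f-40)   # subnormal input is treated as 0, so the result is exactly 1.0f0
exp2_ftz(-140.0f0)  # 2^-140 is subnormal, so the result is flushed to 0.0f0
```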