Add Fused Multi-Head Attention example #16

AntonOresten · 2026-01-10T16:01:36Z

See also #15

This seems to fall slightly short of my NNop / ONIONop baseline (no WMMA): about 16% slower in the benchmark below. I haven't compared it to the Python version. On my GPU, it compiles and runs fastest with tile_n=32 and tile_m=32:

julia> begin
           T = Float32
           D, QL, KL, H, B = 64, 4096, 4096, 4, 4
           q = CUDA.randn(T, D, QL, H, B)
           k = CUDA.randn(T, D, KL, H, B)  
           v = CUDA.randn(T, D, KL, H, B)
       end;

julia> @b CUDA.@sync ONIONop.flash_attention(q, k, v, causal=false)
9.559 ms (339 allocs: 7.875 KiB)

julia> @b CUDA.@sync cutile_fmha(q, k, v, causal=false, tile_m=32, tile_n=32)
11.058 ms (540 allocs: 23.109 KiB)

Notably, cutile-python has a latency argument for ct.load, as well as num_ctas and occupancy arguments for the kernel, which might affect performance. The Python version also autotunes the kernel config by searching a space of hand-picked configurations.
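For illustration only, a search over hand-picked tile sizes could be sketched in plain Julia as below. `pick_fastest` and the `configs` list are hypothetical names, not part of this PR or of cutile-python, and the candidate tile sizes are assumptions rather than the configurations the Python version uses.

```julia
# Hypothetical helper (not part of this PR): time `run(cfg)` for each candidate
# config and return the fastest config together with its best time in seconds.
function pick_fastest(run, configs; reps=3)
    best_cfg, best_t = nothing, Inf
    for cfg in configs
        run(cfg)  # warm-up call so compilation is not included in the timing
        t = minimum([@elapsed(run(cfg)) for _ in 1:reps])
        if t < best_t
            best_cfg, best_t = cfg, t
        end
    end
    return best_cfg, best_t
end

# Assumed search space; the values are illustrative only.
configs = [(tile_m=m, tile_n=n) for m in (32, 64, 128) for n in (32, 64, 128)]

# Usage, assuming q, k, v and cutile_fmha from the benchmark above:
# best, t = pick_fastest(cfg -> CUDA.@sync(cutile_fmha(q, k, v; causal=false, cfg...)), configs)
```

Taking the minimum over a few repetitions reduces timing noise; a real autotuner would probably also cache the result per input shape.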

Currently only tested on Float32.

Another thing that might matter for correctness or edge-case coverage: should flush_to_zero be exposed? It is used in e.g. exp2.

AntonOresten · 2026-01-17T23:21:36Z

I'm seeing some odd errors when branching:

This works:

        qk = if !EVEN_K[] && j >= mask_start
            offs_n = ((j-Int32(1)) * TILE_N[]) .+ offs_n_tile
            mask = ct.full((TILE_N[], TILE_M[]), true, Bool)
            mask = mask .& (offs_n .<= k_seqlen)
            mask = ct.where(mask, ct.zeros((TILE_N[], TILE_M[],), Float32), ct.full((TILE_N[], TILE_M[],), -Inf32, Float32))
            qk .+ mask
        else
            qk
        end

but this doesn't:

        if !EVEN_K[] && j >= mask_start
            offs_n = ((j-Int32(1)) * TILE_N[]) .+ offs_n_tile
            mask = ct.full((TILE_N[], TILE_M[]), true, Bool)
            mask = mask .& (offs_n .<= k_seqlen)
            mask = ct.where(mask, ct.zeros((TILE_N[], TILE_M[],), Float32), ct.full((TILE_N[], TILE_M[],), -Inf32, Float32))
            qk = qk .+ mask
        end

nor does this:

        qk = if !EVEN_K[] && j >= mask_start
            offs_n = ((j-Int32(1)) * TILE_N[]) .+ offs_n_tile
            mask = ct.full((TILE_N[], TILE_M[]), true, Bool)
            if !EVEN_K[]
                mask .& (offs_n .<= k_seqlen)
            end
            mask = ct.where(mask, ct.zeros((TILE_N[], TILE_M[],), Float32), ct.full((TILE_N[], TILE_M[],), -Inf32, Float32))
            qk .+ mask
        else
            qk
        end

In the second and third blocks, I get "ERROR: SSAValue %___ not found in context".

After removing the second condition (j >= mask_start), a nested if block suddenly works, and the outer else block is no longer needed:

        if !EVEN_K[]
            offs_n = ((j-Int32(1)) * TILE_N[]) .+ offs_n_tile
            mask = ct.full((TILE_N[], TILE_M[]), true, Bool)
            if !EVEN_K[]
                mask = mask .& (offs_n .<= k_seqlen)
            end
            mask = ct.where(mask, ct.zeros((TILE_N[], TILE_M[],), Float32), ct.full((TILE_N[], TILE_M[],), -Inf32, Float32))
            qk = qk .+ mask
        end

Does the if condition need to depend only on compile-time constants?

I need runtime branching like this to implement the padding and causal masks properly.
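As a reference for what the combined mask should compute, here is a minimal CPU sketch in plain Julia (Base only, no cuTile). The function name `additive_mask`, the query-tile index `i`, and the assumption that query and key positions are aligned 1-based indices are illustrative and not taken from the PR; the (TILE_N, TILE_M) keys-by-queries layout follows the snippets above.

```julia
# Plain-Julia CPU reference, not cuTile code. Builds the additive mask for one
# (tile_n, tile_m) tile of qk scores: 0 where a key may be attended, -Inf otherwise.
# `j` is the 1-based key-tile index and `i` the 1-based query-tile index (assumed).
function additive_mask(j, i, tile_n, tile_m, k_seqlen; causal=false)
    offs_n = (j - 1) * tile_n .+ (1:tile_n)   # key positions covered by this tile
    offs_m = (i - 1) * tile_m .+ (1:tile_m)   # query positions covered by this tile
    # Padding mask: keep only keys that lie inside the real sequence length.
    keep = (offs_n .<= k_seqlen) .& trues(tile_n, tile_m)
    if causal
        # Causal mask: a query at position m may only see keys at positions n <= m.
        keep = keep .& (offs_n .<= offs_m')
    end
    return ifelse.(keep, 0f0, -Inf32)
end

# Padding only: the second key tile of size 4 with k_seqlen = 6 keeps keys 5 and 6.
pad = additive_mask(2, 1, 4, 4, 6)

# Causal on the diagonal tile: key 2 is hidden from query 1.
cau = additive_mask(1, 1, 2, 2, 10; causal=true)
```

Adding the result to qk before the softmax gives the same effect as the `ct.where` construction above; the open question is only whether the branch that decides when to apply it can depend on the runtime value j.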

maleadt · 2026-01-19T11:17:12Z

> In the second and third blocks, I get "ERROR: SSAValue %___ not found in context"

That's an IRStructurizer error. Can you provide an MWE?

AntonOresten added 2 commits January 10, 2026 17:03

add mod1, max, min

9178f78

add fmha

1bd757c

AntonOresten force-pushed the fmha branch from f7888b1 to 1bd757c on January 10, 2026 16:03

AntonOresten added 3 commits January 10, 2026 18:13

fix tests

905e732

Update fmha.jl

d6f9d9b

Merge branch 'main' into fmha

5c27549

This was referenced Jan 13, 2026

Expose entry hints through launch #27

Merged

Allow redefinition of kernel methods #31

Merged

Expose load/store optimization hints #32

Merged
