Conversation
- Add kernels/simple_gemm.py: simple GEMM kernel (C = A × B^T) for AMD GPUs using MFMA instructions with an XOR16 LDS swizzle and boundary checks for non-aligned M, N, K dimensions
- Add tests/kernels/test_simple_gemm.py: test script with aligned and non-aligned dimension test cases
- Add tests/kernels/test_moe_stage1_simple.py: standalone test script for the MoE Stage1 kernel
- Add run.sh: shell script for running tests and collecting ROCm thread traces
- Add input.yaml: ROCm profiler configuration for thread trace collection

Co-authored-by: Cursor <cursoragent@cursor.com>
…n simple GEMM

- Add waves_per_eu parameter to compiler.compile() for AMDGPU occupancy hints
- Implement _apply_waves_per_eu_on_llvm_funcs() to set the amdgpu-waves-per-eu attribute on GPU kernel functions via LLVM passthrough
- Refactor simple_gemm to use mask-based loads/stores for M/N boundaries instead of host-side padding (Triton-like approach)
- Only the K dimension is padded on the host (required for MFMA vector loads)
- Add --waves_per_eu CLI argument to test_simple_gemm.py

Co-authored-by: Cursor <cursoragent@cursor.com>
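The mask-based boundary handling mentioned above can be sketched in plain Python. This is a simplified stand-in, not the FlyDSL kernel code: `tile_row_mask` and `masked_load` are hypothetical names illustrating the Triton-like idea that each tile computes a per-element validity mask instead of relying on host-side padding.

```python
# Hypothetical sketch of Triton-style mask-based boundary handling for a
# tiled GEMM: each tile derives a per-lane mask from its global indices.
def tile_row_mask(block_start: int, block_size: int, dim: int) -> list:
    """True for lanes whose global row index lies inside the matrix."""
    return [block_start + i < dim for i in range(block_size)]

def masked_load(row, mask, other=0.0):
    """Load elements where the mask is True; substitute `other` elsewhere."""
    return [x if m else other for x, m in zip(row, mask)]

# A 65-row matrix processed in 32-row tiles: the last tile is mostly empty.
M, BLOCK_M = 65, 32
last_tile = tile_row_mask(2 * BLOCK_M, BLOCK_M, M)
assert sum(last_tile) == 1  # only row 64 is valid in the final tile
```

Stores use the same mask in reverse: lanes with `mask=False` simply skip the write, so no host-side padding of M or N is needed.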
- Add OOB_OFFSET (0x80000000) and MAX_NUM_RECORDS (0x7FFFFFFE) constants matching Triton's BufferOpsEmitter for reliable hardware OOB detection
- Update buffer load/store to use OOB_OFFSET for masked-out elements, ensuring the hardware always detects OOB when mask=False
- Simplify GEMM kernel masking by removing redundant K boundary checks, since the K dimension is guaranteed to be padded to tile_k
- Enable additional test cases in run.sh

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add unsafe_fp_math and fast_fp_math parameters to the compiler pipeline
- Replace __ocml_exp2_f32 library calls with llvm.intr.exp2 intrinsics
- Apply unsafe-fp-math function attributes to GPU kernel llvm.func ops
- Add fastmath parameter support to the arith.maximum operation
- Improve test reproducibility with seed control and MD5 hash comparison
- Add a detailed array comparison utility for debugging numerical differences

Co-authored-by: Cursor <cursoragent@cursor.com>
- Switched the **active** `v4_4` kernel path to a true **MFMA32** pipeline (`mfma_f32_32x32x8f16`) with `BLOCK_M=128`, `BLOCK_N=32`, `NUM_WAVES=4`.
- Remapped the compute flow to **`K @ Q^T -> online softmax -> V^T @ P`**.
- Kept intermediate **S/P in registers** (removed the previous `P -> LDS -> VGPR` roundtrip).
- Split LDS staging for K and `V^T` into separate regions and removed an inner-loop barrier to cut synchronization overhead.
- Updated test constraints and compile options in `test_flash_attention_v4_4.py` (`seq_len % 128`, `head_dim % 32`, `waves_per_eu=3`).
- Final measured result at the target shape: **12350.8 us/iter**, with accuracy preserved (`diff.abs.max=4.88e-4`, `max_diff_thr=3.255208e-04`), about **2.17x faster** than the previous 26751.5 us.
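The online-softmax step in the `K @ Q^T -> online softmax -> V^T @ P` flow is what lets S/P stay in registers: a running max and normalizer are rescaled as each score block arrives, so no full pass over the row is ever materialized. A minimal scalar sketch of the idea (not the kernel code; the real pipeline operates on MFMA tile fragments):

```python
import math

def online_softmax_weighted_sum(scores, values):
    # One-pass (online) softmax: maintain running max m, normalizer l,
    # and weighted accumulator acc, rescaling old state when m changes.
    m, l, acc = float("-inf"), 0.0, 0.0
    for s, v in zip(scores, values):
        m_new = max(m, s)
        alpha = math.exp(m - m_new) if l > 0.0 else 0.0  # rescale factor
        p = math.exp(s - m_new)
        l = l * alpha + p
        acc = acc * alpha + p * v
        m = m_new
    return acc / l

# Agrees with the two-pass reference softmax(scores) . values
scores, values = [1.0, 3.0, 2.0], [10.0, 20.0, 30.0]
w = [math.exp(s - max(scores)) for s in scores]
ref = sum(wi * vi for wi, vi in zip(w, values)) / sum(w)
assert abs(online_softmax_weighted_sum(scores, values) - ref) < 1e-12
```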
Add gated CK-style N128/prefetch/reduction experiments plus ROCDL phase-fence wrappers so performance tuning can be A/B tested without regressing the stable target-shape path. Co-authored-by: Cursor <cursoragent@cursor.com>
Align kernel, test, and run-script references so the renamed entrypoint is used consistently across build and benchmark workflows. Co-authored-by: Cursor <cursoragent@cursor.com>
…c benchmarking. Drop the v4.3 comparison path to keep tests focused on flash_attn_func and align run.sh defaults with the updated benchmark flow. Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve merge conflict in flydsl/src/flydsl/compiler/compiler.py:

- Keep all PR functions: _replace_ocml_exp2_with_intrinsic, _apply_unsafe_fp_math_on_llvm_funcs, _apply_waves_per_eu_on_llvm_funcs, _apply_flat_work_group_size_on_llvm_funcs
- Keep main's _apply_waves_per_eu_hint (gpu.func level, complementary)
- Combine compile() signature: keep waves_per_eu/flat_work_group_size/unsafe_fp_math/fast_fp_math params from the PR, adopt the Optional return type from main

Co-authored-by: Cursor <cursoragent@cursor.com>
```python
def _apply_waves_per_eu_on_llvm_funcs(module: ir.Module, waves_per_eu: int) -> None:
```
There is already a similar function for this.
Good catch! The old `_apply_waves_per_eu_hint` operated on `gpu.func` (pre-LLVM lowering) via `rocdl.waves_per_eu`, while the new `_apply_waves_per_eu_on_llvm_funcs` operates on `llvm.func` (post-lowering) via the LLVM passthrough attribute `amdgpu-waves-per-eu`. The passthrough approach is more reliable since it directly controls the LLVM backend. I've removed the old function in commit ac1d477.
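For illustration, the passthrough mechanism can be sketched on textual MLIR. This is a simplified stand-in: the real function walks `ir.Module` operations through the MLIR Python bindings, whereas `apply_waves_per_eu_textual` below is a hypothetical regex-based approximation showing what attribute ends up on each `llvm.func`.

```python
import re

def apply_waves_per_eu_textual(mlir_text: str, waves_per_eu: int) -> str:
    # Tag every llvm.func definition with the amdgpu-waves-per-eu
    # attribute via the LLVM passthrough list.
    attr = (f'attributes {{passthrough = '
            f'[["amdgpu-waves-per-eu", "{waves_per_eu}"]]}}')
    # Insert the attribute between the signature and the body brace.
    return re.sub(r'(llvm\.func @\w+\([^)]*\))\s*\{',
                  rf'\1 {attr} {{', mlir_text)

asm = "llvm.func @gemm_kernel(%arg0: !llvm.ptr) {\n  llvm.return\n}"
out = apply_waves_per_eu_textual(asm, 2)
assert '"amdgpu-waves-per-eu", "2"' in out
```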
```python
def _replace_ocml_exp2_with_intrinsic(module: ir.Module) -> ir.Module:
```
We can't use `math.exp2` directly here: the `convert-gpu-to-rocdl` pass unconditionally lowers it to `__ocml_exp2_f32` (a safe but slow 6-instruction library call with range reduction + `v_ldexp_f32`). There's no pass-level option to emit `llvm.intr.exp2` instead.

This function is a post-lowering optimization: it replaces the OCML library call with `llvm.intr.exp2` + fast-math flags, giving us a single `v_exp_f32` instruction. I've updated the docstring in commit ac1d477 to clarify this rationale and added a TODO to replace it with a proper MLIR pass when upstream support is available.
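The rewrite can likewise be illustrated on textual MLIR. Again a hypothetical stand-in, not the real pass (which operates on module ops): it shows the shape of the substitution from the OCML library call to the `llvm.intr.exp2` intrinsic with fast-math flags.

```python
import re

def replace_ocml_exp2_textual(asm: str) -> str:
    # Simplified textual stand-in for the post-lowering rewrite: swap the
    # OCML library call for the LLVM exp2 intrinsic with fast-math flags.
    return re.sub(
        r'llvm\.call @__ocml_exp2_f32\((%\w+)\)\s*:\s*\(f32\) -> f32',
        r'llvm.intr.exp2(\1) {fastmathFlags = #llvm.fastmath<fast>}'
        r' : (f32) -> f32',
        asm)

asm = '%1 = llvm.call @__ocml_exp2_f32(%0) : (f32) -> f32'
out = replace_ocml_exp2_textual(asm)
assert 'llvm.intr.exp2(%0)' in out
assert '__ocml_exp2_f32' not in out
```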
```python
# Descriptor uses i32 bytes; clamp to the max representable.
if nbytes > 0xFFFFFFFF:
    nbytes = 0xFFFFFFFF
# Clamp to MAX_NUM_RECORDS to ensure OOB_OFFSET works correctly.
```
Why change this? Is this for dynamic shapes?
Not related to dynamic shapes. This is a correctness fix for masked buffer loads/stores.

The previous code used `num_records=0xFFFFFFFF` with `mask_offset=0x7FFFFFFF`. The GPU does an unsigned comparison `offset < num_records` for OOB detection, so `0x7FFFFFFF < 0xFFFFFFFF` is true and the mask never triggers OOB, which is a bug.

Changed to match Triton's approach:

- `MAX_NUM_RECORDS = 0x7FFFFFFE`
- `OOB_OFFSET = 0x80000000`
- Since `0x80000000 > 0x7FFFFFFE` (unsigned), hardware OOB is always triggered when mask=False. ✅
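The unsigned comparison argument above is easy to check numerically. A minimal model of the hardware bounds check (a sketch, not the descriptor logic itself):

```python
OOB_OFFSET = 0x80000000
MAX_NUM_RECORDS = 0x7FFFFFFE

def buffer_access_in_bounds(offset: int, num_records: int) -> bool:
    # Model of the hardware check: unsigned 32-bit `offset < num_records`.
    return (offset & 0xFFFFFFFF) < (num_records & 0xFFFFFFFF)

# Old (buggy) combination: the masked-out offset still passes the check,
# so masked lanes could read/write real memory.
assert buffer_access_in_bounds(0x7FFFFFFF, 0xFFFFFFFF)

# New combination: the masked-out offset is always out of bounds,
# so the hardware drops the access for every mask=False lane.
assert not buffer_access_in_bounds(OOB_OFFSET, MAX_NUM_RECORDS)
```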
…of removed _apply_waves_per_eu_hint
Motivation
Implement a high-performance Flash Attention forward kernel in FlyDSL, targeting AMD Instinct GPUs (MI308X/MI325X). This PR provides a pure FlyDSL implementation of causal multi-head attention (MHA) with MFMA32-based GEMM pipelines, aiming to match or approach the performance of hand-optimized CK (Composable Kernel) implementations.
Technical Details
- Uses `mfma_f32_32x32x8f16` for both GEMM stages (S = K @ Q^T and O = V^T @ P).
- Grid: `(batch * num_q_tiles * num_heads,)` where `num_q_tiles = seq_len / BLOCK_M`.
- Constraints: `head_dim % 32 == 0`, `head_dim >= 64`, `seq_len % 128 == 0`.

Test Plan
- Accuracy: compared against the `F.scaled_dot_product_attention` reference with max error < 1e-2 and cosine similarity > 0.99.
- Performance: measured with `run_perftest` using 100 iterations after 5 warmup iterations on MI325X and MI308X.

Test Result
Configuration: B=1, S=8192, H=64, D=128, causal, fp16
MI325X
MI308X
Submission Checklist