
Enable Triton MXFP4 MoE on gfx950 for GPT-OSS #220

Open
ChuanLi1101 wants to merge 1 commit into gpt-oss-allreduce-rmsnorm-fusion from gpt-oss-triton-moe-gfx950

Conversation


@ChuanLi1101 (Collaborator) commented Feb 16, 2026

Summary

  • Extend the Triton MoE kernel path (matmul_ogs + routing from triton_kernels) to gfx950 (MI355X) when ATOM_USE_TRITON_GEMM=1
  • Enable GPT-OSS MXFP4 models on MI355X to use the optimized Triton MoE path with fused routing, Swiglu activation (alpha=1.702, limit=7.0), and matmul_ogs GEMM
  • Opt-in only: without ATOM_USE_TRITON_GEMM=1, gfx950 continues to use the CK/ASM MoE path (no behavior change)
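For reference, the Swiglu activation with these parameters can be sketched in NumPy as below. This is an illustrative sketch, not the Triton kernel itself; the exact clamping behavior inside the fused kernel is an assumption.

```python
import numpy as np

def swiglu(x_glu: np.ndarray, x_linear: np.ndarray,
           alpha: float = 1.702, limit: float = 7.0) -> np.ndarray:
    """GPT-OSS-style clamped Swiglu: x_glu * sigmoid(alpha * x_glu) * (x_linear + 1)."""
    x_glu = np.clip(x_glu, None, limit)          # clamp the gate input from above
    x_linear = np.clip(x_linear, -limit, limit)  # clamp the linear input symmetrically
    return x_glu * (1.0 / (1.0 + np.exp(-alpha * x_glu))) * (x_linear + 1.0)
```

With limit=7.0, large gate inputs saturate near 7 (since sigmoid(alpha * 7) is close to 1), which bounds the activation's dynamic range.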

Motivation

The triton_kernels package already supports gfx950 via GFX950MXScaleLayout in _swizzle_mxfp4 (see fused_moe_triton.py), but Mxfp4MoEMethod only enables the Triton path for gfx94x. This PR extends it to gfx950 when explicitly requested via env var.

Builds on #218 (AllReduce+RMSNorm fusion for GPT-OSS).

Changes

  • atom/model_ops/moe.py: Extended use_triton check in Mxfp4MoEMethod to include gfx950 when ATOM_USE_TRITON_GEMM=1
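The shape of the extended check is roughly the following. This is a hypothetical sketch; the actual condition in Mxfp4MoEMethod, including how gfx94x is gated, may differ.

```python
import os

def use_triton_moe(gfx_arch: str) -> bool:
    # Assumed: gfx94x was already on the Triton MoE path before this PR.
    if gfx_arch.startswith("gfx94"):
        return True
    # New in this PR: gfx950 (MI355X) is opt-in via ATOM_USE_TRITON_GEMM=1.
    if gfx_arch == "gfx950":
        return os.environ.get("ATOM_USE_TRITON_GEMM") == "1"
    return False  # all other archs fall back to the CK/ASM MoE path
```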

Precision

  • No precision loss: same MXFP4 weight data (just different layout), same Swiglu activation parameters, same softmax routing
  • The Triton path uses the same weight data as CK, just swizzled for GPU efficiency via _swizzle_mxfp4
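To see why a layout change cannot affect precision, consider a minimal MXFP4 (OCP microscaling FP4: E2M1 values sharing an E8M0 block scale) dequantization sketch. Swizzling only permutes where nibbles and scales live in memory; the values they decode to are unchanged. This is a scalar illustration, not how the actual kernels operate on packed tensors.

```python
# E2M1 magnitude table for FP4 (3 bits after the sign bit)
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequant_mxfp4_block(nibbles: list[int], scale_exp: int) -> list[float]:
    """Decode one MXFP4 block: fp4 values times a shared power-of-two scale."""
    scale = 2.0 ** (scale_exp - 127)  # E8M0: biased exponent, no mantissa
    return [(-1.0 if n & 0x8 else 1.0) * FP4_MAGNITUDES[n & 0x7] * scale
            for n in nibbles]
```

Decoding a permuted copy of the same nibbles yields the same multiset of values, which is the sense in which the CK and Triton paths see identical weight data.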

Test Plan

  • Run GPT-OSS-120B MXFP4 inference with ATOM_USE_TRITON_GEMM=1 on MI355X
  • Compare output accuracy against CK path (without ATOM_USE_TRITON_GEMM)
  • Benchmark throughput on InferenceMax with various ISL/OSL combinations
  • Verify no regression when ATOM_USE_TRITON_GEMM is not set (default CK path)

Extend the Triton MoE kernel path (matmul_ogs + routing from triton_kernels) to gfx950 (MI355X) when ATOM_USE_TRITON_GEMM is enabled. The triton_kernels package already supports gfx950 via GFX950MXScaleLayout.

This allows GPT-OSS MXFP4 models on MI355X to use the optimized Triton MoE path with fused routing, Swiglu activation, and matmul_ogs GEMM. The change is opt-in: without ATOM_USE_TRITON_GEMM=1, gfx950 continues to use the CK/ASM path.

Co-authored-by: Cursor <cursoragent@cursor.com>
@azaidy left a comment

LGTM!

@valarLip
Collaborator

Does this work with triton>=3.5?

@ChuanLi1101
Collaborator Author

This is supposed to be version agnostic. As far as I know, there is no Triton version check anywhere in the codebase; the only guard is has_triton_kernels(), which checks whether the package is importable.
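An importability guard like has_triton_kernels() typically boils down to the following pattern. This is a sketch of the idiom; the actual helper in the codebase may be implemented differently.

```python
import importlib.util

def has_triton_kernels() -> bool:
    # True if the triton_kernels package can be imported in this environment.
    return importlib.util.find_spec("triton_kernels") is not None
```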

@valarLip
Collaborator

> This is supposed to be version agnostic. As far as I know, there is no Triton version check anywhere in the codebase; the only guard is has_triton_kernels(), which checks whether the package is importable.

triton_kernels and triton are two separate pip packages. It would be great if we could keep the kernels we need in aiter.

@ChuanLi1101
Collaborator Author

Good point: moving the MoE kernels from triton_kernels into aiter would simplify the dependencies (one less external package).

That said, this PR does not add a new dependency; it only extends the existing triton_kernels path (already used for gfx94x) to gfx950. The has_triton_kernels() guard and import have been there since the original Triton MoE integration.
