
Conversation


@tdophung tdophung commented Nov 25, 2025

Description

Step 2 of a multi-step effort to have JAX execute Triton kernels, enabling MoE support on a single GPU.
Steps:

  • Move Triton kernels to common
  • Use jax-triton to call the Triton kernels
  • Write a JAX Primitive for this op.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

1/ Add jax-triton calls for each permutation kernel in common/triton
2/ Add test_permutation to JAX, covering:
- make row ID mapping
- chunk sorting
- permute
- unpermute
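For readers unfamiliar with the mask-map layout, here is a minimal NumPy sketch of the permute/unpermute semantics these tests exercise. The names (`ref_permute`, `ref_unpermute`) and the expert-major destination ordering are illustrative assumptions, not the actual test code:

```python
import numpy as np

def ref_permute(inp, row_id_map):
    # row_id_map[t, e] is the destination row of token t's copy for expert e,
    # or -1 if token t is not routed to expert e (assumed layout).
    num_out = int((row_id_map >= 0).sum())
    out = np.zeros((num_out, inp.shape[1]), dtype=inp.dtype)
    for t in range(row_id_map.shape[0]):
        for e in range(row_id_map.shape[1]):
            dest = row_id_map[t, e]
            if dest >= 0:
                out[dest] = inp[t]  # scatter one copy per selected expert
    return out

def ref_unpermute(permuted, row_id_map):
    # Gather-and-accumulate: each token sums back all of its expert copies.
    out = np.zeros((row_id_map.shape[0], permuted.shape[1]), dtype=permuted.dtype)
    for t in range(row_id_map.shape[0]):
        for e in range(row_id_map.shape[1]):
            dest = row_id_map[t, e]
            if dest >= 0:
                out[t] += permuted[dest]
    return out
```

With single-expert routing, the roundtrip `ref_unpermute(ref_permute(x, m), m)` reproduces `x` exactly; with top-k routing each token comes back as the sum of its copies unless merging probabilities are applied.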

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: tdophung <tdophung@nvidia.com>
@tdophung
Collaborator Author

/te-ci L0

@tdophung tdophung marked this pull request as draft November 25, 2025 02:16
@tdophung tdophung changed the title Jax wrapper for Permutation Triton kernel [JAX] Wrapper for Permutation Triton kernel Nov 25, 2025
@tdophung tdophung self-assigned this Nov 25, 2025
@tdophung tdophung added the MoE label Nov 25, 2025
@greptile-apps
Contributor

greptile-apps bot commented Nov 25, 2025

Greptile Overview

Greptile Summary

Adds JAX-Triton wrappers for permutation kernels to enable MoE support on a single GPU, completing step 2 of the multi-step implementation plan.

Key Changes

  • Created transformer_engine/jax/triton/permutation.py with JAX wrappers for 5 permutation operations
  • Refactored Triton kernel parameter ordering in common/triton/permutation.py to group inputs, outputs, strides, and metadata
  • Added comprehensive test suite with reference implementations and roundtrip validation tests
  • Created new module transformer_engine/jax/triton/ with proper exports

Implementation Details

  • JAX wrappers compute strides manually since JAX arrays lack .strides attribute
  • Uses dummy tensors for None pointers as jax-triton doesn't handle None correctly
  • Three-pass approach for make_row_id_map: block cumsum, global cumsum, sparse-to-dense conversion
  • Tests cover multiple dtypes (float32, bfloat16), various token/expert/hidden size combinations, and optional probability handling
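Since `jax.Array` exposes `.shape` but not `.strides`, the wrappers must derive strides themselves. For a C-contiguous array this is a right-to-left product over the shape; a minimal sketch (the helper name is hypothetical, not the function used in the PR):

```python
def contiguous_strides(shape):
    """Element strides of a C-contiguous array: the innermost dimension
    has stride 1, and each outer stride is the product of all inner sizes."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)
```

These are element strides, matching NumPy's byte strides divided by the itemsize; they can then be passed to the Triton kernels as ordinary scalar arguments.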

Confidence Score: 4/5

  • Safe to merge with minor verification recommended for parameter ordering in kernel calls
  • Well-structured implementation with comprehensive tests and clear documentation; score reflects need to verify the refactored parameter ordering works correctly across all JAX-Triton kernel invocations at runtime
  • Verify transformer_engine/jax/triton/permutation.py parameter ordering matches refactored Triton kernels - manual testing recommended

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| transformer_engine/common/triton/permutation.py | 5/5 | Reordered kernel parameters to group input/output pointers, sizes, strides, and metas for better organization and consistency with JAX-Triton calling conventions |
| transformer_engine/jax/triton/permutation.py | 4/5 | New JAX wrapper for Triton permutation kernels with proper stride computation and dummy-tensor handling for None pointers; includes all 5 operations for MoE support |
| transformer_engine/jax/triton/__init__.py | 5/5 | New module initialization file that exports the 5 permutation functions for public API access |
| tests/jax/test_permutation.py | 5/5 | Comprehensive test suite with reference implementations for all 5 permutation operations, including roundtrip tests and various parameter combinations |

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant JAX as JAX Wrapper<br/>(jax/triton/permutation.py)
    participant JT as jax_triton
    participant Triton as Triton Kernel<br/>(common/triton/permutation.py)
    participant GPU as GPU

    Note over User,GPU: MOE Token Permutation Flow

    User->>JAX: make_row_id_map(routing_map, num_tokens, num_experts)
    JAX->>JAX: Compute strides
    JAX->>JT: triton_call(_row_id_map_pass_1_kernel)
    JT->>Triton: Execute Pass 1 (block cumsum)
    Triton->>GPU: Parallel kernel execution
    GPU-->>Triton: row_id_map_pass1, workspace
    Triton-->>JT: Return results
    JT-->>JAX: row_id_map_pass1, workspace
    
    JAX->>JT: triton_call(_row_id_map_pass_2_kernel)
    JT->>Triton: Execute Pass 2 (cumsum all)
    Triton->>GPU: Parallel kernel execution
    GPU-->>Triton: row_id_map_pass2
    Triton-->>JT: Return results
    JT-->>JAX: row_id_map_pass2
    
    JAX->>JAX: Initialize columns [num_experts:] to -1
    
    JAX->>JT: triton_call(_row_id_map_pass_3_kernel)
    JT->>Triton: Execute Pass 3 (sparse to dense)
    Triton->>GPU: Parallel kernel execution
    GPU-->>Triton: row_id_map (final)
    Triton-->>JT: Return results
    JT-->>JAX: row_id_map
    JAX-->>User: Return row_id_map

    User->>JAX: permute_with_mask_map(inp, row_id_map, probs, ...)
    JAX->>JAX: Compute strides & create dummy tensors
    JAX->>JT: triton_call(_permute_kernel)
    JT->>Triton: Execute permutation
    Triton->>GPU: Parallel kernel execution
    GPU-->>Triton: output, permuted_probs
    Triton-->>JT: Return results
    JT-->>JAX: output, permuted_probs
    JAX-->>User: Return output, permuted_probs

    User->>JAX: unpermute_with_mask_map(inp, row_id_map, merging_probs, ...)
    JAX->>JAX: Compute strides & create dummy tensors
    JAX->>JT: triton_call(_unpermute_kernel)
    JT->>Triton: Execute unpermutation
    Triton->>GPU: Parallel kernel execution (accumulate)
    GPU-->>Triton: output, unpermuted_probs
    Triton-->>JT: Return results
    JT-->>JAX: output, unpermuted_probs
    JAX-->>User: Return output, unpermuted_probs
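The three-pass row-id-map construction in the diagram can be sketched in NumPy. This is an illustrative reconstruction under two assumptions: tokens are tiled into blocks (`BLOCK` here is an arbitrary choice, not the kernel's block size), and destinations are grouped expert-major; the real kernels run these passes in parallel on the GPU:

```python
import numpy as np

BLOCK = 4  # hypothetical tile size; the actual kernel block size may differ

def row_id_map_three_pass(routing_map):
    num_tokens, num_experts = routing_map.shape
    num_blocks = (num_tokens + BLOCK - 1) // BLOCK

    # Pass 1: rank of each selected (token, expert) pair within its block,
    # plus per-(expert, block) selection counts.
    rank_in_block = np.zeros((num_tokens, num_experts), dtype=np.int64)
    block_counts = np.zeros((num_experts, num_blocks), dtype=np.int64)
    for b in range(num_blocks):
        rows = routing_map[b * BLOCK:(b + 1) * BLOCK].astype(np.int64)
        rank_in_block[b * BLOCK:(b + 1) * BLOCK] = np.cumsum(rows, axis=0) - rows
        block_counts[:, b] = rows.sum(axis=0)

    # Pass 2: exclusive cumsum over all (expert, block) counts gives each
    # block's base offset in the expert-major permuted layout.
    flat = block_counts.ravel()
    base = (np.cumsum(flat) - flat).reshape(num_experts, num_blocks)

    # Pass 3: scatter the sparse ranks into the dense map (-1 = not routed).
    row_id_map = np.full((num_tokens, num_experts), -1, dtype=np.int64)
    for t in range(num_tokens):
        for e in range(num_experts):
            if routing_map[t, e]:
                row_id_map[t, e] = base[e, t // BLOCK] + rank_in_block[t, e]
    return row_id_map
```

Pass 1 only touches data within one block, pass 2 is a single cumulative sum over the small (num_experts x num_blocks) count matrix, and pass 3 is an independent scatter per element, which is what makes each pass easy to parallelize.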


@greptile-apps greptile-apps bot left a comment


4 files reviewed, no comments


@mingxu1067
Collaborator

Which jax-triton version works with this PR?
I tried pip install jax-triton but got TypeError: CUDABackend.make_ttir() missing 1 required positional argument: 'capability' when running test_permutation.py.

@tdophung
Collaborator Author

/te-ci L0 pytorch

@tdophung
Collaborator Author

Which jax-triton version works with this PR? I tried pip install jax-triton but got TypeError: CUDABackend.make_ttir() missing 1 required positional argument: 'capability' when running test_permutation.py.

For anyone else looking at this and wanting to try it:
You need to build jax-triton from source from this commit: #2419 in order to run test_permutation.py successfully.
