[JAX][Draft] Async issuing D2H memcpy for grouped_gemm group_sizes array #2213
Conversation
I think it is a good improvement for now. We should probably provide a `GroupedLayerNormMLP` VJP op, which encloses the `grouped_gemm_copy_group_sizes` function and the `use_async_d2h_group_sizes` option, so that we don't expose these two to users, as they can be pretty bug-prone.
"supported number ", max_num_gemms, " to be downloaded in advance."); | ||
host_num_gemms = num_gemms; | ||
// Wait for current compute stream to finish | ||
cudaStream_t compute_stream_0 = nvte_get_compute_stream(0); |
@mingxu1067 could you check if this causes the same stream sync issue as last time, when we used the `compute_stream(0)` instead of the stream given by XLA?
Just a note: this part follows the logic in https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/gemm/cublaslt_gemm.cu#L915
```cpp
auto init = [&]() {
  // Event used to signal completion of the async D2H copy of group_sizes
  NVTE_CHECK_CUDA(cudaEventCreate(&d2h_event));
  // Pinned host buffer so the D2H memcpy of up to max_num_gemms sizes can be issued asynchronously
  NVTE_CHECK_CUDA(cudaMallocHost(&host_group_sizes_internal, sizeof(int32_t) * max_num_gemms));
};
```
If this causes any issues, we could consider moving this allocation into the FFI prepare phase.
/te-ci JAX L0
Description
This is a draft PR for saving some work in progress and for discussion.

Recently we used TE/JAX's `grouped_gemm()` interface for a MoE model's inference. Nsys shows a GPU bubble while `grouped_gemm()` is copying the `group_sizes` array from device to host. This was a known issue when we designed the `grouped_gemm()` interface. Its performance impact on training and on the inference prefill stage is relatively small, but it cannot be ignored in the inference decode stage. This draft aims to partially address the bubble issue.

Our target model uses MLP-MoE, i.e., each expert is an MLP layer. After fusing GEMMs, each MLP-MoE layer needs two `grouped_gemm()` calls with the same `group_sizes` array. This PR allows issuing an async D2H copy of the `group_sizes` array before entering `grouped_gemm()`, so that `grouped_gemm()` can reuse the downloaded `group_sizes`. We have validated the correctness of this implementation in our target model.

This PR does not solve the issue of `grouped_gemm()` breaking CUDA graph capture, since the async copy mode still needs to call `cudaEventSynchronize()`. Furthermore, in our implementation for the target model, the D2H memcpy does not overlap with the operations that copy and dispatch tokens to experts, since those JAX-native operations are captured and executed in a CUDA graph, while the async D2H copy does not support CUDA graph.

@phu0ngng @mingxu1067 Please let me know your comments and suggestions. Much appreciated!
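For clarity, here is a minimal sketch of the intended JAX-side call pattern described above. The import path, the `moe_mlp` wrapper, the GELU activation, and the exact argument order of `grouped_gemm()` and `grouped_gemm_copy_group_sizes()` are assumptions for illustration and may differ from the actual TE/JAX API in this PR:

```python
import jax
import transformer_engine.jax as te  # import path assumed

def moe_mlp(tokens, w_up, w_down, group_sizes):
    # Issue the async D2H copy of group_sizes once, before entering grouped_gemm().
    te.grouped_gemm_copy_group_sizes(group_sizes)
    # Both grouped GEMMs of the fused MLP expert reuse the already-downloaded
    # group_sizes instead of each performing its own blocking D2H copy.
    h = te.grouped_gemm(tokens, w_up, group_sizes, use_async_d2h_group_sizes=True)
    h = jax.nn.gelu(h)
    return te.grouped_gemm(h, w_down, group_sizes, use_async_d2h_group_sizes=True)
```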
Type of change
Changes
- Added `GroupedGemmCopySizesPrimitive` for async copying of `group_sizes` from GPU to host
- Added the `use_async_d2h_group_sizes` option for `grouped_gemm()`; the default value is `False`, so the original code path will be used by default

Checklist: