webgpu / nbitmm support for bias and weight_index #26392
Conversation
Pull Request Overview
Adds WebGPU support for bias and weight_index parameters to N-bit matrix multiplication operations, enabling features like stacked weights and bias addition in quantized operations.
Key changes:
- Extended matmul_nbits operations to support optional bias parameter across multiple implementations (DP4A, wide tile, subgroup matrix)
- Added weight_index uniform variable to enable weight stacking and offset computation in quantized matmul
- Refactored Apple-specific shader generation to use WGSL templates instead of inline string concatenation
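For readers unfamiliar with the op, here is a hedged, scalar reference sketch (plain C++, not the WebGPU kernels; nibble unpacking and layouts are simplified, and the zero point is passed explicitly rather than defaulted) of the semantics being extended: y = a · dequant(b)ᵀ + bias, with b and scales taken from one weight_index slice of a stacked weight tensor.

```cpp
// Reference sketch only, not the onnxruntime kernels.
#include <cstdint>
#include <vector>

// a:      (M, K) activations, row-major
// b_q:    (N, K) quantized weights for ONE weight_index slice, one value per entry here
// scales: (N, K/block_size) per-block scales for the same slice
// bias:   (N), optional; broadcast over the output columns
// y:      (M, N)
void MatMulNBitsReference(const std::vector<float>& a, const std::vector<uint8_t>& b_q,
                          const std::vector<float>& scales, const float* bias,
                          int M, int N, int K, int block_size, uint8_t zero_point,
                          std::vector<float>& y) {
  const int blocks_per_col = K / block_size;
  y.assign(static_cast<size_t>(M) * N, 0.0f);
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        const float scale = scales[n * blocks_per_col + k / block_size];
        const float w = (static_cast<int>(b_q[n * K + k]) - zero_point) * scale;  // dequantize
        acc += a[m * K + k] * w;
      }
      y[m * N + n] = acc + (bias ? bias[n] : 0.0f);  // bias addition enabled by this PR
    }
  }
}
```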
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| matmul_4bits_test.cc | Added test case for WebGPU with bias support |
| subgroup_matrix_matmul_nbits_apple.wgsl.template | New template file consolidating Apple shader generation with bias support |
| subgroup_matrix_matmul_nbits.h | Added has_bias and weight_idx parameters to program interface |
| subgroup_matrix_matmul_nbits.cc | Refactored to use template system and added bias/weight_index support |
| matmul_nbits_zero_pt.wgsl.template | Added has_bias parameter declaration |
| matmul_nbits_wide_tile.wgsl.template | Implemented bias addition and weight offset calculations |
| matmul_nbits.h | Added has_bias parameter and exposed ApplyMatMulNBits function |
| matmul_nbits.cc | Removed bias constraint, added ApplyMatMulNBits function with extensive documentation |
| dp4a_matmul_small_m.wgsl.template | Added bias support with offset calculations |
| dp4a_matmul_nbits.h | Added has_bias and weight_index parameters to program interfaces |
| dp4a_matmul_nbits.cc | Integrated bias support across DP4A implementations |
| dp4a_matmul_common.wgsl.template | Added has_bias parameter declaration |
| dp4a_matmul.wgsl.template | Implemented bias addition with vectorized operations |
In the current change, it seems that ApplyMatMulNBits computes a against a single weight index of b. For the QMoE case, that only computes one expert. I remember you said you need 4 selected experts, so will ApplyMatMulNBits be called 4 times to get the up-projection results? Why not generate the result directly by calling ApplyMatMulNBits once?
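To make the question concrete, here is a hedged sketch of the per-expert calling pattern being asked about. The stand-in types and the function signature are hypothetical, not the real ApplyMatMulNBits declaration; the point is only that each call passes a different weight_index into the stacked weights.

```cpp
// Hypothetical sketch: one matmul call per selected expert.
#include <array>
#include <cstdint>
#include <iostream>

struct Tensor {};          // stand-ins for the real onnxruntime types
struct ComputeContext {};

// Stub standing in for ApplyMatMulNBits; the real one dispatches a WebGPU program.
bool ApplyMatMulNBitsSketch(const Tensor& /*a*/, const Tensor& /*stacked_b*/,
                            const Tensor& /*scales*/, const Tensor* /*bias*/,
                            uint32_t weight_index, ComputeContext& /*ctx*/, Tensor& /*y*/) {
  std::cout << "matmul for expert " << weight_index << "\n";
  return true;
}

int main() {
  Tensor a, b, scales, y0, y1, y2, y3;
  ComputeContext ctx;
  std::array<Tensor*, 4> outputs{&y0, &y1, &y2, &y3};
  std::array<uint32_t, 4> top_k_experts{5, 12, 20, 31};  // 4 selected of 32 experts
  for (size_t i = 0; i < outputs.size(); ++i) {
    // weight_index selects the expert's slice of the stacked weights; bias may be null.
    ApplyMatMulNBitsSketch(a, b, scales, /*bias=*/nullptr, top_k_experts[i], ctx, *outputs[i]);
  }
}
```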
  let zero = mm_read_zero(0, 0, uniforms.N, uniforms.zero_blocks_per_col);
- let own_scale_b = scales_b.getByOffset(0);
+ let b_scale_offset = uniforms.weight_idx * uniforms.N * (uniforms.K / uniforms.block_size);
+ let own_scale_b = scales_b.getByOffset(b_scale_offset);
I think for single_scale_weights, you can directly use let own_scale_b = scales_b.getByOffset(uniforms.weight_idx);.
done
 * @param accuracy_level Accuracy level influencing the choice of optimized kernel.
 * @param nbits Number of bits used for quantization.
 * @param context Compute context for WebGPU, providing device-specific information and execution facilities.
 * @param y Pointer to the output tensor that will hold the result.
Please add a description for the weight_index parameter, which specifies which batch index of b participates in the calculation.
done
 *
 * @param a Pointer to the left-hand side (activation) tensor.
 * @param b Pointer to the quantized weight tensor.
 * @param scales Pointer to the tensor containing scaling factors for quantization.
I assume b's shape will be (weight_batch, N, k_blocks, blob_size) instead of (N, k_blocks, blob_size). For the MatMulNBits operator, weight_batch is 1, and for custom scenarios like QMoE, weight_batch is num_experts. So it would be good to have some description of this parameter in case others modify this file in the future.
Similarly for scales, its shape is (weight_batch, N) instead of (N).
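As an aside, a hedged illustration of the element offsets a stacked layout implies (the helper names here are mine, not onnxruntime's). The scale-offset formula mirrors b_scale_offset in the dp4a template above; the b-slice offset follows from the (weight_batch, N, k_blocks, blob_size) layout described in this comment.

```cpp
// Illustrative offset arithmetic for stacked weights; not library code.
#include <cstdint>
#include <iostream>

// Start of slice `weight_index` inside b laid out as (weight_batch, N, k_blocks, blob_size).
uint64_t BSliceOffset(uint64_t weight_index, uint64_t N, uint64_t k_blocks, uint64_t blob_size) {
  return weight_index * N * k_blocks * blob_size;
}

// Matches uniforms.weight_idx * uniforms.N * (uniforms.K / uniforms.block_size) in the template.
uint64_t ScaleSliceOffset(uint64_t weight_index, uint64_t N, uint64_t K, uint64_t block_size) {
  return weight_index * N * (K / block_size);
}

int main() {
  // Example: 4-bit weights, N = 2048, K = 1024, block_size = 32, blob_size = 16 bytes.
  const uint64_t k_blocks = 1024 / 32;
  std::cout << BSliceOffset(2, 2048, k_blocks, 16) << " "
            << ScaleSliceOffset(2, 2048, 1024, 32) << "\n";
}
```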
 * @param a Pointer to the left-hand side (activation) tensor.
 * @param b Pointer to the quantized weight tensor.
 * @param scales Pointer to the tensor containing scaling factors for quantization.
 * @param zero_points Pointer to the zero-point tensor for quantization; must be of type uint8 if provided.
For zero points, weight_index is currently not used, since QMoE uses symmetric quantization. Please add a comment here that weight_batch is not supported in zero_points.
done.
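A hedged sketch of the constraint being documented (the function name is illustrative, not part of the PR): zero_points carries no weight_batch dimension, so a stacked-weight call only makes sense with symmetric quantization.

```cpp
// Illustrative guard only, not library code.
#include <cstdint>

bool WeightIndexCompatibleWithZeroPoints(uint32_t weight_index, const void* zero_points) {
  // weight_batch is not supported in zero_points; QMoE uses symmetric quantization,
  // so zero_points must be null whenever weight_index addresses a stacked slice.
  return weight_index == 0 || zero_points == nullptr;
}
```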
4});
}
if (has_bias) {
  program.AddInput({bias, ProgramTensorMetadataDependency::TypeAndRank});
Suggested change:
- program.AddInput({bias, ProgramTensorMetadataDependency::TypeAndRank});
+ program.AddInput({bias, ProgramTensorMetadataDependency::None});
done
program.AddInput({zero_points, ProgramTensorMetadataDependency::None, {(zero_points->Shape().Size() + 3) / 4}, 4});
}
if (has_bias) {
  program.AddInput({bias, ProgramTensorMetadataDependency::TypeAndRank});
Suggested change:
- program.AddInput({bias, ProgramTensorMetadataDependency::TypeAndRank});
+ program.AddInput({bias, ProgramTensorMetadataDependency::None});
done
#if !single_scale_weights
  let block_idx = (kidx + idx * elements_in_value_b) / uniforms.block_size;
- let scale_b = scales_b.getByOffset(b_global * uniforms.blocks_per_col + block_idx);
+ let scale_b = scales_b.getByOffset(b_global * uniforms.blocks_per_col + block_idx + b_scale_offset);
Please also update scale_b in the single_scale_weights path at line 48 (let scale_b = scales_b.getByOffset(uniforms.weight_idx);).
done
}

let b_value = b.getByOffset(b_global*uniforms.K16+kidx_v+col);
let b_weight_offset = uniforms.weight_idx * uniforms.N * uniforms.K16;
nit: Since b_weight_offset and b_scale_offset are constant values, would it be better to calculate them on the CPU and write them into uniforms? In the shader, we could then always read them from the uniforms.
We could do that, but I intentionally did not because there are 32 experts, so we'd need to compile 32 shaders.
But I was thinking that for weight_idx == 0 I could use a #if in the template and 'const b_weight_offset = 0', so everything that is not QMoE would benefit from the const, and for QMoE we'd only need to compile 2 shaders.
I made a change to use a const for the weight_idx-related offsets when weight_idx == 0, so only QMoE takes a tiny hit for the weight_idx.
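A hedged host-side sketch of the two-variant approach described here (plain C++ string generation, not the actual template/ProgramBase machinery): when weight_idx == 0 the offsets become WGSL consts, otherwise they are derived from the uniforms, so only the QMoE path compiles the second variant. The emitted expressions match the offsets quoted from the templates above.

```cpp
// Illustrative codegen sketch; the real change is done with #if in the WGSL template.
#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>

std::string EmitWeightOffsets(uint32_t weight_index) {
  std::ostringstream wgsl;
  if (weight_index == 0) {
    // Non-QMoE callers: compile-time constants, no extra uniform reads, one shader variant.
    wgsl << "const b_weight_offset = 0u;\n"
            "const b_scale_offset = 0u;\n";
  } else {
    // QMoE: a second shader variant that derives the offsets from uniforms at runtime.
    wgsl << "let b_weight_offset = uniforms.weight_idx * uniforms.N * uniforms.K16;\n"
            "let b_scale_offset = uniforms.weight_idx * uniforms.N * (uniforms.K / uniforms.block_size);\n";
  }
  return wgsl.str();
}

int main() {
  std::cout << EmitWeightOffsets(0) << "---\n" << EmitWeightOffsets(3);
}
```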
The current code changes are clean. No more questions for the nbitmm.
If you have, say, 1000 tokens, it is most likely that we need to run all experts; for generation we'd run 4.
Does that mean that for generation we can do a specific optimization to calculate the four experts in one ApplyMatMulNBits call?
#endif
#else
#if has_bias
  // TODO: wanted to use vec4 for bias but for some reason that fails ut. Later.
To use vec4 for the bias, you need to make sure N % 4 == 0, or it will be very complicated to rearrange the data to get the correct vec4 values.
we can sit it out
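For the record, a hedged sketch (plain C++, illustrative only, not the kernel code) of the constraint mentioned above: a vec4 bias load covers four output columns at a time, so it is only safe when N % 4 == 0; otherwise the scalar path is the simple fallback.

```cpp
// Illustration of the N % 4 == 0 requirement for vectorized bias loads.
#include <cstdint>
#include <vector>

void AddBiasScalar(std::vector<float>& y_row, const std::vector<float>& bias) {
  for (size_t n = 0; n < y_row.size(); ++n) {
    y_row[n] += bias[n];  // safe for any N
  }
}

bool CanUseVec4Bias(uint32_t N) {
  // A vec4 load reads bias[4*i .. 4*i+3]; if N is not a multiple of 4 the last group
  // straddles the end of the tensor unless the data is padded or re-arranged.
  return N % 4 == 0;
}
```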
Add support for bias and weight_index, move subgroup_matrix_matmul_nbits to a template, and make the program callable from other ops.