Use subgroup operations when possible #553

Open
beaufortfrancois opened this issue Aug 20, 2024 · 18 comments

@beaufortfrancois (Contributor) commented Aug 20, 2024

Subgroups can substantially enhance performance and adaptability for machine learning tasks on GPUs. Since they're now available as an origin trial, https://webllm.mlc.ai/ could take advantage of them.

I'm not sure yet what is needed to make this work... I assume some work in Apache TVM as well.

I highly recommend you check out the quick-start guide at https://developer.chrome.com/blog/new-in-webgpu-128#experimenting_with_subgroups. For info, only subgroupBallot and subgroupBroadcast are available for now, but more built-in functions such as subgroupAdd, subgroupAll, subgroupElect, and subgroupShuffle will be added in the near future.
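
A minimal sketch of what these two built-ins look like in a compute shader, assuming a device requested with the "subgroups" feature and the WGSL enable directive described in the quick-start guide; the buffer and variable names are illustrative, not from WebLLM or TVM:

const module = device.createShaderModule({
  code: /* wgsl */ `
    enable subgroups;

    @group(0) @binding(0) var<storage, read_write> results : array<u32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) gid : vec3u,
            @builtin(subgroup_invocation_id) sid : u32) {
      // Bitmask of the invocations currently active in this subgroup.
      let active = subgroupBallot(true);
      // Value held by lane 0 of the subgroup, broadcast to every lane.
      let fromLaneZero = subgroupBroadcast(sid, 0);
      results[gid.x] = active.x + fromLaneZero;
    }`,
});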

@beaufortfrancois (Contributor Author)

@CharlieFRuan @tqchen What are your thoughts on this?

@tqchen (Contributor) commented Sep 3, 2024

This is great; subgroup shuffle can be useful for reduction operations. We already have warp shuffle support for the Metal backend, so maybe we can try adding codegen support for WebGPU.

@beaufortfrancois (Contributor Author)

The following subgroup shuffle functions are actually available in Chrome 129 (currently beta); a reduction sketch using them follows the list:

  • subgroupShuffle(value, id): Returns value from the active invocation whose subgroup_invocation_id matches id.
  • subgroupShuffleXor(value, mask): Returns value from the active invocation whose subgroup_invocation_id matches subgroup_invocation_id ^ mask. mask must be dynamically uniform.
  • subgroupShuffleUp(value, delta): Returns value from the active invocation whose subgroup_invocation_id matches subgroup_invocation_id - delta.
  • subgroupShuffleDown(value, delta): Returns value from the active invocation whose subgroup_invocation_id matches subgroup_invocation_id + delta.
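
These shuffles map naturally onto the reduction use case mentioned above. A minimal sketch, assuming a subgroup size of 32 and a device requested with the "subgroups" feature; the names are illustrative, not WebLLM or TVM output:

const module = device.createShaderModule({
  code: /* wgsl */ `
    enable subgroups;

    @group(0) @binding(0) var<storage, read_write> data : array<f32>;

    @compute @workgroup_size(32)
    fn reduce(@builtin(global_invocation_id) gid : vec3u,
              @builtin(subgroup_invocation_id) sid : u32) {
      var sum = data[gid.x];
      // Butterfly reduction: after log2(32) = 5 steps, every lane holds the subgroup total.
      for (var offset = 16u; offset > 0u; offset = offset >> 1u) {
        sum = sum + subgroupShuffleXor(sum, offset);
      }
      if (sid == 0u) {
        data[gid.x] = sum;
      }
    }`,
});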

@beaufortfrancois (Contributor Author)

@tqchen @CharlieFRuan Is this being implemented in Apache TVM?

@CharlieFRuan (Contributor)

Hi @beaufortfrancois! Really appreciate the info and suggestions! We think it's a good idea to have it implemented in the TVM flow. Unfortunately, we're a bit short on bandwidth right now. We'll revisit it in the future!

@beaufortfrancois (Contributor Author)

According to https://groups.google.com/a/chromium.org/g/blink-dev/c/xteMk_tObgI/m/7wt8sloPDAAJ, Chrome is planning to ship subgroups in Chrome 134 (March 4th, 2025). This would be a great time to support them in WebLLM. What do you think, folks?

@beaufortfrancois (Contributor Author)

(gentle ping) @tqchen @CharlieFRuan

@CharlieFRuan (Contributor)

Thanks for the info! Quick question: do all devices support subgroup ops? Or is it a device-dependent thing? Ideally, we only want to host a single set of WGSL kernels for each model, so each user downloads the same .wasm, since WebLLM does ahead-of-time compilation.

@beaufortfrancois (Contributor Author)

As you can see in https://source.chromium.org/search?q=%22EnableFeature(Feature::subgroups)%22%20f:%5Ethird_party%2Fdawn%2F&ss=chromium, not all devices support subgroups, so you need to take this into account:

const adapter = await navigator.gpu.requestAdapter();
if (!adapter || !adapter.features.has("subgroups")) {
  throw new Error("Subgroups support is not available");
}
// Explicitly request subgroups support.
const device = await adapter.requestDevice({
  requiredFeatures: ["subgroups"],
});

@beaufortfrancois (Contributor Author)

@CharlieFRuan Does the answer above prevent you from using subgroups in WebLLM eventually?

@CharlieFRuan (Contributor)

I'll try to support shuffle for reduction operations in TVM's WebGPU codegen this week and next. One possibility is to compile two sets of kernels for each model: one for performant devices and one as a fallback (the current kernels).

@beaufortfrancois (Contributor Author)

That's great to hear! Please share the Apache TVM issue or PR when they're available so that we can keep track of your work and help if needed.

@beaufortfrancois (Contributor Author)

@CharlieFRuan Did you have a chance to start working on it?

@CharlieFRuan (Contributor)

Yes, I hope to get a version by the end of this week if everything goes well.

@CharlieFRuan (Contributor) commented Mar 3, 2025

Hi @beaufortfrancois! I was able to get an initial version done in TVM: apache/tvm#17699

The PR description includes what is done and not done, and the dumped compiled kernel. The remaining part is mainly about UI, since not all WebGPU devices support subgroups. For WebLLM, I'd need to compile two sets of kernels: one for devices that support subgroups (and other performant features in the future, e.g. a high maxComputeInvocationsPerWorkgroup), and one fallback set (the current kernels).

One question I have is: what devices typically have a high maxComputeInvocationsPerWorkgroup? My M3 laptop has 1k, but IIRC it used to be 256 before. Any pointers would be helpful. I am considering what value to set for the more performant set of WebLLM kernels. Another note is that I always use 32 for the subgroup size, same as what TVM does for Metal and CUDA.

So, as of now, the performant set of WebGPU kernels will require WebGPU devices that:

  • Support maxComputeInvocationsPerWorkgroup = 1024
  • Support subgroups
  • Support subgroup_size = 32

Edit: the benchmark seems to fluctuate quite a bit; I need more rigorous benchmarking to see the speedup from subgroups, another TODO. The E2E decode speedup (in tokens per second) is typically around 1.03x from my observations.

@beaufortfrancois (Contributor Author)

This is great news @CharlieFRuan! Thanks for sharing!

FYI, I'm adding the subgroups feature to @webgpu/types in gpuweb/types#167, which should help with https://github.com/apache/tvm/pull/17699/files#diff-cb3572240c47c4c62eaa4cc0e1e0cd15f88ae4c4222de860e9f63b01dc000090R113


> One question I have is, typically what devices have a high maxComputeInvocationsPerWorkgroup? My M3 laptop has 1k, but IIRC it used to be 256 before. Any pointer would be helpful.

My understanding is that Chromium's maxComputeInvocationsPerWorkgroup limit varies across machines, with values of 128, 256, or 1024, based on the device's performance tier. (source)

// Tiers for limits related to workgroup size.
// TODO(crbug.com/dawn/685): Define these better. For now, use two tiers where one
// is available on nearly all desktop platforms.
//                                                             compat        tier0       tier1
#define LIMITS_WORKGROUP_SIZE(X)                                                                \
    X(Maximum,           maxComputeInvocationsPerWorkgroup,       128,         256,       1024) \

> Another note is that I always use 32 for the subgroup size, same as what TVM does to Metal and CUDA.

I strongly suggest you always check the subgroupMinSize and subgroupMaxSize GPU adapter info instead of assuming 32, even though it may work for now. See this CL for more about their values in each backend.
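
For example, a minimal sketch of reading those values, assuming the adapter info attribute that exposes them; the variable names are illustrative:

const adapter = await navigator.gpu.requestAdapter();
if (adapter?.features.has("subgroups")) {
  const { subgroupMinSize, subgroupMaxSize } = adapter.info;
  // Only pick kernels compiled for a fixed subgroup size of 32 when the
  // device can actually run at that size.
  const canUseSize32 = subgroupMinSize <= 32 && 32 <= subgroupMaxSize;
  console.log({ subgroupMinSize, subgroupMaxSize, canUseSize32 });
}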


I'm looking forward to more of your benchmarks!

@dneto0 commented Mar 3, 2025

The WebGPU maxComputeInvocationsPerWorkgroup limit corresponds to Vulkan's maxComputeWorkGroupInvocations limit.

See Sascha Willems' gpuinfo.org for this data on Vulkan: https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupInvocations&platform=all
That counts distinct GPUs reported and is not weighted by number of users, but it's still pretty comprehensive.

1K is very common.

I would be more concerned about subgroup sizing tiers. My folk knowledge is that 32 is ideal for NVIDIA, 64 is better for AMD, and 16 or 32 is good for Intel (Intel supports 8, 16, and 32, but the compiler tends to choose).

But picking 32 outright is a great start.

@CharlieFRuan (Contributor)

Thanks @beaufortfrancois @dneto0 for the insights and pointers, super helpful!

> 1K is very common.

I see, the link is quite insightful. I'll go with 1k for the performant set of WebLLM's WebGPU kernels.

> I would be more concerned about subgroup sizing tiers. My folk knowledge is 32 is ideal for NVIDIA; 64 better for AMD, 16 or 32 is good for Intel (Intel supports 8, 16, and 32, but the compiler tends to choose).

CUDA 32, AMD 64, and Metal 32 are aligned with the values in TVM. Since WebLLM hosts pre-compiled kernels, and I'm not sure whether using a dynamic subgroup size when compiling with TVM is a good idea, I think I'll go with 32 for now (it seems to be the most widely accepted value).

The main concern is that it may create too much complication to host a plethora of pre-compiled WebGPU kernels for WebLLM for different {subgroup_size} x {maxInvocations} x ... combinations. I think hosting two sets of kernels (performant and fallback) is a good starting point: if the device does not support a subgroup size of 32 (according to subgroupMinSize and subgroupMaxSize), it will use the fallback kernels.
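
A minimal sketch of that selection logic; the loadKernels() helper and the kernel-set names are hypothetical (not a WebLLM API), and the 1024 / 32 thresholds come from this thread:

async function pickKernelSet() {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU is not available");

  const hasSubgroups = adapter.features.has("subgroups");
  const bigWorkgroups =
    adapter.limits.maxComputeInvocationsPerWorkgroup >= 1024;
  const size32 = hasSubgroups &&
    adapter.info.subgroupMinSize <= 32 && 32 <= adapter.info.subgroupMaxSize;

  if (hasSubgroups && bigWorkgroups && size32) {
    // Performant kernels: subgroups, subgroup size 32, 1024 invocations per workgroup.
    const device = await adapter.requestDevice({
      requiredFeatures: ["subgroups"],
      requiredLimits: { maxComputeInvocationsPerWorkgroup: 1024 },
    });
    return loadKernels("performant", device); // hypothetical helper
  }
  // Otherwise, fall back to the current kernels.
  const device = await adapter.requestDevice();
  return loadKernels("fallback", device); // hypothetical helper
}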
