Use subgroup operations when possible #553
Comments
@CharlieFRuan @tqchen What are your thoughts on this?
This is great; subgroup shuffle can be useful for reduction operations. We do have warp shuffle support for the Metal backend, so maybe we can try adding codegen support for the WebGPU backend.
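To make the idea concrete, here is a minimal sketch of what a subgroup-shuffle sum reduction could look like in a WGSL kernel (embedded as a JS string, the way pre-compiled WebGPU kernels are typically shipped). This is an editor's illustration assuming the `subgroups` WGSL extension, not TVM's generated code; the buffer names and workgroup size are made up:

```js
// Hypothetical sketch of a shuffle-based sum reduction. Not TVM output;
// names are illustrative.
const reduceWGSL = /* wgsl */ `
enable subgroups;

@group(0) @binding(0) var<storage, read> input : array<u32>;
@group(0) @binding(1) var<storage, read_write> total : atomic<u32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>,
        @builtin(subgroup_invocation_id) lane : u32,
        @builtin(subgroup_size) sg_size : u32) {
  // Assumes the input length is a multiple of the workgroup size.
  var v = input[gid.x];
  // Tree reduction: each step adds the value held by the lane \`offset\`
  // positions higher, so lane 0 accumulates the whole subgroup's sum
  // without touching workgroup shared memory.
  for (var offset = sg_size / 2u; offset > 0u; offset = offset / 2u) {
    v = v + subgroupShuffleDown(v, offset);
  }
  if (lane == 0u) {
    atomicAdd(&total, v); // one write per subgroup
  }
}
`;
```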
The following subgroup shuffle functions are actually in Chrome 129 (currently beta): subgroupShuffle, subgroupShuffleUp, subgroupShuffleDown, and subgroupShuffleXor.
@tqchen @CharlieFRuan Is this being implemented in Apache TVM?
Hi @beaufortfrancois, really appreciate the info and suggestions! We think it is a good idea to have it implemented in the TVM flow. Unfortunately, we are a bit out of bandwidth as of now. We'll revisit in the future!
According to https://groups.google.com/a/chromium.org/g/blink-dev/c/xteMk_tObgI/m/7wt8sloPDAAJ, Chrome is planning to ship subgroups in Chrome 134 (March 4th, 2025). This would be a great time to support them in WebLLM. What do you think, folks?
(gentle ping) @tqchen @CharlieFRuan
Thanks for the info! Quick question: do all devices support subgroup ops? Or is it a device-dependent thing? Ideally, we only want to host a single set of WGSL kernels for each model, so each user downloads the same kernels.
As you can see in https://source.chromium.org/search?q=%22EnableFeature(Feature::subgroups)%22%20f:%5Ethird_party%2Fdawn%2F&ss=chromium, not all devices support subgroups, and you need to take this into account:

```js
const adapter = await navigator.gpu.requestAdapter();
if (!adapter.features.has("subgroups")) {
  throw new Error("Subgroups support is not available");
}
// Explicitly request subgroups support.
const device = await adapter.requestDevice({
  requiredFeatures: ["subgroups"],
});
```
@CharlieFRuan Does the answer above prevent you from using subgroups in WebLLM eventually?
I'll try to support shuffle for reduction operations in TVM's WebGPU backend this week and next. One possibility is to compile two sets of kernels for each model: one for performant devices, and one as a fallback (the current kernels).
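As a rough sketch of what that runtime selection could look like (an editor's illustration; the "performant"/"fallback" bundle names are hypothetical, not WebLLM's actual layout):

```js
// Hypothetical sketch: choose between two pre-compiled kernel sets based
// on adapter capabilities. Bundle names here are made up for illustration.
async function pickKernelSet() {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU is not available");

  const hasSubgroups = adapter.features.has("subgroups");
  const device = await adapter.requestDevice({
    // Only request the feature when the adapter actually exposes it;
    // requesting an unsupported feature makes requestDevice() reject.
    requiredFeatures: hasSubgroups ? ["subgroups"] : [],
  });

  return { device, kernelSet: hasSubgroups ? "performant" : "fallback" };
}
```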
That's great to hear! Please share the Apache TVM issue or PR when they're available so that we can keep track of your work and help if needed.
@CharlieFRuan Did you have a chance to start working on it?
Yes, I hope to get a version by the end of this week if everything goes well.
Hi @beaufortfrancois! I was able to get an initial version done in TVM: apache/tvm#17699. The PR description includes what is done and not done, and the dumped compiled kernel. The remaining part is mainly about UI, since not all WebGPU devices support subgroups. One question I have: typically, what devices have a high maxComputeInvocationsPerWorkgroup? As of now, the performant set of WebGPU kernels will require devices to support both subgroups and a higher maxComputeInvocationsPerWorkgroup. The E2E decode speedup (in tokens per second) is typically around 1.03x from my observation.

Edit: the benchmark seems to fluctuate quite a bit; I need more rigorous benchmarking to see the speedup from subgroups, another TODO.
This is great news @CharlieFRuan! Thanks for sharing! FYI, I'm adding the subgroups feature to @webgpu/types at gpuweb/types#167, which should help with https://github.com/apache/tvm/pull/17699/files#diff-cb3572240c47c4c62eaa4cc0e1e0cd15f88ae4c4222de860e9f63b01dc000090R113

My understanding is that Chromium's maxComputeInvocationsPerWorkgroup tiers are defined as follows:

```cpp
// Tiers for limits related to workgroup size.
// TODO(crbug.com/dawn/685): Define these better. For now, use two tiers where one
// is available on nearly all desktop platforms.
//                                                  compat tier0 tier1
#define LIMITS_WORKGROUP_SIZE(X) \
    X(Maximum, maxComputeInvocationsPerWorkgroup,      128,  256, 1024) \
```

I strongly suggest you always check maxComputeInvocationsPerWorkgroup. I'm looking forward to more of your benchmarks!
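A note on acting on that check: the adapter's limit also has to be requested explicitly, since devices are otherwise created with the spec default of 256. A minimal sketch, assuming the Dawn tiers quoted above:

```js
// Sketch: read the adapter's actual limit and request it explicitly.
// Without requiredLimits, the device comes up with the default (256)
// even when the adapter advertises the 1024 tier.
const adapter = await navigator.gpu.requestAdapter();
const maxInvocations = adapter.limits.maxComputeInvocationsPerWorkgroup;

const device = await adapter.requestDevice({
  requiredLimits: { maxComputeInvocationsPerWorkgroup: maxInvocations },
});

// Kernels compiled for @workgroup_size(1024) can only run when
// maxInvocations >= 1024; otherwise serve the portable fallback set.
```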
The WebGPU maxComputeInvocationsPerWorkgroup limit corresponds to Vulkan's maxComputeWorkGroupInvocations. See Sascha Willems' GPUinfo.org for this data on Vulkan: https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupInvocations&platform=all. 1K is very common.

I would be more concerned about subgroup sizing tiers. My folk knowledge is that 32 is ideal for NVIDIA, 64 is better for AMD, and 16 or 32 is good for Intel (Intel supports 8, 16, and 32, but the compiler tends to choose). But picking 32 outright is a great start.
Thanks @beaufortfrancois @dneto0 for the insights and pointers, super helpful!
I see, the link is quite insightful. I'll go with 1k for the performant set of WebLLM's WebGPU kernels.
CUDA 32, AMD 64, and Metal 32 are aligned with the values in TVM. Since WebLLM hosts pre-compiled kernels, and I am not sure whether using a dynamic subgroup size value when compiling with TVM is a good idea, I think I'll go with 32 for now (it seems to be the more widely accepted value). The main thing is that it may create too much complication to host a plethora of pre-compiled WebGPU kernels for WebLLM for different subgroup sizes.
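One runtime check that may help with a hard-coded 32: with the subgroups feature, the adapter reports the subgroup size range it supports. A minimal sketch, assuming the subgroupMinSize/subgroupMaxSize fields on GPUAdapterInfo that shipped alongside subgroups (worth verifying against current Chrome):

```js
// Sketch: verify that kernels tuned for 32-wide subgroups are a safe bet
// on this device before serving the performant kernel set.
const adapter = await navigator.gpu.requestAdapter();
if (adapter && adapter.features.has("subgroups")) {
  const { subgroupMinSize, subgroupMaxSize } = adapter.info;
  // e.g. per the folk knowledge above: Intel may report 8..32, AMD 32..64.
  const tunedFor32 = subgroupMinSize <= 32 && 32 <= subgroupMaxSize;
  console.log(`subgroup sizes ${subgroupMinSize}..${subgroupMaxSize}:`,
              tunedFor32 ? "32-wide kernels OK" : "use fallback kernels");
}
```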
Subgroups can substantially enhance performance and adaptability for machine learning tasks on GPUs. Since they're now available as an origin trial, https://webllm.mlc.ai/ could take advantage of them.
I'm not sure what is needed yet to make it work; I assume some work in Apache TVM as well.
I highly recommend you check out the quick-start guide at https://developer.chrome.com/blog/new-in-webgpu-128#experimenting_with_subgroups. For info, only subgroupBallot and subgroupBroadcast are there for now, but more built-in functions such as subgroupAdd, subgroupAll, subgroupElect, and subgroupShuffle will be added in the near future.