Use subgroup operations when possible #553

Open
beaufortfrancois opened this issue Aug 20, 2024 · 18 comments

@beaufortfrancois (Contributor) commented Aug 20, 2024

Subgroups can substantially enhance performance and adaptability for machine learning tasks on GPUs. Since they're now available as an origin trial, https://webllm.mlc.ai/ could take advantage of them.

I'm not sure yet what is needed to make this work... I assume some work in Apache TVM as well.

I highly recommend you check out the quick-start guide at https://developer.chrome.com/blog/new-in-webgpu-128#experimenting_with_subgroups. For info, only subgroupBallot and subgroupBroadcast are available for now, but more built-in functions such as subgroupAdd, subgroupAll, subgroupElect, and subgroupShuffle will be added in the near future.
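
A minimal sketch of what these two built-ins look like in a compute shader, assuming a device requested with the "subgroups" feature and the WGSL enable directive described in the quick-start guide; the buffer and variable names are illustrative, not from WebLLM or TVM:

const module = device.createShaderModule({
  code: /* wgsl */ `
    enable subgroups;

    @group(0) @binding(0) var<storage, read_write> results : array<u32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) gid : vec3u,
            @builtin(subgroup_invocation_id) sid : u32) {
      // Bitmask of the invocations currently active in this subgroup.
      let active = subgroupBallot(true);
      // Value held by lane 0 of the subgroup, broadcast to every lane.
      let fromLaneZero = subgroupBroadcast(sid, 0);
      results[gid.x] = active.x + fromLaneZero;
    }`,
});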

@beaufortfrancois (Contributor Author)

@CharlieFRuan @tqchen What are your thoughts on this?

@tqchen (Contributor) commented Sep 3, 2024

This is great; subgroup shuffle can be useful for reduction operations. We already have warp shuffle support for the Metal backend, so maybe we can try adding codegen support for WebGPU.

@beaufortfrancois (Contributor Author)

The following subgroup shuffle functions are actually available in Chrome 129 (currently beta); a reduction sketch using them follows the list:

  • subgroupShuffle(value, id): Returns value from the active invocation whose subgroup_invocation_id matches id.
  • subgroupShuffleXor(value, mask): Returns value from the active invocation whose subgroup_invocation_id matches subgroup_invocation_id ^ mask. mask must be dynamically uniform.
  • subgroupShuffleUp(value, delta): Returns value from the active invocation whose subgroup_invocation_id matches subgroup_invocation_id - delta.
  • subgroupShuffleDown(value, delta): Returns value from the active invocation whose subgroup_invocation_id matches subgroup_invocation_id + delta.
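
These shuffles map naturally onto the reduction use case mentioned above. A minimal sketch, assuming a subgroup size of 32 and a device requested with the "subgroups" feature; the names are illustrative, not WebLLM or TVM output:

const module = device.createShaderModule({
  code: /* wgsl */ `
    enable subgroups;

    @group(0) @binding(0) var<storage, read_write> data : array<f32>;

    @compute @workgroup_size(32)
    fn reduce(@builtin(global_invocation_id) gid : vec3u,
              @builtin(subgroup_invocation_id) sid : u32) {
      var sum = data[gid.x];
      // Butterfly reduction: after log2(32) = 5 steps, every lane holds the subgroup total.
      for (var offset = 16u; offset > 0u; offset = offset >> 1u) {
        sum = sum + subgroupShuffleXor(sum, offset);
      }
      if (sid == 0u) {
        data[gid.x] = sum;
      }
    }`,
});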

@beaufortfrancois (Contributor Author)

@tqchen @CharlieFRuan Is this being implemented in Apache TVM?

@CharlieFRuan (Contributor)

Hi @beaufortfrancois! Really appreciate the info and suggestions! We think it's a good idea to have it implemented in the TVM flow. Unfortunately, we're a bit short on bandwidth right now. We'll revisit it in the future!

@beaufortfrancois (Contributor Author)

According to https://groups.google.com/a/chromium.org/g/blink-dev/c/xteMk_tObgI/m/7wt8sloPDAAJ, Chrome is planning to ship subgroups in Chrome 134 (March 4th, 2025). This would be a great time to support them in WebLLM. What do you think, folks?

@beaufortfrancois (Contributor Author)

(gentle ping) @tqchen @CharlieFRuan

@CharlieFRuan (Contributor)

Thanks for the info! Quick question: do all devices support subgroup ops? Or is it a device-dependent thing? Ideally, we only want to host a single set of WGSL kernels for each model, so each user downloads the same .wasm, since WebLLM does ahead-of-time compilation.

@beaufortfrancois (Contributor Author)

As you can see in https://source.chromium.org/search?q=%22EnableFeature(Feature::subgroups)%22%20f:%5Ethird_party%2Fdawn%2F&ss=chromium, not all devices support subgroups, so you need to take this into account:

const adapter = await navigator.gpu.requestAdapter();
if (!adapter || !adapter.features.has("subgroups")) {
  throw new Error("Subgroups support is not available");
}
// Explicitly request subgroups support.
const device = await adapter.requestDevice({
  requiredFeatures: ["subgroups"],
});

@beaufortfrancois (Contributor Author)

@CharlieFRuan Does the answer above prevent you from using subgroups in WebLLM eventually?

@CharlieFRuan (Contributor)

I'll try to support shuffle for reduction operations in TVM's WebGPU codegen this week and next. One possibility is to compile two sets of kernels for each model: one for performant devices and one as a fallback (the current kernels).

@beaufortfrancois (Contributor Author)

That's great to hear! Please share the Apache TVM issue or PR when they're available so that we can keep track of your work and help if needed.

@beaufortfrancois (Contributor Author)

@CharlieFRuan Did you have a chance to start working on it?

@CharlieFRuan (Contributor)

Yes, I hope to get a version by the end of this week if everything goes well.

@CharlieFRuan (Contributor) commented Mar 3, 2025

Hi @beaufortfrancois! I was able to get an initial version done in TVM: apache/tvm#17699

The PR description includes what is done and not done, and the dumped compiled kernel. The remaining part is mainly about UI, since not all WebGPU devices support subgroups. For WebLLM, I'd need to compile two sets of kernels: one for devices that support subgroups (and other performant features in the future, e.g. a high maxComputeInvocationsPerWorkgroup), and one fallback set (the current kernels).

One question I have is: what devices typically have a high maxComputeInvocationsPerWorkgroup? My M3 laptop has 1k, but IIRC it used to be 256 before. Any pointers would be helpful. I am considering what value to set for the more performant set of WebLLM kernels. Another note is that I always use 32 for the subgroup size, same as what TVM does for Metal and CUDA.

So, as of now, the performant set of WebGPU kernels will require WebGPU devices that:

  • Support maxComputeInvocationsPerWorkgroup = 1024
  • Support subgroups
  • Support subgroup_size = 32

Edit: the benchmark seems to fluctuate quite a bit; I need more rigorous benchmarking to see the speedup from subgroups, another TODO. The E2E decode speedup (in tokens per second) is typically around 1.03x from my observations.

@beaufortfrancois (Contributor Author)

This is great news @CharlieFRuan! Thanks for sharing!

FYI, I'm adding the subgroups feature to @webgpu/types in gpuweb/types#167, which should help with https://github.com/apache/tvm/pull/17699/files#diff-cb3572240c47c4c62eaa4cc0e1e0cd15f88ae4c4222de860e9f63b01dc000090R113


> One question I have is, typically what devices have a high maxComputeInvocationsPerWorkgroup? My M3 laptop has 1k, but IIRC it used to be 256 before. Any pointer would be helpful.

My understanding is that Chromium's maxComputeInvocationsPerWorkgroup limit varies across machines, with values of 128, 256, or 1024, based on the device's performance tier. (source)

// Tiers for limits related to workgroup size.
// TODO(crbug.com/dawn/685): Define these better. For now, use two tiers where one
// is available on nearly all desktop platforms.
//                                                             compat        tier0       tier1
#define LIMITS_WORKGROUP_SIZE(X)                                                                \
    X(Maximum,           maxComputeInvocationsPerWorkgroup,       128,         256,       1024) \

> Another note is that I always use 32 for the subgroup size, same as what TVM does to Metal and CUDA.

I strongly suggest you always check the subgroupMinSize and subgroupMaxSize GPU adapter info instead of assuming 32, even though it may work for now. See this CL for more about their values in each backend.
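
For example, a minimal sketch of reading those values, assuming the adapter info attribute that exposes them; the variable names are illustrative:

const adapter = await navigator.gpu.requestAdapter();
if (adapter?.features.has("subgroups")) {
  const { subgroupMinSize, subgroupMaxSize } = adapter.info;
  // Only pick kernels compiled for a fixed subgroup size of 32 when the
  // device can actually run at that size.
  const canUseSize32 = subgroupMinSize <= 32 && 32 <= subgroupMaxSize;
  console.log({ subgroupMinSize, subgroupMaxSize, canUseSize32 });
}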


I'm looking forward to more of your benchmarks!

@dneto0 commented Mar 3, 2025

The WebGPU maxComputeInvocationsPerWorkgroup limit corresponds to Vulkan's maxComputeWorkGroupInvocations limit.

See Sascha Willems' gpuinfo.org for this data on Vulkan: https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupInvocations&platform=all
That counts distinct GPUs reported and is not weighted by number of users, but it's still pretty comprehensive.

1K is very common.

I would be more concerned about subgroup sizing tiers. My folk knowledge is that 32 is ideal for NVIDIA, 64 is better for AMD, and 16 or 32 is good for Intel (Intel supports 8, 16, and 32, but the compiler tends to choose).

But picking 32 outright is a great start.

@CharlieFRuan (Contributor)

Thanks @beaufortfrancois @dneto0 for the insights and pointers, super helpful!

> 1K is very common.

I see, the link is quite insightful. I'll go with 1k for the performant set of WebLLM's WebGPU kernels.

> I would be more concerned about subgroup sizing tiers. My folk knowledge is 32 is ideal for NVIDIA; 64 better for AMD, 16 or 32 is good for Intel (Intel supports 8, 16, and 32, but the compiler tends to choose).

CUDA 32, AMD 64, and Metal 32 are aligned with the values in TVM. Since WebLLM hosts pre-compiled kernels, and I'm not sure whether using a dynamic subgroup size when compiling with TVM is a good idea, I think I'll go with 32 for now (it seems to be the most widely accepted value).

The main concern is that it may create too much complication to host a plethora of pre-compiled WebGPU kernels for WebLLM for different {subgroup_size} x {maxInvocations} x ... combinations. I think hosting two sets of kernels (performant and fallback) is a good starting point: if the device does not support a subgroup size of 32 (according to subgroupMinSize and subgroupMaxSize), it will use the fallback kernels.
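
A minimal sketch of that selection logic; the loadKernels() helper and the kernel-set names are hypothetical (not a WebLLM API), and the 1024 / 32 thresholds come from this thread:

async function pickKernelSet() {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU is not available");

  const hasSubgroups = adapter.features.has("subgroups");
  const bigWorkgroups =
    adapter.limits.maxComputeInvocationsPerWorkgroup >= 1024;
  const size32 = hasSubgroups &&
    adapter.info.subgroupMinSize <= 32 && 32 <= adapter.info.subgroupMaxSize;

  if (hasSubgroups && bigWorkgroups && size32) {
    // Performant kernels: subgroups, subgroup size 32, 1024 invocations per workgroup.
    const device = await adapter.requestDevice({
      requiredFeatures: ["subgroups"],
      requiredLimits: { maxComputeInvocationsPerWorkgroup: 1024 },
    });
    return loadKernels("performant", device); // hypothetical helper
  }
  // Otherwise, fall back to the current kernels.
  const device = await adapter.requestDevice();
  return loadKernels("fallback", device); // hypothetical helper
}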
