webgpu: support MultiHeadAttention operator #22144

xhcao · 2024-09-19T06:47:52Z

Description

Motivation and Context

xhcao · 2024-09-19T06:49:08Z

Although I think some of the code in the JS MultiHeadAttention operator is unreasonable, I kept the code in the webgpu EP nearly the same as in the JS EP. Let's adjust and optimize it in the future.
@fs-eire @qjia7

fs-eire · 2024-09-19T20:02:03Z

onnxruntime/contrib_ops/webgpu/bert/multihead_attention.h

+
+class TransferBSDToBNSHProgram final : public Program<TransferBSDToBNSHProgram> {
+ public:
+  TransferBSDToBNSHProgram(const std::string& kernel_name, bool has_bias) : Program{kernel_name}, has_bias_(has_bias) {}


Suggested change

-  TransferBSDToBNSHProgram(const std::string& kernel_name, bool has_bias) : Program{kernel_name}, has_bias_(has_bias) {}
+  TransferBSDToBNSHProgram(bool has_bias) : Program{"TransferBSDToBNSH"}, has_bias_(has_bias) {}

fs-eire · 2024-09-19T20:09:39Z

onnxruntime/contrib_ops/webgpu/bert/multihead_attention.cc

+    shader.AddOutput("present_key", ShaderVariable::UseUniform);
+  }
+
+  shader.AppendImplementation("const TILE_SIZE = ", tile_size_, "u;\n")


It seems that the cache keys for the programs are not set correctly.

For example, tile_size_ is used here as part of the shader source code, but it is not included in the cache key. Use program.CacheHint() to set the cache key.

TILE_SIZE can also be declared as an overridable constant.
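To illustrate why this matters, here is a minimal, self-contained sketch of a program cache. ToyProgram and ToyProgramCache are hypothetical stand-ins written for this comment, not the onnxruntime classes, and the key format is an assumption. The sketch only shows the failure mode: if a value baked into the shader source is missing from the key, a second program with a different tile size silently reuses the first program's source.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a shader program; NOT the onnxruntime Program class.
struct ToyProgram {
  std::string name;
  uint32_t tile_size;
  std::string cache_hint;  // extra text that becomes part of the cache key

  // The generated source depends on tile_size, like the TILE_SIZE line in the diff.
  std::string GenerateSource() const {
    assert(tile_size > 0);
    return "const TILE_SIZE = " + std::to_string(tile_size) + "u;\n";
  }

  // Assumed key format for this sketch: program name plus the hint.
  std::string CacheKey() const { return name + "[" + cache_hint + "]"; }
};

// Hypothetical cache: builds the source once per key and reuses it afterwards.
class ToyProgramCache {
 public:
  const std::string& GetOrBuild(const ToyProgram& program) {
    auto it = cache_.find(program.CacheKey());
    if (it == cache_.end()) {
      it = cache_.emplace(program.CacheKey(), program.GenerateSource()).first;
    }
    return it->second;
  }

 private:
  std::unordered_map<std::string, std::string> cache_;
};
```

With an empty hint, a program built with tile size 8 and one built with tile size 12 share a key, so the second lookup returns the source generated for 8. Putting the tile size into the hint gives each variant its own entry.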

fs-eire · 2024-09-19T20:12:02Z

onnxruntime/contrib_ops/webgpu/bert/multihead_attention.cc

+  shader.AppendImplementation("var<workgroup> thread_max: array<f32, ", work_group_size_, ">;\n")
+        .AppendImplementation("var<workgroup> thread_sum: array<f32, ", work_group_size_, ">;\n");
+
+  std::string f32_str = components_ == 4 ? "vec4<f32>" : (components_ == 2 ? "vec2<f32>" : "f32");


You can use x_value_t for the value type of x.

Whether x's type is f32 or f16, the program uses only f32 to define the max and sum values.
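As a CPU-side illustration of that point, the sketch below keeps the element type generic while the max and sum accumulators stay in float. It assumes the kernel is a softmax-style reduction, which the thread_max and thread_sum names suggest; SoftmaxRowF32Accum is a hypothetical helper written for this comment, not code from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// T plays the role of x's element type (hypothetical stand-in for the shader's
// value type). The max and sum accumulators are always float, whatever T is.
template <typename T>
std::vector<T> SoftmaxRowF32Accum(const std::vector<T>& x) {
  assert(!x.empty());

  // Running max in float, used for numerical stability.
  float max_value = static_cast<float>(x[0]);
  for (const T& v : x) {
    max_value = std::max(max_value, static_cast<float>(v));
  }

  // Running sum of exponentials, also in float.
  float sum = 0.0f;
  for (const T& v : x) {
    sum += std::exp(static_cast<float>(v) - max_value);
  }

  // Only the final result is converted back to the element type.
  std::vector<T> y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] = static_cast<T>(std::exp(static_cast<float>(x[i]) - max_value) / sum);
  }
  return y;
}
```

The element type affects only how values are loaded and stored; the reduction itself does not change with it.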

fs-eire · 2024-09-19T20:40:20Z

I didn't see any call that sets the program cache key. This may be correct (if the necessary information is already in uniforms), but it needs to be confirmed.

webgpu: support MultiHeadAttention operator

409ac5c

xhcao mentioned this pull request Sep 19, 2024

Multi head attention #22143

Closed

fs-eire reviewed Sep 19, 2024

Address Yulong's comments

53a29ff
