[Intel HPU] enable MoE EP for hpu #5855
```diff
@@ -565,6 +565,12 @@ def load_weights(self, weights_iterator) -> None:
     ("attn.cache_v_scale", "cachev_matmul.activation_scale", None, None),
     ("attn.cache_k_zp", "cachek_matmul.activation_zero_point", None, None),
     ("attn.cache_v_zp", "cachev_matmul.activation_zero_point", None, None),
+    ("act_scale", "in_scale", None, None),
```
Contributor:
What do act_scale, attn.q_scale, attn.s_scale, and up_gate_proj_in_scale each represent? FD currently names all of these as weight_scale / activation_scale plus the layer name.

Contributor (author):
act_scale corresponds to the plain mlp. and mlp.shared_experts. prefixes; attn.q_scale / attn.s_scale are analogous to attn.cache_k_scale / attn.cache_v_scale; up_gate_proj_in_scale corresponds to the mlp.experts. prefix.
| ("attn.q_scale", "q_matmul.in_scale", None, None), | ||
|
Contributor:
What do act_scale, attn.q_scale, attn.s_scale, and up_gate_proj_in_scale each represent? FD currently names all of these as weight_scale / activation_scale plus the layer name; we should discuss a standard naming format.

Contributor (author):
The SDPA matmuls in attention and the up/gate/down proj matmuls in MLP / MoE all use tensor_wise_fp8, so each of them needs its own activation_scale.

FD currently only provides activation_scale for K and V, used by the KV cache. When our SDPA computes the QK^T and SV matmuls it needs scales for all four of Q, K, V, and S, but Q and S cannot be called cache_{q/s}_scale, so we kept only attn.q_scale / attn.s_scale.

For the up/gate/down projections in the plain MLP and shared_experts, FD simply renames activation_scale to act_scale.

For the MoE expert part, down_proj.activation_scale (with the expert id stripped) becomes down_proj_in_scale, underscore and all, which is consistent with FD's current naming rule.

For our MoE up_gate part, all experts share one activation_scale, so up_gate_proj.activation_scale is listed separately above as up_gate_proj_in_scale.

The MoE naming follows fused_moe_backend_base.py and the other vendors; no new names are introduced. It is simply that this renaming rule was missing in V1.
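To make the renaming rules described above concrete, here is a small illustrative sketch; the full key strings and prefixes are hypothetical assumptions for demonstration only, not the actual FastDeploy checkpoint layout.

```python
# Illustrative only: example offline-checkpoint suffixes and the FD-side names
# they are expected to rename to, following the rules described in the reply
# above. The full key strings are hypothetical; only the suffix handling matters.
EXPECTED_RENAMES = {
    # plain MLP / shared_experts: "activation_scale" is shortened to "act_scale"
    "mlp.down_proj.activation_scale": "mlp.down_proj.act_scale",
    "mlp.shared_experts.down_proj.activation_scale": "mlp.shared_experts.down_proj.act_scale",
    # MoE experts: the expert id is dropped and ".activation_scale" folds into "_in_scale"
    "mlp.experts.3.down_proj.activation_scale": "mlp.experts.down_proj_in_scale",
    # MoE up_gate: all experts share a single activation scale, kept as one entry
    "mlp.experts.up_gate_proj.activation_scale": "mlp.experts.up_gate_proj_in_scale",
}
```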
| ("attn.s_scale", "s_matmul.in_scale", None, None), | ||
| ("attn.cache_k_scale", "cachek_matmul.in_scale", None, None), | ||
| ("attn.cache_v_scale", "cachev_matmul.in_scale", None, None), | ||
| ("up_gate_proj_in_scale", "up_gate_proj.in_scale", None, None), | ||
| ] | ||
|
|
||
| expert_params_mapping = [] | ||
|
|
```diff
@@ -590,7 +596,10 @@ def load_weights(self, weights_iterator) -> None:
     (param, weight, exp, shard, False) for param, weight, exp, shard in general_params_mapping
 ] + [(param, weight, exp, shard, True) for param, weight, exp, shard in expert_params_mapping]
 checkpoint_to_fd_key_fn = rename_offline_ckpt_suffix_to_fd_suffix(
-    fd_config=self.fd_config, ckpt_weight_suffix="quant_weight", ckpt_scale_suffix="weight_scale"
+    fd_config=self.fd_config,
+    ckpt_weight_suffix="quant_weight",
+    ckpt_scale_suffix="weight_scale",
+    ckpt_act_suffix="activation_scale",
 )
 params_dict = dict(self.named_parameters())
```
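For readers who have not looked at the helper, here is a minimal sketch of what a suffix-renaming closure such as rename_offline_ckpt_suffix_to_fd_suffix might do once ckpt_act_suffix is added; the FD-side target suffixes and the whole body are assumptions for illustration, not the actual FastDeploy implementation.

```python
from typing import Callable


def rename_suffix_sketch(
    fd_config=None,                            # unused in this sketch
    ckpt_weight_suffix: str = "quant_weight",
    ckpt_scale_suffix: str = "weight_scale",
    ckpt_act_suffix: str = "activation_scale",
) -> Callable[[str], str]:
    """Build a checkpoint-key -> FD-key rename function (illustrative only)."""
    # Assumed FD-side target suffixes; the real helper may map them differently.
    suffix_map = {
        ckpt_weight_suffix: "weight",       # offline-quantized weight tensor
        ckpt_scale_suffix: "weight_scale",  # per-tensor weight scale
        ckpt_act_suffix: "act_scale",       # per-tensor activation (input) scale
    }

    def checkpoint_to_fd_key(ckpt_key: str) -> str:
        prefix, _, suffix = ckpt_key.rpartition(".")
        if prefix and suffix in suffix_map:
            return f"{prefix}.{suffix_map[suffix]}"
        return ckpt_key

    return checkpoint_to_fd_key


# Example: the activation scale of a plain MLP projection picks up FD's act_scale name.
fn = rename_suffix_sketch()
print(fn("model.layers.0.mlp.down_proj.activation_scale"))  # ...mlp.down_proj.act_scale
```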