[Intel HPU] enable MoE EP for hpu #5855
Conversation
Thanks for your contribution!
Pull request overview
This PR enables MoE (Mixture of Experts) Expert Parallelism (EP) for Intel HPU by modifying the execution path and weight handling to accommodate HPU-specific requirements.
Key changes:
- Modified the MoE forward logic to route HPU through forward_normal regardless of EP/TP configuration
- Converted down_proj_in_scale from a list to a tensor and added padding for HPU's 0x80-byte alignment requirement
- Added up_gate_proj.activation_scale weight loading support for EP mode
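For context, a minimal sketch of the dispatch described in the first bullet, assuming a boolean platform flag; the helper name and signature are illustrative, and which non-HPU method maps to EP versus TP is an assumption, not the PR's literal code:

```python
def select_moe_forward(layer, x, gate_out, is_hpu: bool, ep_size: int):
    """Dispatch sketch: on HPU the plain forward_normal path is used for both EP and TP."""
    if is_hpu:
        return layer.forward_normal(x, gate_out)
    if ep_size > 1:
        return layer.forward_chunked_moe(x, gate_out)      # assumed non-HPU EP path
    return layer.forward_split_allgather(x, gate_out)      # assumed non-HPU TP path
```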
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/layers/moe/moe.py | Routes HPU platform to use forward_normal path for both EP and TP modes |
| fastdeploy/model_executor/layers/backends/intel_hpu/moe/fused_moe_hpu_backend.py | Changes down_proj_in_scale handling from list to tensor and renames apply_tp to apply |
| fastdeploy/worker/hpu_model_runner.py | Adds alignment padding function for scales and implements early return for EP mode |
| fastdeploy/model_executor/load_weight_utils.py | Adds up_gate_proj_in_scale_key to weight loading for EP support |
| examples/intel_hpu/offline_demo.py | Enables EP configuration in demo script |
Codecov Report
❌ Patch coverage is …
@@ Coverage Diff @@
## develop #5855 +/- ##
==========================================
Coverage ? 67.32%
==========================================
Files ? 347
Lines ? 44642
Branches ? 6879
==========================================
Hits ? 30055
Misses ? 12368
Partials ? 2219
add @LeoZhao-Intel @fmiao2372
LeoZhao-Intel left a comment
LGTM
This is the quantization naming convention currently used in the open-source ERNIE 4.5 series models. We are not certain whether it fully covers your use case. If it does, we recommend aligning the naming of the quantized weights with this convention.
1. Quantized weights: append the suffix quant_weight to the layer name (i.e., replace weight with quant_weight), for example: ernie.layers.1.mlp.down_proj.quant_weight
2. Scale for quantized weights: append the suffix .weight_scale, for example: ernie.layers.1.mlp.down_proj.weight_scale
3. Activation scale after quantization: append the suffix .activation_scale, for example: ernie.layers.1.mlp.down_proj.activation_scale
4. Smooth scale (not applicable to ERNIE 4.5T at the moment): append the suffix smooth_scale, for example: ernie.layers.1.mlp.down_proj.smooth_scale
5. Shift bias (not applicable to ERNIE 4.5T at the moment): append the suffix shift_bias, for example: ernie.layers.1.mlp.down_proj.shift_bias
6. Cache KV scale: append the suffixes .cachek_matmul.activation_scale and .cachev_matmul.activation_scale to K and V, respectively, for example: ernie.layers.0.self_attn.cachek_matmul.activation_scale and ernie.layers.0.self_attn.cachev_matmul.activation_scale
7. Cache KV zero point: based on item 6, replace scale with zero_point, for example: ernie.layers.0.self_attn.cachek_matmul.activation_zero_point
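For readability, the cache-KV part of the convention above could be captured in a tiny helper; this is a sketch only, and the helper name is an assumption rather than FastDeploy API:

```python
def cache_kv_scale_key(layer_prefix: str, kv: str, zero_point: bool = False) -> str:
    """Build a cache-KV scale / zero-point key following the convention above.

    e.g. cache_kv_scale_key("ernie.layers.0", "k")
         -> "ernie.layers.0.self_attn.cachek_matmul.activation_scale"
    """
    suffix = "activation_zero_point" if zero_point else "activation_scale"
    return f"{layer_prefix}.self_attn.cache{kv}_matmul.{suffix}"
```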
Fixed.
@bukejiyu commented on this pull request.
In fastdeploy/model_executor/models/ernie4_5_moe.py:
@@ -558,10 +558,14 @@ def load_weights(self, weights_iterator) -> None:
("qkv_proj", "v_proj", None, "v"),
("up_gate_proj", "gate_proj", None, "gate"),
("up_gate_proj", "up_proj", None, "up"),
- ("attn.cache_k_scale", "cachek_matmul.activation_scale", None, None),
- ("attn.cache_v_scale", "cachev_matmul.activation_scale", None, None),
+ ("attn.cache_k_scale", "cachek_matmul.in_scale", None, None),
Are these weights intended to be open-sourced? If so, you could first change cachev_matmul.in_scale in the safetensor files to cachev_matmul.activation_scale.
("up_gate_proj", "up_proj", None, "up"),
("attn.cache_k_scale", "cachek_matmul.activation_scale", None, None),
("attn.cache_v_scale", "cachev_matmul.activation_scale", None, None),
("attn.cache_k_scale", "cachek_matmul.in_scale", None, None),
This change would make the open-source ERNIE models fail to load; I suggest leaving this part unchanged.
In the previous version of this change, checkpoint_to_fd_key_fn first uniformly renamed activation_scale to in_scale, so the subsequent loaded_weight_name.replace could match against the new cachek_matmul.in_scale.
To avoid touching the current implementation, that part has been reverted. The current version removes the replacement in checkpoint_to_fd_key_fn and keeps using cachek_matmul.activation_scale.
| ("attn.cache_v_scale", "cachev_matmul.in_scale", None, None), | ||
| ("attn.cache_k_zp", "cachek_matmul.activation_zero_point", None, None), | ||
| ("attn.cache_v_zp", "cachev_matmul.activation_zero_point", None, None), | ||
| ("act_scale", "in_scale", None, None), |
What do act_scale / attn.q_scale / attn.s_scale / up_gate_proj_in_scale each represent? Currently FD names everything as weight_scale / activation_scale plus the layer name.
act_scale corresponds to mlp. and mlp.shared_experts.:
down_proj.activation_scale --> down_proj.act_scale
up_gate_proj.activation_scale --> up_gate_proj.act_scale
attn.q_scale / attn.s_scale are analogous to attn.cache_k_scale / attn.cache_v_scale.
up_gate_proj_in_scale corresponds to mlp.experts.{exp_id}.:
experts.{exp_id}.up_gate_proj.activation_scale --> experts.up_gate_proj_in_scale
This last one is a single activation_scale shared by all experts, which is why it is not placed in make_expert_params_mapping.
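A hedged sketch of that expert rename; the helper name and regex are illustrative and not the PR's literal checkpoint_to_fd_key_fn code:

```python
import re

def rename_expert_act_scale(ckpt_key: str) -> str:
    """experts.{exp_id}.up_gate_proj.activation_scale -> experts.up_gate_proj_in_scale
    (all experts share one activation scale, so the expert id is dropped)."""
    return re.sub(
        r"experts\.\d+\.up_gate_proj\.activation_scale",
        "experts.up_gate_proj_in_scale",
        ckpt_key,
    )

# e.g. "ernie.layers.1.mlp.experts.7.up_gate_proj.activation_scale"
#   -> "ernie.layers.1.mlp.experts.up_gate_proj_in_scale"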
| ("attn.cache_k_zp", "cachek_matmul.activation_zero_point", None, None), | ||
| ("attn.cache_v_zp", "cachev_matmul.activation_zero_point", None, None), | ||
| ("act_scale", "in_scale", None, None), | ||
| ("attn.q_scale", "q_matmul.in_scale", None, None), |
What do act_scale / attn.q_scale / attn.s_scale / up_gate_proj_in_scale each represent? Currently FD names everything as weight_scale / activation_scale plus the layer name; we need to discuss a standard naming format.
The SDPA matmuls in attention and the up/gate/down proj matmuls in MLP / MoE all use tensor_wise_fp8, so each of them needs its own activation_scale.
Current FD only provides activation scales for K and V, for the KV cache. When our SDPA computes the QK^T and SV matmuls, all four of Q, K, V and S are needed, but Q and S cannot reasonably be named cache_{q/s}_scale, so we kept only attn.q_scale / attn.s_scale.
For the up/gate/down projections in plain MLP and shared_experts, FD only renames activation_scale to act_scale.
For the MoE experts, down_proj.activation_scale, after dropping the expert id, becomes down_proj_in_scale (underscore included), consistent with FD's current naming rule.
For our MoE up_gate part, all experts share a single activation_scale, so up_gate_proj.activation_scale is listed separately above as up_gate_proj_in_scale.
The MoE naming follows fused_moe_backend_base.py and the other vendors; no new names are introduced. This renaming rule is simply missing from the V1 loader.
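As a rough illustration of the tensor-wise FP8 scaling mentioned above; the helper and the dtype choice are assumptions, not the HPU kernels in this PR:

```python
import torch

def quant_fp8_tensor_wise(x: torch.Tensor, act_scale: torch.Tensor):
    """Quantize a whole tensor with a single activation scale
    (e.g. attn.q_scale, attn.s_scale, or a proj's activation_scale)."""
    x_fp8 = (x / act_scale).to(torch.float8_e4m3fn)  # requires a PyTorch build with FP8 dtypes
    return x_fp8, act_scale  # the scale travels with the tensor into the matmul

# QK^T and SV each take two quantized operands, so Q, K, V and S (the attention
# weights, called P in some papers) all need their own per-tensor scale.
```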
xiaoluomi left a comment
What do act_scale / attn.q_scale / attn.s_scale / up_gate_proj_in_scale each represent? Currently FD names everything as weight_scale / activation_scale plus the layer name; we need to discuss a standard naming format.
Also, I've noticed that many quantized-attention write-ups describe what you call S (the attention weights) as P, and then compute PV.
EmmonsCurse left a comment
LGTM for HPU
Motivation
Enable MoE EP for HPU with loader_v1.
Modifications
fastdeploy/model_executor/layers/moe/moe.py
HPU calls forward_normal regardless of EP or TP, and no longer falls into forward_split_allgather or forward_chunked_moe.
fastdeploy/model_executor/layers/backends/intel_hpu/moe/fused_moe_hpu_backend.py
Change down_proj_in_scale from a list to a tensor.
fastdeploy/worker/hpu_model_runner.py
Convert the scale list to a tensor and add a padding dimension to meet the 0x80-byte alignment requirement (sketched below).
fastdeploy/model_executor/load_weight_utils.py
Load up_gate_proj.activation_scale, which EP needs in loader v0.
fastdeploy/model_executor/models/ernie4_5_moe.py
Add attention-related activation_scale name conversions.
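A minimal sketch of the list-to-tensor conversion plus 0x80-byte alignment padding described for hpu_model_runner.py above; the function name and padding strategy are illustrative assumptions, not the PR's exact code:

```python
import torch

def scales_list_to_aligned_tensor(scales: list[float], align_bytes: int = 0x80) -> torch.Tensor:
    """Stack per-expert scales into one tensor and zero-pad it so its storage
    size is a multiple of align_bytes (the HPU alignment requirement)."""
    t = torch.tensor(scales, dtype=torch.float32)
    elem = t.element_size()
    pad_elems = (-t.numel() * elem) % align_bytes // elem  # elements needed to reach alignment
    if pad_elems:
        t = torch.cat([t, torch.zeros(pad_elems, dtype=t.dtype)])
    return t

# e.g. 5 fp32 scales (20 bytes) are padded to 32 elements (128 bytes = 0x80).
```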
Usage or Command
Set enable_expert_parallel=True and disable_sequence_parallel_moe=True to enable HPU MoE EP (a usage sketch follows below).
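A hedged usage sketch modelled on examples/intel_hpu/offline_demo.py; the model path, parallel size, and the exact way the flags are passed to the engine are assumptions, and only the two flag names come from this PR:

```python
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="/path/to/ERNIE-4.5-MoE",       # placeholder model path
    tensor_parallel_size=8,               # assumed parallel degree
    enable_expert_parallel=True,          # turn on MoE EP
    disable_sequence_parallel_moe=True,   # required for HPU MoE EP per this PR
)

outputs = llm.generate(["Hello, HPU!"], SamplingParams(max_tokens=32))
```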
Accuracy Tests
Checklist
- Add at least one PR type tag in the title: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- Tests: conducted by local tests.
- If the PR is submitted to the release branch, make sure it has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.