@shangyuan-ant commented Sep 19, 2025

Motivation

The native EPLB algorithm primarily focuses on balancing the computational load across GPUs and machines, but it does not adequately account for inter-expert communication (such as cross-node communication). In large-scale expert-parallelism scenarios, excessive cross-node communication is especially likely to hurt computational efficiency.

Modifications

Building on the existing expert load tracking, we additionally record the top-k expert groups activated in each iteration and use them to compute an expert affinity matrix (i.e., the probability that two experts are co-activated). After EPLB performs its per-GPU load balancing, we adjust the card placement based on the affinity between the most heavily loaded expert on each GPU and the experts on other GPUs, thereby reducing subsequent cross-node communication. This approach achieves an additional ~5% performance improvement over vanilla EPLB.
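Below is a minimal sketch of the two steps described above, under stated assumptions: it is not this PR's actual code, and every name in it (`ExpertAffinityTracker`, `record_topk`, `reorder_gpus_by_affinity`, the greedy node-packing heuristic) is hypothetical rather than taken from the codebase. The first part accumulates co-activation counts from the recorded top-k expert IDs and normalizes them into an affinity matrix; the second part greedily regroups GPUs onto nodes so that GPUs whose hottest experts have high mutual affinity share a node.

```python
# Hypothetical sketch of expert-affinity-aware placement; names and the
# greedy heuristic are illustrative, not the PR's implementation.
import numpy as np


class ExpertAffinityTracker:
    """Accumulate co-activation statistics from per-iteration top-k expert IDs."""

    def __init__(self, num_experts: int):
        self.co_counts = np.zeros((num_experts, num_experts), dtype=np.int64)
        self.act_counts = np.zeros(num_experts, dtype=np.int64)

    def record_topk(self, topk_ids: np.ndarray) -> None:
        """topk_ids: (num_tokens, k) array of expert indices chosen per token."""
        for row in topk_ids:
            uniq = np.unique(row)
            self.act_counts[uniq] += 1
            # Every pair of experts routed to by the same token co-activates.
            self.co_counts[np.ix_(uniq, uniq)] += 1

    def affinity(self) -> np.ndarray:
        """Row-normalized co-activation: aff[i, j] ~ P(j active | i active)."""
        aff = self.co_counts / np.maximum(self.act_counts, 1)[:, None]
        np.fill_diagonal(aff, 0.0)  # self-affinity carries no information
        return aff


def reorder_gpus_by_affinity(placement, expert_load, aff, gpus_per_node):
    """Greedily pack GPUs onto nodes by the affinity of their hottest experts.

    placement:   per-GPU list of expert IDs (output of vanilla EPLB balancing)
    expert_load: per-expert load counts
    aff:         affinity matrix from ExpertAffinityTracker.affinity()
    """
    hottest = [max(experts, key=lambda e: expert_load[e]) for experts in placement]

    def score(g, group):
        # Symmetrized affinity between g's hottest expert and the group's.
        return sum(aff[hottest[g], hottest[h]] + aff[hottest[h], hottest[g]]
                   for h in group)

    remaining = set(range(len(placement)))
    order = []
    while remaining:
        # Seed each node with the most heavily loaded unassigned GPU ...
        seed = max(remaining, key=lambda g: expert_load[hottest[g]])
        remaining.remove(seed)
        node = [seed]
        # ... then fill it with the GPUs most affine to what is already there.
        while len(node) < gpus_per_node and remaining:
            best = max(remaining, key=lambda g: score(g, node))
            remaining.remove(best)
            node.append(best)
        order.extend(node)
    return [placement[g] for g in order]


# Toy usage: 8 experts on 4 GPUs, 2 GPUs per node. Experts 0 and 5 always
# co-activate, so their GPUs end up grouped on the same node.
tracker = ExpertAffinityTracker(num_experts=8)
tracker.record_topk(np.array([[0, 5], [0, 5], [1, 3], [2, 7]]))
placement = [[0, 1], [2, 3], [4, 5], [6, 7]]
load = tracker.act_counts  # activation counts as a simple load proxy
print(reorder_gpus_by_affinity(placement, load, tracker.affinity(), gpus_per_node=2))
```

In the real change, the affinity statistics would come from the existing expert-distribution recording, and the regrouping would run after EPLB's per-GPU rebalancing pass; the greedy packing above is only one possible heuristic for the placement-adjustment step.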

Accuracy Tests

Benchmarking and Profiling

■ request-rate = 5 | max-concurrency (batch size) = 512 / 896 / 1024 / 1536 / 2048
■ num-prompts = 4096 | input-len = 4096 | output-len = 1536
■ dataset: ShareGPT_V3_unfiltered_cleaned_split.json
| batch-size | Metric | W/o EPLB | With EPLB (vanilla) | With EPLB (Expert-Affinity Aware) |
|---|---|---|---|---|
| 64 | P50-TTFT (ms) | 566.78 | 540.49 | 559.74 |
|  | P90-TPOT (ms) | 45.02 | 44.95 | 44.94 |
|  | QPS | 1.35 | 1.36 | 1.36 |
| 128 | P50-TTFT (ms) | 539.93 | 537.22 | 541.04 |
|  | P90-TPOT (ms) | 49.18 | 49.10 | 49.10 |
|  | QPS | 2.36 | 2.36 | 2.36 |
| 256 | P50-TTFT (ms) | 764.42 | 754.62 | 758.67 |
|  | P90-TPOT (ms) | 56.32 | 56.18 | 56.06 |
|  | QPS | 3.37 | 3.37 | 3.37 |
| 1536 | P50-TTFT (ms) | 1464.77 | 1463.27 | 1485.56 |
|  | P90-TPOT (ms) | 85.12 | 84.31 | 81.38 |
|  | P95-ITL (ms) | 102.60 | 100.71 | 97.22 |
|  | QPS | 4.48 | 4.48 | 4.49 |
| 2048 | P50-TTFT (ms) | 1470.45 | 1463.95 | 1480.91 |
|  | P90-TPOT (ms) | 85.15 | 84.60 | 81.39 |
|  | P95-ITL (ms) | 102.87 | 100.21 | 97.08 |
|  | QPS | 4.48 | 4.48 | 4.49 |

Checklist

Signed-off-by: shangyuan-ant <cx483263@antgroup.com>
@yuan-luo (Collaborator) commented:

Could you paste the performance gain result?
