feat: Add Expert Affinity Aware EPLB algorithm. #2
Motivation
The native EPLB algorithm focuses primarily on balancing computational load across GPUs and machines, but it does not adequately account for inter-expert communication (e.g., cross-node communication). In large-scale expert-parallelism scenarios, excessive cross-node communication is likely to degrade computational efficiency.
Modifications
Building on expert load tracking, we additionally record the top-k expert groups activated in each iteration and use them to compute an expert affinity matrix (i.e., the probability that two experts are co-activated). After intra-GPU load balancing via EPLB, we adjust expert placement based on the affinity between the highest-load expert on one GPU and the experts on other GPUs, thereby reducing subsequent cross-node communication. This approach yields an additional ~5% performance improvement over standard EPLB.
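As a rough illustration of the affinity-matrix step described above, here is a minimal sketch of how a co-activation probability matrix could be estimated from recorded top-k routing decisions. The function name and array layout are assumptions for illustration, not the PR's actual implementation:

```python
import numpy as np

def expert_affinity(topk_ids: np.ndarray, num_experts: int) -> np.ndarray:
    """Estimate the co-activation matrix A, where A[i, j] is the fraction
    of tokens that activate both expert i and expert j.

    topk_ids: (num_tokens, k) array of expert indices chosen per token.
    """
    num_tokens = topk_ids.shape[0]
    # One-hot activation mask per token: (num_tokens, num_experts).
    mask = np.zeros((num_tokens, num_experts), dtype=np.float64)
    np.put_along_axis(mask, topk_ids, 1.0, axis=1)
    # Co-activation counts; the diagonal holds per-expert activation counts.
    counts = mask.T @ mask
    return counts / num_tokens

# Toy example: 4 experts, top-2 routing, 3 tokens.
ids = np.array([[0, 1], [0, 2], [1, 2]])
A = expert_affinity(ids, num_experts=4)
# Experts 0 and 1 were co-activated by 1 of 3 tokens, so A[0, 1] == 1/3.
```

A placement pass could then prefer hosting high-affinity expert pairs on the same node, so that their frequent co-activation stays on intra-node links.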
Accuracy Tests
Benchmarking and Profiling
Checklist