Reduce the device bubble introduced by heavy loop synchronization in coalesced fetch/release(z3_leaf_module) #6694
base: master
Conversation
…pSpeed into z3_coalesced_fetch (d9c3687 to 663e637)
Hi, @tjruwase. Some models use a large number of MoE experts while the total number of MoE parameters stays unchanged. This PR handles such cases as a whole, optimizing memory management and reducing the bubbles caused by long loops. Do you have any suggestions?
```python
for part_to_copy in partitions:
    if not get_accelerator().is_synchronized_device():
        part_to_copy.record_stream(get_accelerator().current_stream())
if handle_dependency:
```
Would the following not work?
```diff
- if handle_dependency:
+ if handle_dependency and not get_accelerator().is_synchronized_device():
```
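The rationale behind merging the two conditions can be sketched as follows. On a synchronized device (e.g. CPU), `record_stream` bookkeeping is unnecessary, so the dependency-handling branch can be short-circuited by the same check. The accelerator classes below are illustrative stand-ins, not DeepSpeed's actual accelerator API:

```python
# Minimal sketch of the merged guard from the review suggestion.
# SyncAccelerator / AsyncAccelerator are hypothetical stand-ins for
# get_accelerator() on CPU-like and CUDA-like devices respectively.

class SyncAccelerator:
    def is_synchronized_device(self):
        return True   # e.g. CPU: kernels complete in order, no stream tracking needed

class AsyncAccelerator:
    def is_synchronized_device(self):
        return False  # e.g. CUDA: async streams require lifetime bookkeeping

def needs_dependency_handling(accel, handle_dependency=True):
    # Combined condition: skip dependency handling entirely when the
    # device is synchronized, since record_stream would be a no-op.
    return handle_dependency and not accel.is_synchronized_device()

print(needs_dependency_handling(SyncAccelerator()))   # False: skip on synchronized devices
print(needs_dependency_handling(AsyncAccelerator()))  # True: keep for async devices
```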
Depends on #6649.
When performing fetch/release operations on Z3 leaf modules, the loop time is excessively long for fine-grained modules. Compared to non-leaf modules, a Z3 leaf module may contain a much larger number of parameters. Although each loop iteration is cheap, the total loop length can be significant.
The fetch time is impacted by:
- Post-allgather operations (narrow, slice, cat; difficult to avoid)
- Memory pressure (record_stream / fetch event creation & sync)
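The post-allgather cost can be sketched as follows: each rank holds an equally sized (possibly padded) partition, and recovering a parameter requires a cat across partitions followed by a narrow/slice to drop the padding. Plain Python lists stand in for tensors here; this is an illustrative model, not DeepSpeed's actual implementation:

```python
# Hypothetical sketch of post-allgather reassembly. Each per-rank
# partition is the same length, with the last one zero-padded so the
# total is divisible by the world size.

def reassemble(partitions, param_numel):
    flat = []
    for part in partitions:       # 'cat': stitch per-rank partitions together
        flat.extend(part)
    return flat[:param_numel]     # 'narrow'/'slice': trim alignment padding

# A 3-element parameter sharded across 2 ranks (second partition padded):
print(reassemble([[1, 2], [3, 0]], 3))  # [1, 2, 3]
```

Per parameter this is cheap, but a leaf module with hundreds of small expert parameters repeats it hundreds of times on the CPU, which is where the bubble comes from.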
The release time is impacted by:
- slice
- record_stream on freed parameters
For fine-grained leaf modules, where each parameter is relatively small, we can treat the parameters within each leaf module as a single unified entity when handling memory pressure. This approach roughly halves the CPU time required for fetch/release operations.
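The coalescing idea can be sketched as follows, under the assumption that a leaf module's parameters share one contiguous backing buffer: stream-lifetime bookkeeping is recorded once for the buffer instead of once per parameter. The classes and counts below are illustrative, not DeepSpeed's actual code:

```python
# Sketch: per-parameter vs. coalesced stream bookkeeping.
# FakeStream is a hypothetical stand-in for
# get_accelerator().current_stream(); it only counts record calls.

class FakeStream:
    def __init__(self):
        self.records = 0
    def record(self):
        self.records += 1

def release_per_param(params, stream):
    # Baseline: one bookkeeping call per parameter, so CPU time grows
    # linearly with the number of parameters in the leaf module.
    for _ in params:
        stream.record()

def release_coalesced(params, stream):
    # Coalesced: a single call covers the whole leaf module's buffer.
    stream.record()

s1, s2 = FakeStream(), FakeStream()
params = list(range(256))        # e.g. 256 expert parameters in one leaf module
release_per_param(params, s1)
release_coalesced(params, s2)
print(s1.records, s2.records)    # 256 vs 1 bookkeeping calls
```

The same collapsing applies on the fetch side (one event create/sync per leaf module rather than per parameter), which is what yields the roughly halved CPU time described above.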