Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce the device bubble introduced by heavy loop synchronization in coalesced fetch/release(z3_leaf_module) #6694

Open
wants to merge 40 commits into
base: master
Choose a base branch
from

Conversation

inkcherry
Copy link
Contributor

@inkcherry inkcherry commented Oct 31, 2024

depend on #6649

When performing fetch/release operations on Z3 leaf modules, the loop time is excessively long in fine-grained module. Compared to non-leaf modules, Z3 leaf modules may include a larger number of parameters. Although each loop unit does not consume much time, the overall loop length can be significant.
image
The fetch time is impacted by:

Post-allgather operations (narrow, slice ,cat, difficult to avoid)
Memory pressure(record_stream/fetch event create&sync)
The release time is impacted by:
slice
Free parameter record_stream

Considering the fine-grained leaf modules, where each parameter is relatively small, we can treat the parameters within each leaf module as a unified entity to handle memory pressure. This approach can approximately halve the CPU time required for fetch/release operations.

@inkcherry inkcherry force-pushed the reduce_coalesced_fetch_bubble branch 2 times, most recently from d9c3687 to 663e637 Compare November 12, 2024 08:16
@inkcherry inkcherry requested a review from loadams as a code owner November 12, 2024 08:16
@inkcherry
Copy link
Contributor Author

Hi , @tjruwase. Some models use a large number of MoE experts, but the total number of parameters for MoE remains unchanged. This PR handles such cases as a whole to optimize memory management and reduce bubbles caused by long loops. when zero_module_granularity_threshold is enabled.

Do you have any suggestions?

@tjruwase tjruwase removed the request for review from awan-10 December 23, 2024 18:39
for part_to_copy in partitions:
if not get_accelerator().is_synchronized_device():
part_to_copy.record_stream(get_accelerator().current_stream())
if handle_dependency:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the following not work?

Suggested change
if handle_dependency:
if handle_dependency and not get_accelerator().is_synchronized_device() :

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants