Find a way to properly sort the communication operators for zero2/zero3 #574

Closed
kiya00 opened this issue Jun 11, 2024 · 1 comment
@kiya00
Collaborator

kiya00 commented Jun 11, 2024

🐛 Bug

Background

The communication and computation operators are not positioned well in the execution trace, so communication does not overlap with computation as much as it could.

Currently we have two functions that sort the communication and computation operators. Both rely on topological sorting and the weights we specify:

def sort_communication_ops(execution_trace):

Bottom-up sorting that relies on the order of the outputs. It can misplace reduce_scatter, e.g. in issue #557.

def sort_waits(execution_trace):

Top-down sorting that relies on the order of the inputs. It can misplace all_gathers, e.g. in issue #277.
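To make the "topological sorting plus weights" idea concrete, here is a minimal sketch on a toy graph. It is not the code of `sort_communication_ops` or `sort_waits`; the function `weighted_toposort`, the weight values, and the op names are all hypothetical. It uses Kahn's algorithm in the top-down direction and, among the ops that are ready, picks the one with the smallest weight.

```python
import heapq
from collections import defaultdict


def weighted_toposort(nodes, edges, weight):
    """Kahn's algorithm; among ready nodes the smallest weight goes first.

    Ties are broken by the node's position in `nodes`.
    """
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    position = {n: i for i, n in enumerate(nodes)}

    ready = [(weight(n), position[n], n) for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, _, n = heapq.heappop(ready)
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                heapq.heappush(ready, (weight(m), position[m], m))
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")
    return order


def top_down_weight(op):
    # Hypothetical weights: launch communication early, delay waits.
    if op.startswith("wait"):
        return 2
    if op.startswith(("all_gather", "reduce_scatter")):
        return 0
    return 1


# Toy graph: matmul_a is independent, matmul_b needs the gathered parameter.
nodes = ["matmul_a", "all_gather", "wait", "matmul_b"]
edges = [("all_gather", "wait"), ("wait", "matmul_b")]
order = weighted_toposort(nodes, edges, top_down_weight)
print(order)  # ['all_gather', 'matmul_a', 'wait', 'matmul_b']
```

In this toy case the weights are enough to let `matmul_a` overlap with the all_gather. The ordering only looks at the ops that are ready at each step, which is one way such a sort can end up with a communication op in a poor position on larger traces.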

Expectation:

For zero2:
Sort the all_gathers by the order of their consumers and list them at the beginning of the trace, with each corresponding wait placed right before its consumer (maximum distance between each all_gather and its wait).
Put each reduce/reduce_scatter just after its producer and its corresponding wait just before its consumer (maximum distance between the reduce and its wait).
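The zero2 expectation above can be sketched on a toy trace as follows. This is only an illustration under simplifying assumptions, not a proposed implementation: `Op` and `zero2_order` are hypothetical names, the input trace is assumed to already be in a valid order, every communication op has exactly one wait, all_gathers read only trace inputs, and reduce_scatters read only compute ops.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class Op(NamedTuple):
    name: str
    kind: str                     # "all_gather", "reduce_scatter", "wait" or "compute"
    inputs: Tuple[str, ...] = ()  # names of the ops whose results this op reads


def zero2_order(trace):
    by_name = {op.name: op for op in trace}
    compute = [op for op in trace if op.kind == "compute"]
    index = {op.name: i for i, op in enumerate(compute)}

    def first_consumer(name):
        # Index of the first compute op reading `name`; len(compute) if none.
        hits = (index[c.name] for c in compute if name in c.inputs)
        return min(hits, default=len(compute))

    before = defaultdict(list)  # compute index -> waits emitted right before it
    after = defaultdict(list)   # compute index -> reductions emitted right after it
    gathers = []
    for op in trace:
        if op.kind != "wait":
            continue
        comm = by_name[op.inputs[0]]
        slot = first_consumer(op.name)
        before[slot].append(op)
        if comm.kind == "all_gather":
            gathers.append((slot, comm))
        else:
            producer = max(index[i] for i in comm.inputs)
            after[producer].append(comm)

    gathers.sort(key=lambda pair: pair[0])    # consumer order, stable
    ordered = [comm for _, comm in gathers]   # all_gathers first
    for i, op in enumerate(compute):
        ordered.extend(before[i])
        ordered.append(op)
        ordered.extend(after[i])
    ordered.extend(before[len(compute)])      # waits with no compute consumer
    return ordered


trace = [
    Op("ag_w1", "all_gather"), Op("wait_ag_w1", "wait", ("ag_w1",)),
    Op("fwd1", "compute", ("wait_ag_w1",)),
    Op("ag_w2", "all_gather"), Op("wait_ag_w2", "wait", ("ag_w2",)),
    Op("fwd2", "compute", ("fwd1", "wait_ag_w2")),
    Op("bwd2", "compute", ("fwd2",)),
    Op("rs_g2", "reduce_scatter", ("bwd2",)), Op("wait_rs_g2", "wait", ("rs_g2",)),
    Op("bwd1", "compute", ("bwd2",)),
    Op("rs_g1", "reduce_scatter", ("bwd1",)), Op("wait_rs_g1", "wait", ("rs_g1",)),
    Op("update", "compute", ("wait_rs_g2", "wait_rs_g1")),
]
names = [op.name for op in zero2_order(trace)]
print(names)
```

For this toy trace both all_gathers move to the front, each reduce_scatter follows the backward op that produces its input, and both reduce_scatter waits end up right before `update`.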

For zero3:
Same as zero2, then apply an appropriate limit on the number of in-flight all_gathers to balance memory usage.
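One possible reading of the in-flight limit, sketched on a toy trace that is assumed to already be in the zero2 order described above (all_gathers first, in consumer order). `Op`, `limit_in_flight_all_gathers`, and `max_in_flight` are hypothetical names, and "in flight" here means "issued but not yet waited on"; a real limit might instead track when the gathered buffer is freed.

```python
from collections import deque
from typing import NamedTuple, Tuple


class Op(NamedTuple):
    name: str
    kind: str                     # "all_gather", "wait" or "compute"
    inputs: Tuple[str, ...] = ()


def limit_in_flight_all_gathers(ordered, max_in_flight):
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    kinds = {op.name: op.kind for op in ordered}
    pending = deque(op for op in ordered if op.kind == "all_gather")
    result = []
    in_flight = 0

    def issue():
        # Launch pending all_gathers until the limit is reached.
        nonlocal in_flight
        while pending and in_flight < max_in_flight:
            result.append(pending.popleft())
            in_flight += 1

    issue()
    for op in ordered:
        if op.kind == "all_gather":
            continue  # already handled through `pending`
        result.append(op)
        if op.kind == "wait" and kinds[op.inputs[0]] == "all_gather":
            in_flight -= 1  # one all_gather finished, so a slot is free
            issue()
    return result


ordered = [
    Op("ag1", "all_gather"), Op("ag2", "all_gather"), Op("ag3", "all_gather"),
    Op("wait1", "wait", ("ag1",)), Op("f1", "compute", ("wait1",)),
    Op("wait2", "wait", ("ag2",)), Op("f2", "compute", ("wait2",)),
    Op("wait3", "wait", ("ag3",)), Op("f3", "compute", ("wait3",)),
]
limited = [op.name for op in limit_in_flight_all_gathers(ordered, 2)]
print(limited)
# ['ag1', 'ag2', 'wait1', 'ag3', 'f1', 'wait2', 'f2', 'wait3', 'f3']
```

With `max_in_flight=2`, the third all_gather is delayed until the first wait has completed. A larger limit gives more overlap at the cost of more gathered parameters held in memory at once.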

cc: @IvanYashchuk @crcrpar @kshitij12345

cc @carmocca @crcrpar

@kiya00
Collaborator Author

kiya00 commented Jun 27, 2024

As explained in #592 (comment), I'm closing this issue.

@kiya00 kiya00 closed this as completed Jun 27, 2024