Sort allgathers according to consumer order, reduce scatter according to producer order #592
The issue description has the following expectation:
but the description of this pull request says that allgather sorting puts the allgather op right before the wait op, so the communication would not overlap with computation. Is the pull request description accurate?
Currently, sort_allgather on its own puts the allgather op right before the wait op, and the number in limit_in_flight_allgathers (INT_MAX, 3) then moves the allgather to the proper position for zero2/3.
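As a rough illustration of that second step, here is a minimal sketch of hoisting allgathers under an in-flight limit. It is not the actual limit_in_flight_allgathers implementation: the function name `limit_in_flight_sketch`, the string-based schedule, and the op names are all hypothetical, and it assumes every `all_gather(p)` starts out immediately before its `wait(p)`.

```python
def limit_in_flight_sketch(schedule, max_in_flight):
    """Hoist all_gather ops as early as possible while keeping at most
    `max_in_flight` of them issued but not yet waited on.

    `schedule` is a toy trace: a list of op names in which each
    "all_gather(p)" sits directly before its "wait(p)".
    """
    gathers = [op for op in schedule if op.startswith("all_gather(")]
    rest = [op for op in schedule if not op.startswith("all_gather(")]
    out = []
    next_gather = 0  # index of the next all_gather to issue
    in_flight = 0    # all_gathers issued whose wait has not run yet

    def issue():
        nonlocal next_gather, in_flight
        while next_gather < len(gathers) and in_flight < max_in_flight:
            out.append(gathers[next_gather])
            next_gather += 1
            in_flight += 1

    issue()
    for op in rest:
        out.append(op)
        if op.startswith("wait("):
            # A wait retires one in-flight all_gather, freeing a slot.
            in_flight -= 1
            issue()
    return out


toy_schedule = [
    "all_gather(w1)", "wait(w1)", "linear1",
    "all_gather(w2)", "wait(w2)", "linear2",
    "all_gather(w3)", "wait(w3)", "linear3",
]
print(limit_in_flight_sketch(toy_schedule, 2))
```

With a limit of 2 in this toy trace, `all_gather(w2)` is issued before `wait(w1)` and `all_gather(w3)` before `linear1`, so each gather has compute to overlap with; a very large limit issues every allgather up front.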
After some discussion with @IvanYashchuk, I think we can use the fix suggested by @kshitij12345: the reduce_scatters are already in the right position before sorting, which is enough to solve #557. We could reconsider this PR if one day we can no longer rely on the original order, but for now I'll close it.
What does this PR do?
Fixes #574.
This PR proposes a way to sort the communication ops. Previously we used a single function to sort both allgather and reduce_scatter, but it ran into problems with topologically equal nodes. When allgather and reduce_scatter are sorted in one pass, a top-down sort makes the allgather order depend on the order of the input params, while a bottom-up sort can cause the reduce_scatter ops to accumulate before the wait.
In this PR we have 2 sorting functions:
sort allgather: bottom-up topological sorting. It sorts all_gather_prim_impl and its wait nodes according to the consumer order, and puts all_gather_prim_impl just before its wait.
sort reduce_scatter: top-down topological sorting. It sorts reduce/reduce_scatter and its wait node according to the producer order, and maximizes the distance between reduce/reduce_scatter and its wait.
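The placements these two functions aim for can be sketched on a toy trace. This is only an illustration of the resulting order, not the PR's topological-sort code: the helper names, the `(op_name, names)` tuples, and the string op labels are hypothetical, and the reduce_scatter sketch simply defers every wait to the end of the trace as one way to maximize the distance.

```python
def sort_allgathers_sketch(compute_ops):
    """Order allgathers by their first consumer.

    `compute_ops` is a list of (op_name, [param names it consumes]) in
    execution order. Each param is gathered and waited on right before
    the first op that consumes it.
    """
    schedule, gathered = [], set()
    for op_name, params in compute_ops:
        for p in params:
            if p not in gathered:
                schedule.append(f"all_gather({p})")
                schedule.append(f"wait({p})")
                gathered.add(p)
        schedule.append(op_name)
    return schedule


def sort_reduce_scatters_sketch(backward_ops):
    """Order reduce_scatters by their producer.

    `backward_ops` is a list of (op_name, [grad names it produces]) in
    execution order. Each reduce_scatter is issued right after its
    producer; all waits are deferred to the end of the trace.
    """
    schedule, waits = [], []
    for op_name, grads in backward_ops:
        schedule.append(op_name)
        for g in grads:
            schedule.append(f"reduce_scatter({g})")
            waits.append(f"wait({g})")
    return schedule + waits


forward = [("linear1", ["w1"]), ("linear2", ["w2"]), ("add", ["w1"])]
backward = [("linear2_bwd", ["g2"]), ("linear1_bwd", ["g1"])]
print(sort_allgathers_sketch(forward))
print(sort_reduce_scatters_sketch(backward))
```

In the forward toy trace the allgather order follows the consumers (`w1`, then `w2`) regardless of how the params were declared, and in the backward trace each reduce_scatter starts as soon as its gradient exists.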
But since our reduce_scatter ops are already in the right place without the sorting, we need to think about whether this PR is really necessary.