🐛 Bug
Background
There are problems with how the communication and computation operators are positioned, which prevents them from overlapping well.
Currently, we have two functions that rely on topological sorting and manually specified weights to order the communication and computation operators (a generic sketch of this kind of weighted sort follows the list below):
- `lightning-thunder/thunder/distributed/utils.py`, line 54 (commit 067f15a): bottom-up sorting, which relies on the order of the outputs and can have problems with the position of reduce_scatter, e.g. in issue #557
- `lightning-thunder/thunder/distributed/utils.py`, line 112 (commit 067f15a): top-down sorting, which relies on the order of the inputs and can have problems with the position of all_gathers, e.g. in issue #277
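For reference, a rough sketch of what "topological sorting with weights" means here (the `deps` layout, node names, and weights below are hypothetical, not the actual data structures in `utils.py`): among the nodes whose dependencies are satisfied, the sort picks the one with the smallest weight, breaking ties by the original order, which is roughly the kind of tie-breaking that can leave a wait too close to its collective.

```python
import heapq

def weighted_toposort(deps: dict[str, set[str]], weight: dict[str, int]) -> list[str]:
    """Order nodes so dependencies come first; among ready nodes, pick the one
    with the smallest (weight, original position)."""
    order_hint = {n: i for i, n in enumerate(deps)}  # original order as tie-breaker
    indegree = {n: len(preds) for n, preds in deps.items()}
    ready = [(weight.get(n, 0), order_hint[n], n) for n, d in indegree.items() if d == 0]
    heapq.heapify(ready)
    out = []
    while ready:
        _, _, node = heapq.heappop(ready)
        out.append(node)
        for succ, preds in deps.items():
            if node in preds:
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    heapq.heappush(ready, (weight.get(succ, 0), order_hint[succ], succ))
    return out

# `deps` maps each node to its predecessors; the weight pushes the gather early.
deps = {
    "all_gather_0": set(),
    "wait_0": {"all_gather_0"},
    "matmul_0": {"wait_0"},
    "matmul_1": set(),  # independent compute the gather could overlap with
}
print(weighted_toposort(deps, weight={"all_gather_0": -1}))
# ['all_gather_0', 'wait_0', 'matmul_0', 'matmul_1'] -- the wait is emitted
# before the independent matmul_1, so the gather overlaps with nothing.
```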
Expectation:
For zero2:
- sort the all_gathers by their consumer order and place them at the beginning of the trace, with the corresponding waits right before their consumers (maximizing the distance between each all_gather and its wait).
- place each reduce/reduce_scatter just after its producer and its corresponding wait just before its consumer (maximizing the distance between the reduce and its wait).
For zero3:
- same as zero2, plus an appropriate limit on the number of in-flight all_gathers to balance memory usage (see the sketch below).
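A minimal sketch of that reordering, assuming a simplified flat-trace representation (the `Op` class, its fields, and `reorder_for_overlap` are hypothetical, not Thunder's trace/BoundSymbol machinery). It only covers the all_gather/wait part; reduce/reduce_scatter placement would be handled analogously.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    kind: str        # "all_gather", "wait", or "compute" in this sketch
    ins: tuple = ()  # names of the values this op reads
    outs: tuple = () # names of the values this op produces

def reorder_for_overlap(ops: list[Op], max_in_flight: int | None = None) -> list[Op]:
    """Issue all_gathers as early as possible (in consumer order) and sink each
    wait to just before its first consumer; max_in_flight models the zero3 cap."""
    producers = {v: op for op in ops for v in op.outs}
    # Pair each all_gather with the wait that consumes its output.
    wait_of = {}
    for op in ops:
        if op.kind == "wait" and producers[op.ins[0]].kind == "all_gather":
            wait_of[producers[op.ins[0]].name] = op
    gathers = [op for op in ops if op.name in wait_of]  # program order ~ consumer order
    gather_waits = {w.name for w in wait_of.values()}
    rest = [op for op in ops if op.name not in wait_of and op.name not in gather_waits]

    out: list[Op] = []
    pending = list(gathers)
    in_flight: list[Op] = []

    def issue() -> None:
        # Launch gathers until the in-flight cap (if any) is reached.
        while pending and (max_in_flight is None or len(in_flight) < max_in_flight):
            g = pending.pop(0)
            out.append(g)
            in_flight.append(g)

    issue()
    for op in rest:
        for g in list(in_flight):
            w = wait_of[g.name]
            if set(w.outs) & set(op.ins):  # this op consumes the gathered value
                out.append(w)              # wait lands right before its consumer
                in_flight.remove(g)
                issue()                    # a slot freed up: launch the next gather
        out.append(op)
    out.extend(wait_of[g.name] for g in in_flight)  # flush any remaining waits
    return out

# Hypothetical trace: two sharded weights, each gathered, waited on, and consumed.
ops = [
    Op("ag0", "all_gather", ins=("w0_shard",), outs=("w0_f",)),
    Op("wait0", "wait", ins=("w0_f",), outs=("w0",)),
    Op("mm0", "compute", ins=("x", "w0"), outs=("y0",)),
    Op("ag1", "all_gather", ins=("w1_shard",), outs=("w1_f",)),
    Op("wait1", "wait", ins=("w1_f",), outs=("w1",)),
    Op("mm1", "compute", ins=("y0", "w1"), outs=("y1",)),
]
print([op.name for op in reorder_for_overlap(ops)])                   # zero2-style
# ['ag0', 'ag1', 'wait0', 'mm0', 'wait1', 'mm1']
print([op.name for op in reorder_for_overlap(ops, max_in_flight=1)])  # zero3-style
# ['ag0', 'wait0', 'ag1', 'mm0', 'wait1', 'mm1']
```

With no cap this reproduces the zero2 layout (all gathers up front in consumer order, waits next to their consumers); with `max_in_flight=1` the second gather is only launched once the first wait has been emitted, trading overlap for lower peak memory.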
cc: @IvanYashchuk @crcrpar @kshitij12345
cc @carmocca @crcrpar