Fix deadlock in PipeEngine._exec_recv_grads #5518
I'm using Megatron-DeepSpeed with TP/PP/DP. In my case there are three tensors that need to be communicated between pipeline stages:

- hidden_state (floating point, needs grad)
- attention_mask (int32, no grad)
- cached_rotray_embedding (floating point, no grad)

Only the first tensor has a grad, which satisfies the restriction of PipelineEngine here:

DeepSpeed/deepspeed/runtime/pipe/engine.py
Lines 734 to 736 in 3dd7ccf
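That restriction is roughly "in a tuple output, only the first tensor may require grad". A minimal sketch of the setup (shapes and the way the tensors are built are illustrative, not taken from the actual model):

```python
import torch

# Illustrative stage output: only the first tensor requires grad.
hidden_state = torch.randn(2, 16, 64, requires_grad=True)     # floating, needs grad
attention_mask = torch.ones(2, 1, 16, 16, dtype=torch.int32)  # int32, no grad
cached_rotray_embedding = torch.randn(16, 1, 1, 8)            # floating, no grad

outputs = (hidden_state, attention_mask, cached_rotray_embedding)

# PipelineEngine only requires that the non-first tensors do not need grad,
# which this tuple satisfies.
assert all(not t.requires_grad for t in outputs[1:])
```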
Only the grad of the first tensor is sent by the sending stage:

DeepSpeed/deepspeed/runtime/pipe/engine.py
Lines 1106 to 1109 in 3dd7ccf
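A self-contained sketch of that send-side behaviour (the helper send_grads and the shapes are illustrative, not the real DeepSpeed code): no matter how many tensors sit in the activation tuple, exactly one gradient tensor is handed to P2P.

```python
import torch

def send_grads(inputs):
    """Stand-in for the send side: for a tuple of activations,
    only the first tensor's grad is sent to the previous stage."""
    if isinstance(inputs, tuple):
        first = inputs[0]
        assert first.grad is not None, "the first tensor must carry the grad"
        return [first.grad]  # exactly one P2P send in the real engine
    return [inputs.grad]

hidden_state = torch.randn(4, 8, requires_grad=True)
hidden_state.sum().backward()                         # populate hidden_state.grad
attention_mask = torch.ones(4, 8, dtype=torch.int32)
rotary_cache = torch.randn(4, 8)                      # floating point, no grad

print(len(send_grads((hidden_state, attention_mask, rotary_cache))))  # -> 1
```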
But the next stage tries to receive more than one grad, because tensor.is_floating_point() is used to filter the outputs. In my case cached_rotray_embedding is a floating-point tensor with no grad, so it is caught by the filter. The next stage then expects more data than is sent, which makes training hang:

DeepSpeed/deepspeed/runtime/pipe/engine.py
Lines 1206 to 1209 in 3dd7ccf
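A matching sketch of the receive-side bookkeeping (again illustrative, not the verbatim engine code): a grad buffer is planned for every floating-point output, so in this case two grads are expected while only one ever arrives, and the recv blocks forever.

```python
import torch

def expected_grad_buffers(outputs):
    """Stand-in for the buffer sizing in _exec_recv_grads:
    one grad buffer per floating-point output tensor."""
    return [(list(t.size()), t.dtype) for t in outputs if t.is_floating_point()]

outputs = (
    torch.randn(4, 8, requires_grad=True),  # hidden_state: its grad is actually sent
    torch.ones(4, 8, dtype=torch.int32),    # attention_mask: filtered out (int32)
    torch.randn(4, 8),                      # rotary cache: floating but has no grad
)
print(len(expected_grad_buffers(outputs)))  # -> 2 buffers expected, only 1 grad sent
```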
Since only one grad is sent anyway, we don't need the is_floating_point filter here.
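One possible shape of the fix, consistent with that reasoning (a sketch only; the actual one-line change is in the diff, and the helper name here is hypothetical): size the grad buffer from outputs[0] alone instead of filtering by is_floating_point().

```python
def expected_grad_buffers_fixed(outputs):
    """Sketch of the fixed bookkeeping: the sender only ever transmits
    the first output's grad, so only that one buffer is needed."""
    first = outputs[0] if isinstance(outputs, tuple) else outputs
    return [(list(first.size()), first.dtype)]
```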