Fix deadlock in PipeEngine._exec_recv_grads #5518
I'm using Megatron-DeepSpeed with TP/PP/DP. In my case there are three tensors that need to be communicated between pipeline stages:

- hidden_state (floating point, needs grad)
- attention_mask (int32, no grad)
- cached_rotray_embedding (floating point, no grad)

Only the first tensor has a grad, which satisfies the restriction of PipelineEngine here:

DeepSpeed/deepspeed/runtime/pipe/engine.py
Lines 734 to 736 in 3dd7ccf
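That restriction is roughly "in a tuple output, only the first tensor may require grad". A minimal sketch of the setup (shapes and the way the tensors are built are illustrative, not taken from the actual model):

```python
import torch

# Illustrative stage output: only the first tensor requires grad.
hidden_state = torch.randn(2, 16, 64, requires_grad=True)     # floating, needs grad
attention_mask = torch.ones(2, 1, 16, 16, dtype=torch.int32)  # int32, no grad
cached_rotray_embedding = torch.randn(16, 1, 1, 8)            # floating, no grad

outputs = (hidden_state, attention_mask, cached_rotray_embedding)

# PipelineEngine only requires that the non-first tensors do not need grad,
# which this tuple satisfies.
assert all(not t.requires_grad for t in outputs[1:])
```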
Only the grad of the first tensor is sent by the sending stage:

DeepSpeed/deepspeed/runtime/pipe/engine.py
Lines 1106 to 1109 in 3dd7ccf
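A self-contained sketch of that send-side behaviour (the helper send_grads and the shapes are illustrative, not the real DeepSpeed code): no matter how many tensors sit in the activation tuple, exactly one gradient tensor is handed to P2P.

```python
import torch

def send_grads(inputs):
    """Stand-in for the send side: for a tuple of activations,
    only the first tensor's grad is sent to the previous stage."""
    if isinstance(inputs, tuple):
        first = inputs[0]
        assert first.grad is not None, "the first tensor must carry the grad"
        return [first.grad]  # exactly one P2P send in the real engine
    return [inputs.grad]

hidden_state = torch.randn(4, 8, requires_grad=True)
hidden_state.sum().backward()                         # populate hidden_state.grad
attention_mask = torch.ones(4, 8, dtype=torch.int32)
rotary_cache = torch.randn(4, 8)                      # floating point, no grad

print(len(send_grads((hidden_state, attention_mask, rotary_cache))))  # -> 1
```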
But the next stage tries to receive more than one grad, because tensor.is_floating_point() is used to filter the outputs. In my case cached_rotray_embedding is a floating-point tensor with no grad, so it is caught by the filter. The next stage then expects more data than is sent, which makes training hang:

DeepSpeed/deepspeed/runtime/pipe/engine.py
Lines 1206 to 1209 in 3dd7ccf
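A matching sketch of the receive-side bookkeeping (again illustrative, not the verbatim engine code): a grad buffer is planned for every floating-point output, so in this case two grads are expected while only one ever arrives, and the recv blocks forever.

```python
import torch

def expected_grad_buffers(outputs):
    """Stand-in for the buffer sizing in _exec_recv_grads:
    one grad buffer per floating-point output tensor."""
    return [(list(t.size()), t.dtype) for t in outputs if t.is_floating_point()]

outputs = (
    torch.randn(4, 8, requires_grad=True),  # hidden_state: its grad is actually sent
    torch.ones(4, 8, dtype=torch.int32),    # attention_mask: filtered out (int32)
    torch.randn(4, 8),                      # rotary cache: floating but has no grad
)
print(len(expected_grad_buffers(outputs)))  # -> 2 buffers expected, only 1 grad sent
```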
Since only one grad is sent anyway, we don't need the is_floating_point filter here.
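One possible shape of the fix, consistent with that reasoning (a sketch only; the actual one-line change is in the diff, and the helper name here is hypothetical): size the grad buffer from outputs[0] alone instead of filtering by is_floating_point().

```python
def expected_grad_buffers_fixed(outputs):
    """Sketch of the fixed bookkeeping: the sender only ever transmits
    the first output's grad, so only that one buffer is needed."""
    first = outputs[0] if isinstance(outputs, tuple) else outputs
    return [(list(first.size()), first.dtype)]
```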