Avoid unnecessary NCCL collective coalescing in distributed optimizer #1847

timmoon10 · 2024-09-28T22:48:06Z

I've been seeing data corruption in distributed optimizer checkpoints because PyTorch is not properly synchronizing the NCCL stream with the main CUDA stream. All indications point to a bug in PyTorch's infrastructure for coalesced NCCL calls, and I've narrowed it down to cases where we enter PyTorch's _coalescing_manager but do not perform any NCCL collectives. The debugger suggests that _coalescing_manager sets a flag when it enters the context and fails to unset it, resulting in unexpected behavior in later NCCL calls. I haven't fully root-caused this bug, but this PR fixes the issue for me.
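
For illustration only, here is a minimal sketch of the kind of guard described above: enter the coalescing context only when there is at least one collective to launch. This is not the diff in this PR. `run_collectives` and `fake_coalescing_manager` are hypothetical names, and the fake context manager is a plain-Python stand-in for PyTorch's private `_coalescing_manager`, so the snippet runs without GPUs or an initialized process group.

```python
import contextlib
from typing import Callable, Iterable, List

# Records context entry/exit so the behavior can be inspected.
entered: List[str] = []


@contextlib.contextmanager
def fake_coalescing_manager():
    """Hypothetical stand-in for PyTorch's private _coalescing_manager."""
    entered.append("enter")
    try:
        yield
    finally:
        entered.append("exit")


def run_collectives(
    collectives: Iterable[Callable[[], None]],
    coalescing_manager: Callable[[], contextlib.AbstractContextManager],
) -> int:
    """Launch collectives, coalescing only when there is something to launch.

    Returns the number of collectives that were launched.
    """
    pending = list(collectives)
    if not pending:
        # Nothing to launch: skip the coalescing context entirely, so any
        # state it sets on entry is never touched.
        return 0
    with coalescing_manager():
        for launch in pending:
            launch()
    return len(pending)
```

With an empty list of collectives, `run_collectives([], fake_coalescing_manager)` returns 0 and `entered` stays empty, so the context is never entered. With a real process group, the callables would wrap the actual NCCL collective calls.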

Signed-off-by: Tim Moon <tmoon@nvidia.com>

crcrpar

Looks reasonable to me, thank you

Avoid NCCL collective coalescing in distopt when not needed

5159545

Signed-off-by: Tim Moon <tmoon@nvidia.com>

crcrpar approved these changes Sep 29, 2024

crcrpar merged commit 6102d2c into NVIDIA:master Oct 17, 2024
