Behavior of on_epoch and sync_dist in self.log when logging raw tensors under DDP #21608
When using DDP/DDP Spawn strategies, I’d like to understand how sync_dist works for the following logging call:
Is the synchronization across GPUs performed at every step, or does it happen only once at the end of the epoch, assuming on_step is False? Understanding this would help me decide whether to use it, since per-step synchronization could add significant overhead. I read the relevant code, and it seems to me that the synchronization happens only at the end of the epoch, but I would appreciate confirmation from experts. Thank you!
Replies: 1 comment
sync_dist happens at the end of the epoch when you set on_epoch=True and on_step=False. So you're right, it avoids the per-step overhead.
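
If it helps to see the cost difference, here is a rough toy sketch in plain Python. It is not Lightning's actual implementation: `SyncCounter`, `fake_all_reduce_mean`, `epoch_only_sync`, and `per_step_sync` are hypothetical names made up for illustration, and the fake reduce only stands in for a real collective across ranks. It also assumes every rank runs the same number of steps with equal batch sizes, so that a mean of per-rank means equals the global mean.

```python
# Toy model of the two sync patterns. NOT Lightning's real code; all names
# here are hypothetical. The corresponding real call would look roughly like:
#   self.log("val_loss", loss, on_step=False, on_epoch=True, sync_dist=True)
# where "val_loss" is just an example key.


class SyncCounter:
    """Counts how many cross-rank reductions were issued."""

    def __init__(self):
        self.calls = 0

    def fake_all_reduce_mean(self, per_rank_values):
        # Stand-in for a real collective: average one value per rank.
        self.calls += 1
        return sum(per_rank_values) / len(per_rank_values)


def epoch_only_sync(per_rank_step_values, counter):
    # on_step=False, on_epoch=True: each rank accumulates locally,
    # then a single reduction runs at the end of the epoch.
    local_means = [sum(vals) / len(vals) for vals in per_rank_step_values]
    return counter.fake_all_reduce_mean(local_means)


def per_step_sync(per_rank_step_values, counter):
    # Syncing at every step: one reduction per step, then average the results.
    n_steps = len(per_rank_step_values[0])
    synced = [
        counter.fake_all_reduce_mean([rank[s] for rank in per_rank_step_values])
        for s in range(n_steps)
    ]
    return sum(synced) / len(synced)


# Two ranks, three steps each (made-up loss values).
losses = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]

epoch_counter = SyncCounter()
epoch_value = epoch_only_sync(losses, epoch_counter)

step_counter = SyncCounter()
step_value = per_step_sync(losses, step_counter)

print(epoch_value, epoch_counter.calls)
print(step_value, step_counter.calls)
```

Both patterns give the same epoch value under the assumptions above, but the epoch-only version issues one reduction instead of one per step, which is where the saving comes from.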