
Training stuck with multiple calls of the forward function #46

Open
ArmastusChen opened this issue May 8, 2021 · 7 comments

Comments

@ArmastusChen

Hi,

Thank you for the great code. I have looked at the related issues, but it turns out they don't help in my case. I have a network that uses your Sync BN. I call the forward pass of the model 4 times and sum over all 4 outputs, and it gets stuck in the last forward call. If I reduce the number of calls to 3, everything works fine. I am sure that I do the same thing on all GPUs.

Besides, if I don't do the sum, my code also works well. It is really weird, so I would like to ask if you have any suggestions. Thanks!
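For reference, a minimal sketch of the setup described above (a reconstruction, assuming the `SynchronizedBatchNorm2d` and `DataParallelWithCallback` wrappers from this repo; the real network is much larger):

```python
import torch
import torch.nn as nn

# Assumed imports from this repo; the actual network is larger.
from sync_batchnorm import SynchronizedBatchNorm2d, DataParallelWithCallback

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            SynchronizedBatchNorm2d(64),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

model = DataParallelWithCallback(Net().cuda(), device_ids=[0, 1])
x = torch.randn(8, 3, 64, 64).cuda()

# Summing four forward passes hangs on the last call;
# three calls (or no sum) reportedly work fine.
out = sum(model(x) for _ in range(4))
out.mean().backward()
```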

@vacancy
Owner

vacancy commented May 8, 2021

You can either add checkpoints / debug printing in the main forward function (https://github.com/vacancy/Synchronized-BatchNorm-PyTorch/blob/master/sync_batchnorm/batchnorm.py#L78), or provide a minimal script that reproduces the issue, so that I can take a look.
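One way to do that kind of debug printing without editing the library is to register forward hooks on every sync BN layer. This is only a sketch; `model` and the module name follow the snippet above, not the actual training code:

```python
from sync_batchnorm import SynchronizedBatchNorm2d

def make_hooks(name):
    def pre_hook(module, inputs):
        print(f'{name}: enter forward on {inputs[0].device}', flush=True)
    def post_hook(module, inputs, output):
        print(f'{name}: exit forward on {output.device}', flush=True)
    return pre_hook, post_hook

# Hooks registered on the base module are typically carried over to the
# per-GPU replicas, so each replica reports which layers it enters and leaves.
for name, m in model.named_modules():
    if isinstance(m, SynchronizedBatchNorm2d):
        pre, post = make_hooks(name)
        m.register_forward_pre_hook(pre)
        m.register_forward_hook(post)
```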

@ArmastusChen
Author

Hi,

I am able to run the code after reducing the network size. I tested on two 12 GB Titan X GPUs: if my network is large, let's say 20 GB, the sync BN gets stuck at some point without reporting any error. If I slightly shrink the model to 16 GB, I get an OOM error instead. Further reducing the model size works fine for me. I took a quick look at your source code but didn't figure out why this happens. Do you have any ideas?
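One rough way to narrow this down might be to dump the allocator state on every GPU right before the point where training hangs. This is a sketch only; on older PyTorch versions `memory_reserved` is called `memory_cached`:

```python
import torch

def dump_cuda_memory(tag):
    # Print per-GPU allocated/reserved memory to see which device is near its limit.
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**20
        reserved = torch.cuda.memory_reserved(i) / 2**20
        print(f'[{tag}] cuda:{i} allocated={alloc:.0f} MiB reserved={reserved:.0f} MiB',
              flush=True)

# e.g. call dump_cuda_memory('before forward 4') right before the call that hangs
```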

@vacancy
Owner

vacancy commented May 10, 2021

Interesting findings!

Can you add prints before and after https://github.com/vacancy/Synchronized-BatchNorm-PyTorch/blob/master/sync_batchnorm/batchnorm.py#L133 and L136 to check whether these are the lines where the code gets stuck?

A little background: to run the same module on multiple GPUs, PyTorch actually replicates the module N times, each replica running on a separate thread. To make batch normalization synchronized, we need to add barriers. These barriers require transmitting data among GPUs and thus need additional memory allocations on the GPUs. It is possible that a deadlock is happening:

  • Child process A is waiting for the data (e.g., the batch statistics) from the main process.
  • The main process is trying to send the data to child A, but is waiting for the torch process running on A to release some memory so that it can allocate memory on GPU A.

Unfortunately, there is currently no simple way to check whether such a deadlock is happening, since we cannot directly inspect PyTorch's memory management from Python.
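A much-simplified illustration of the barrier pattern described above, using plain Python threads and queues (not the repo's actual implementation): each replica sends its local statistics to the master and blocks until the reduced result comes back, so if any participant never reaches the barrier, everyone else waits forever.

```python
import queue
import threading

N_REPLICAS = 2
to_master = queue.Queue()
to_slaves = [queue.Queue() for _ in range(N_REPLICAS)]

def replica(idx, local_sum):
    to_master.put((idx, local_sum))    # send local batch statistics
    total = to_slaves[idx].get()       # block until the master broadcasts the result
    print(f'replica {idx}: reduced sum = {total}')

def master():
    msgs = [to_master.get() for _ in range(N_REPLICAS)]  # barrier: wait for all replicas
    total = sum(value for _, value in msgs)
    for idx, _ in msgs:
        to_slaves[idx].put(total)      # broadcast the reduced statistics back

threads = [threading.Thread(target=replica, args=(i, float(i + 1)))
           for i in range(N_REPLICAS)]
for t in threads:
    t.start()
master()
for t in threads:
    t.join()
```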

@ArmastusChen
Author

ArmastusChen commented May 11, 2021

Hi,

Thanks for your reply. It turns out that it does not get stuck on those two lines. My finding is that:

For this function: https://github.com/vacancy/Synchronized-BatchNorm-PyTorch/blob/7553990fb9a917cddd9342e89b6dc12a70573f5b/sync_batchnorm/batchnorm.py#L78

The master process executes this function one time fewer than the slave process, so the slave process just keeps waiting for the message from the master.
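If it helps, here is a sketch of one way such a count could be verified, assuming the `model` and `SynchronizedBatchNorm2d` names from the earlier sketches (not the actual training code): a forward pre-hook that tallies calls per device.

```python
import collections
import threading

from sync_batchnorm import SynchronizedBatchNorm2d

call_counts = collections.Counter()
lock = threading.Lock()

def count_calls(module, inputs):
    # Each replica runs on its own thread, so guard the shared counter.
    with lock:
        call_counts[str(inputs[0].device)] += 1
        print(dict(call_counts), flush=True)

for m in model.modules():
    if isinstance(m, SynchronizedBatchNorm2d):
        m.register_forward_pre_hook(count_calls)
```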

@vacancy
Owner

vacancy commented May 11, 2021

This is weird. Why is that? At the very least, the forward function should be called the same number of times on each process (master/slave).

@ArmastusChen
Author

Not sure about it... But it works fine when there is no OOM issue, so I guess the training code is correct. My code is modified from this repo: https://github.com/swabhs/open-sesame

@vacancy
Owner

vacancy commented May 11, 2021

Can you be more specific about how you came to the conclusion that the forward function on the master process gets called one time fewer than on the slave? Any code snippets showing how you modified this repo would be very helpful.
