
Training stuck with multiple calls of the forward function #46

Open
ArmastusChen opened this issue May 8, 2021 · 7 comments

Comments

@ArmastusChen

Hi,

Thank you for the great code. I have looked at the related issues, but it turns out they don't help in my case. I have a network that uses your Sync BN. I call the forward pass of the model 4 times and sum over all 4 outputs, and it gets stuck in the last forward call. If I reduce the number of calls to 3, everything works fine. I am sure that I do the same thing on all GPUs.

Besides, if I don't do the sum, my code also works well. It is really weird, so I would like to ask if you have any suggestions. Thanks!
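For reference, a minimal sketch of the setup described above (a reconstruction, assuming the `SynchronizedBatchNorm2d` and `DataParallelWithCallback` wrappers from this repo; the real network is much larger):

```python
import torch
import torch.nn as nn

# Assumed imports from this repo; the actual network is larger.
from sync_batchnorm import SynchronizedBatchNorm2d, DataParallelWithCallback

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            SynchronizedBatchNorm2d(64),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

model = DataParallelWithCallback(Net().cuda(), device_ids=[0, 1])
x = torch.randn(8, 3, 64, 64).cuda()

# Summing four forward passes hangs on the last call;
# three calls (or no sum) reportedly work fine.
out = sum(model(x) for _ in range(4))
out.mean().backward()
```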

@vacancy
Owner

vacancy commented May 8, 2021

You can either add checkpoints / debug printing in the main forward function (https://github.com/vacancy/Synchronized-BatchNorm-PyTorch/blob/master/sync_batchnorm/batchnorm.py#L78), or provide a minimal script that reproduces the issue, so that I can take a look.
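One way to do that kind of debug printing without editing the library is to register forward hooks on every sync BN layer. This is only a sketch; `model` and the module name follow the snippet above, not the actual training code:

```python
from sync_batchnorm import SynchronizedBatchNorm2d

def make_hooks(name):
    def pre_hook(module, inputs):
        print(f'{name}: enter forward on {inputs[0].device}', flush=True)
    def post_hook(module, inputs, output):
        print(f'{name}: exit forward on {output.device}', flush=True)
    return pre_hook, post_hook

# Hooks registered on the base module are typically carried over to the
# per-GPU replicas, so each replica reports which layers it enters and leaves.
for name, m in model.named_modules():
    if isinstance(m, SynchronizedBatchNorm2d):
        pre, post = make_hooks(name)
        m.register_forward_pre_hook(pre)
        m.register_forward_hook(post)
```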

@ArmastusChen
Author

Hi,

I am able to run the code after reducing the network size. I tested on two 12 GB Titan X GPUs: if my network is large, let's say 20 GB, the sync BN gets stuck at some point without reporting any error. If I slightly shrink the model to 16 GB, I get an OOM error instead. Further reducing the model size works fine for me. I took a quick look at your source code but didn't figure out why this happens. Do you have any ideas?
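One rough way to narrow this down might be to dump the allocator state on every GPU right before the point where training hangs. This is a sketch only; on older PyTorch versions `memory_reserved` is called `memory_cached`:

```python
import torch

def dump_cuda_memory(tag):
    # Print per-GPU allocated/reserved memory to see which device is near its limit.
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**20
        reserved = torch.cuda.memory_reserved(i) / 2**20
        print(f'[{tag}] cuda:{i} allocated={alloc:.0f} MiB reserved={reserved:.0f} MiB',
              flush=True)

# e.g. call dump_cuda_memory('before forward 4') right before the call that hangs
```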

@vacancy
Owner

vacancy commented May 10, 2021

Interesting findings!

Can you add prints before and after https://github.com/vacancy/Synchronized-BatchNorm-PyTorch/blob/master/sync_batchnorm/batchnorm.py#L133 and L136 to check whether these are the lines where the code gets stuck?

A little background: to run the same module on multiple GPUs, PyTorch actually replicates the module N times, each replica running on a separate thread. To make batch normalization synchronized, we need to add barriers. These barriers require transmitting data among GPUs and thus need additional memory allocations on the GPUs. It is possible that a deadlock is happening:

  • Child process A is waiting for the data (e.g., the batch statistics) from the main process.
  • The main process is trying to send the data to child A, but is waiting for the torch process running on A to release some memory so that it can allocate memory on GPU A.

Unfortunately, there is currently no simple way to check whether such a deadlock is happening, since we cannot directly inspect PyTorch's memory management from Python.
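A much-simplified illustration of the barrier pattern described above, using plain Python threads and queues (not the repo's actual implementation): each replica sends its local statistics to the master and blocks until the reduced result comes back, so if any participant never reaches the barrier, everyone else waits forever.

```python
import queue
import threading

N_REPLICAS = 2
to_master = queue.Queue()
to_slaves = [queue.Queue() for _ in range(N_REPLICAS)]

def replica(idx, local_sum):
    to_master.put((idx, local_sum))    # send local batch statistics
    total = to_slaves[idx].get()       # block until the master broadcasts the result
    print(f'replica {idx}: reduced sum = {total}')

def master():
    msgs = [to_master.get() for _ in range(N_REPLICAS)]  # barrier: wait for all replicas
    total = sum(value for _, value in msgs)
    for idx, _ in msgs:
        to_slaves[idx].put(total)      # broadcast the reduced statistics back

threads = [threading.Thread(target=replica, args=(i, float(i + 1)))
           for i in range(N_REPLICAS)]
for t in threads:
    t.start()
master()
for t in threads:
    t.join()
```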

@ArmastusChen
Author

ArmastusChen commented May 11, 2021

Hi,

Thanks for your reply. It turns out that it does not get stuck on those two lines. My finding is that:

For this function: https://github.com/vacancy/Synchronized-BatchNorm-PyTorch/blob/7553990fb9a917cddd9342e89b6dc12a70573f5b/sync_batchnorm/batchnorm.py#L78

The master process executes this function one time fewer than the slave process, so the slave process just keeps waiting for the message from the master.
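If it helps, here is a sketch of one way such a count could be verified, assuming the `model` and `SynchronizedBatchNorm2d` names from the earlier sketches (not the actual training code): a forward pre-hook that tallies calls per device.

```python
import collections
import threading

from sync_batchnorm import SynchronizedBatchNorm2d

call_counts = collections.Counter()
lock = threading.Lock()

def count_calls(module, inputs):
    # Each replica runs on its own thread, so guard the shared counter.
    with lock:
        call_counts[str(inputs[0].device)] += 1
        print(dict(call_counts), flush=True)

for m in model.modules():
    if isinstance(m, SynchronizedBatchNorm2d):
        m.register_forward_pre_hook(count_calls)
```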

@vacancy
Owner

vacancy commented May 11, 2021

This is weird. Why is that? At the very least, the forward function should be called the same number of times on each process (master/slave).

@ArmastusChen
Author

Not sure about it... But it works fine when there is no OOM issue, so I guess the training code is correct. My code is modified from this repo: https://github.com/swabhs/open-sesame

@vacancy
Owner

vacancy commented May 11, 2021

Can you be more specific about how you came to the conclusion that the forward function on the master process gets called one time fewer than on the slave? Any code snippets showing how you modified this repo would be very helpful.
