Does this support torch.nn.parallel.DistributedDataParallel? #1
Comments
Currently not. The implementation is designed for multi-GPU BatchNorm, which is commonly used for computer vision tasks. Thus, it uses NCCL for multi-GPU broadcasting and reduction. A distributed version would need other types of synchronization primitives (e.g., shared memory or cross-machine synchronization). Contributions would be highly appreciated!
Just to let you know that a PyTorch-compatible Synchronized Batch Norm is provided here: http://hangzh.com/PyTorch-Encoding/index.html
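To make the single-process, multi-GPU style of reduction described above concrete, here is a minimal sketch using torch.cuda.comm (the helpers that DataParallel-style code builds on). The tensors and shapes are purely illustrative; this is not the repo's actual code and assumes at least two visible GPUs:

```python
import torch
from torch.cuda import comm

# Per-GPU partial statistics computed by each replica (illustrative values only).
partial_sums = [torch.randn(64, device=f'cuda:{i}')
                for i in range(torch.cuda.device_count())]

# Reduce all partial sums onto the first device, then broadcast the result back,
# so every replica normalizes with the same global statistics.
total = comm.reduce_add(partial_sums, destination=partial_sums[0].get_device())
synced = comm.broadcast(total, devices=[t.get_device() for t in partial_sums])
```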
@zhanghang1989 Does this support torch.nn.parallel.DistributedDataParallel?
@zhanghang1989 Excuse me, I only see that you use the …
BTW, the Python notebook link is a 404.
@zhanghang1989 Hi Hang. Thanks for the introduction. This repo aims at providing a standalone, easy-to-use version of sync_bn so that it can be easily integrated into any existing framework. Its implementation also differs from your previous one. For an example use of sync_bn, please check: … As for DistributedDataParallel, I currently have no plan to support it. @zhanghang1989 Have you tested your implementation in the distributed setting?
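For readers landing here, a rough usage sketch of the standalone sync_bn described above. The names SynchronizedBatchNorm2d and DataParallelWithCallback are taken from the repo's README, so double-check them against the current code:

```python
import torch
import torch.nn as nn
from sync_batchnorm import SynchronizedBatchNorm2d, DataParallelWithCallback

# SynchronizedBatchNorm2d is a drop-in replacement for nn.BatchNorm2d
# inside an ordinary model definition.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    SynchronizedBatchNorm2d(16),
    nn.ReLU(inplace=True),
)

# DataParallelWithCallback is used instead of nn.DataParallel so that the
# replicas can exchange their batch statistics during the forward pass.
model = DataParallelWithCallback(model.cuda(), device_ids=[0, 1])
output = model(torch.randn(8, 3, 32, 32).cuda())
```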
Thanks @vacancy for the introduction! Nice work. I plan to support distributed training, since a recent paper uses it for object detection.
@zhanghang1989 I am not an expert at this, but it seems to me that DistributedDataParallel uses a different implementation from DataParallel for broadcasting and reduction. In a single-process, multi-thread setting (DataParallel), one can use NCCL's simple …
@vacancy Thanks for the information.
I read the source code of @zhanghang1989. I think your syncbn uses only …
Hi @acgtyrant, what exactly do you mean by "SynchronizedBatchNorm2d is not numerical stable only"?
My tests show that my own SynchronizedBatchNorm2d implementation is not as numerically stable as your BN; there is no other error. So since your BN works, mine should work too. But it does not, and I do not know why... So I need some help now.
@acgtyrant Is DistributedDataParallel working for you? Are you planning to send a Pull Request? Thank you all for this beautiful work!
My distributed synced BN works with DistributedDataParallel, but because of confidentiality requirements from my boss, I deleted the post that contained its source, sorry. However, it is easy to implement: torch.distributed.all_reduce is synchronized automatically, so just wrap it as an autograd function, use it to all-reduce the sum and the sum of squares, and then use that function in your synced BN module.
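A minimal sketch of the recipe described in the previous comment. The names AllReduceSum and distributed_batch_stats are made up for illustration, torch.distributed is assumed to be initialized already, and this is not the deleted code:

```python
import torch
import torch.distributed as dist


class AllReduceSum(torch.autograd.Function):
    """Sum a tensor across all processes; the gradient is likewise summed."""

    @staticmethod
    def forward(ctx, x):
        x = x.clone()
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
        return x

    @staticmethod
    def backward(ctx, grad):
        grad = grad.clone()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        return grad


def distributed_batch_stats(x, eps=1e-5):
    # x: (N, C, H, W). Compute global per-channel mean/var by all-reducing the
    # per-process element count, sum, and sum of squares.
    count = torch.tensor([x.numel() / x.size(1)], device=x.device)
    s = x.sum(dim=(0, 2, 3))
    ssq = (x * x).sum(dim=(0, 2, 3))
    count = AllReduceSum.apply(count)
    s = AllReduceSum.apply(s)
    ssq = AllReduceSum.apply(ssq)
    mean = s / count
    var = ssq / count - mean * mean
    return mean, var
```

These statistics can then be plugged into a BatchNorm-style normalization in place of the per-process batch statistics.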
FYI, there is an open-source repo created by NVIDIA, https://github.com/NVIDIA/apex, which supports SyncBN with … In fact, as of now, it only supports SyncBN with …
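For reference, a rough sketch of how apex's SyncBN is typically enabled. The API names are taken from the apex README at the time of writing, build_model is a placeholder for your own network, and the exact interface may have changed:

```python
import torch
import apex

model = build_model().cuda()  # build_model: placeholder for your own network

# Replace every torch.nn.BatchNorm layer in the model with apex's synchronized version.
model = apex.parallel.convert_syncbn_model(model)

# apex's SyncBN is meant to be used with distributed training, e.g. via
# apex.parallel.DistributedDataParallel (torch.distributed must be initialized first).
model = apex.parallel.DistributedDataParallel(model)
```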
How about using torch.nn.parallel.DistributedDataParallel but running on only one node with 8 GPUs? Does this repo work in that case?
For …