
Does this support torch.nn.parallel.DistributedDataParallel? #1

Open
acgtyrant opened this issue Apr 10, 2018 · 18 comments
Labels
enhancement New feature or request

Comments

@acgtyrant
Contributor

No description provided.

@vacancy
Owner

vacancy commented Apr 10, 2018

Currently, no.

The implementation is designed for multi-GPU BatchNorm, which is commonly used for computer vision tasks, so it uses NCCL for multi-GPU broadcasting and reduction.

A distributed version would need other kinds of synchronization primitives (e.g., shared memory or cross-machine communication).

Contributions will be highly appreciated!

vacancy added the "enhancement" and "help wanted" labels on Apr 10, 2018
@zhanghang1989

zhanghang1989 commented Apr 13, 2018

Just to let you know, a PyTorch-compatible Synchronized Batch Norm is provided here: http://hangzh.com/PyTorch-Encoding/index.html
See the example here.

@acgtyrant
Contributor Author

@zhanghang1989 Does this support torch.nn.parallel.DistributedDataParallel?

@acgtyrant
Contributor Author

@zhanghang1989 Excuse me, I only see that you use DataParallel, not DistributedDataParallel. If you are sure it is supported, I will try DistributedDataParallel myself later.

@acgtyrant
Contributor Author

BTW, the Python notebook link is a 404.

@vacancy
Owner

vacancy commented Apr 13, 2018

@zhanghang1989 Hi Hang. Thanks for the introduction. This repo aims to provide a standalone and easy-to-use version of sync_bn so that it can be easily integrated into any existing framework. Its implementation also differs from your previous one.

For an example use of the sync_bn, please check:
https://github.com/CSAILVision/semantic-segmentation-pytorch

As for DistributedDataParallel, I currently have no plan to support it. @zhanghang1989 Have you tested your implementation in the distributed setting?

@zhanghang1989

Thanks @vacancy for the introduction! Nice work. I plan to support distributed training, since a recent paper uses it for object detection.
@acgtyrant The only thing that needs to be considered for the distributed case is making sure the number of GPUs is set correctly: https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/encoding/nn/syncbn.py#L196

@vacancy
Owner

vacancy commented Apr 13, 2018

@zhanghang1989 I am not an expert at this, but it seems to me that DistributedDataParallel uses a different implementation than DataParallel for broadcasting and reduction. In a single-process, multi-thread setting (DataParallel), one can use NCCL's simple broadcast and reduce; for tensors shared across multiple processes or even multiple machines, a special implementation is needed, which is provided by the torch.distributed package.
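A minimal sketch of the two kinds of primitives being contrasted here (my illustration, not code from this repo; it assumes two visible GPUs for the first half and an already-initialized process group for the second):

```python
import torch
import torch.cuda.comm as comm
import torch.distributed as dist

# Single-process, multi-GPU (the setting DataParallel-style sync BN relies on):
# one process sees all GPUs, so it can sum tensors living on different devices
# and broadcast the result back.
x0 = torch.randn(8, device='cuda:0')
x1 = torch.randn(8, device='cuda:1')
total = comm.reduce_add([x0, x1], destination=0)  # sum across GPUs onto cuda:0
copies = comm.broadcast(total, devices=[0, 1])    # copy the result to every GPU

# Multi-process (DistributedDataParallel): each process owns only its own tensor,
# so synchronization must go through torch.distributed collectives instead.
# dist.init_process_group('nccl', ...) must have been called in every process.
t = torch.randn(8, device='cuda')
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # in-place sum across all ranks
```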

@zhanghang1989

@vacancy Thanks for the information.

@acgtyrant
Contributor Author

I read the source code of DistributedDataParallel and found that it does not broadcast the parameters the way DataParallel does: it only all_reduces the gradients, so that all model replicas use the same gradients to optimize the same model, and all replicas use the same buffers, which are broadcast from the model on device 0 of the rank-0 node.

@zhanghang1989 I think your syncbn uses only comm.broadcast_coalesced and comm.reduce_add_coalesced, which only work across GPUs within a single process; they do not support the cross-node (distributed) case.
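For illustration, a rough sketch of the gradient and buffer handling described above (what DDP does conceptually, not the actual DDP source; it assumes torch.distributed has been initialized):

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce and average the gradients across all ranks, so every
    process applies the same update to its replica."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

def sync_buffers(model: torch.nn.Module) -> None:
    """Broadcast buffers (e.g., BN running_mean / running_var) from rank 0,
    matching the buffer behavior described above."""
    for buf in model.buffers():
        dist.broadcast(buf, src=0)
```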

vacancy removed the "help wanted" label on Aug 19, 2018
@vacancy
Owner

vacancy commented Oct 30, 2018

Hi @acgtyrant

What do you exactly mean by "SynchronizedBatchNorm2d is not numerical stable only"?

@acgtyrant
Contributor Author

acgtyrant commented Oct 30, 2018

My tests show that my implementation of SynchronizedBatchNorm2d is just not as numerically stable as your BN; there is no other error. So since your BN works, mine should work too. But it does not, and I do not know why... so I need some help now.

@acgtyrant
Contributor Author

After fixing the wrong view of the output, I am retraining DRN now, and it seems to work as expected.

(screenshot of the training run attached)

@Cadene

Cadene commented Nov 4, 2018

@acgtyrant Is DistributedDataParallel working for you? Are you planning to send a Pull Request?

Thank you all for this beautiful work!

@acgtyrant
Contributor Author

acgtyrant commented Nov 7, 2018

My distributed synced BN works with DistributedDataParallel, but because of confidentiality requirements from my employer, I have deleted the post that contained its source code, sorry.

However, it is easy to implement: torch.distributed.all_reduce is synchronized automatically, so just wrap it in an autograd function, use it to all-reduce sum and sum_of_square, and then use that function in your own synced BN module.
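A rough sketch of that recipe (my own reconstruction, not the deleted code; it assumes an initialized NCCL process group with one GPU per process and NCHW input):

```python
import torch
import torch.distributed as dist

class AllReduceSum(torch.autograd.Function):
    """torch.distributed.all_reduce wrapped as an autograd function.
    The gradient of a cross-rank sum is itself a cross-rank sum, so the
    backward pass is another all_reduce."""

    @staticmethod
    def forward(ctx, x):
        x = x.clone()
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
        return x

    @staticmethod
    def backward(ctx, grad):
        grad = grad.clone()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        return grad

def global_batch_stats(x):
    """Per-channel mean and (biased) variance of an NCHW tensor, computed
    over the batches of *all* processes via all-reduced sum / sum_of_square."""
    count = x.numel() / x.size(1) * dist.get_world_size()    # elements per channel, all ranks
    s = AllReduceSum.apply(x.sum(dim=(0, 2, 3)))              # per-channel sum
    ssum = AllReduceSum.apply((x * x).sum(dim=(0, 2, 3)))     # per-channel sum of squares
    mean = s / count
    var = ssum / count - mean * mean
    return mean, var
```

These statistics can then be plugged into a sync BN module's normalization and running-statistics update in place of the per-GPU statistics.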

@Spritea

Spritea commented Jan 9, 2019

@acgtyrant Is DistributedDataParallel working for you? Are you planning to send a Pull Request?

Thank you all for this beautiful work!

FYI, there is an open-source repo created by NVIDIA, https://github.com/NVIDIA/apex, which supports SyncBN with DistributedDataParallel.

In fact, as of now it only supports SyncBN with DistributedDataParallel and does not support SyncBN with DataParallel; see this issue: NVIDIA/apex#115
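A minimal usage sketch of the apex route (assuming apex is installed, torch.distributed is initialized with one process per GPU, and using torchvision's resnet50 purely as an example model):

```python
import torch
import torchvision
from apex.parallel import convert_syncbn_model, DistributedDataParallel

model = torchvision.models.resnet50()   # any model containing BatchNorm layers
model = convert_syncbn_model(model)     # swap BatchNorm* modules for apex SyncBatchNorm
model = model.cuda()
model = DistributedDataParallel(model)  # apex's DDP wrapper
```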

@qchenclaire

How about using torch.nn.parallel.DistributedDataParallel but only running on one node with 8 gpus? Does this repo work?

@zhanghang1989

How about using torch.nn.parallel.DistributedDataParallel but only running on one node with 8 gpus? Does this repo work?

For torch.nn.parallel.DistributedDataParallel, please use torch.nn.SyncBatchNorm.
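A minimal sketch of that suggestion (it assumes the process group is already initialized, e.g. via the PyTorch distributed launcher, with one process per GPU; the resnet50 model is just an example):

```python
import os
import torch
import torchvision

local_rank = int(os.environ.get('LOCAL_RANK', 0))  # set by the launcher
torch.cuda.set_device(local_rank)

model = torchvision.models.resnet50()                          # any model with BatchNorm layers
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)   # replace BatchNorm* with SyncBatchNorm
model = model.cuda(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```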
