-
To clarify, we only use Torch RPC in the distributed part. The reason is primarily its compatibility with the PyTorch ecosystem and its ability to handle communication over protocols such as TCP/RDMA. As for IPC and NVLink (which do not require a distributed setting), they are implemented directly in C++ at the CUDA level and are not integrated through the Torch RPC framework.
-
Hello,
I have a simple question, but it is very important to me. I have recently been studying the torch RPC module and am trying to contribute code to the torch community.
What role does the RPC module play in this project's optimizations, and why does the project need torch RPC? Can the torch RPC module be further optimized solely through encapsulation at the Python level?
In essence, does this depend on the capabilities of torch RPC, or on CUDA's IPC and NVLink capabilities?
Looking forward to your answer.