[Issue]: RCCL collective call Alltoall is performing way worse than normal MPI Alltoall on Frontier. #1206

manver-iitk · 2024-06-08T16:26:49Z

Problem Description

I ran my code on Frontier to study scaling on AMD GPUs. It scaled fine with MPI, but as soon as I replace the MPI_Alltoall call with nccl_Alltoall, it performs far worse than MPI. Why?

Operating System

SLES (Frontier)

CPU

AMD EPYC 7763 64-Core Processor

GPU

AMD Instinct MI250X

ROCm Version

ROCm 5.7.1

ROCm Component

rccl

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

edgargabriel · 2024-07-03T16:23:25Z

@manver-iitk a couple of questions:

How many nodes did you use (1 node vs. more than 1 node)?
If the answer is more than 1 node, how did you configure RCCL on Frontier? Did you use the RCCL libfabric plugin for the inter-node communication? If not, RCCL will end up using TCP sockets as far as I know (since Frontier does not support the verbs API), which might explain why RCCL is so much slower than MPI Alltoall.
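As a rough illustration of why the inter-node transport matters for alltoall, the sketch below counts how many of one rank's peers sit on other nodes. It assumes 8 ranks per node (one per MI250X GCD); the thread does not state the actual rank layout, and `alltoall_peer_counts` is a hypothetical helper, not part of RCCL or MPI.

```python
def alltoall_peer_counts(num_nodes, ranks_per_node):
    """Count, for a single rank, how many alltoall peers are on-node vs. off-node."""
    total_ranks = num_nodes * ranks_per_node
    intra = ranks_per_node - 1             # peers sharing this rank's node
    inter = total_ranks - ranks_per_node   # peers reachable only over the network
    return intra, inter


# Assumption: 8 ranks per node; adjust to match the real job layout.
for nodes in (1, 4, 8):
    intra, inter = alltoall_peer_counts(nodes, ranks_per_node=8)
    peers = intra + inter
    share = inter / peers if peers else 0.0
    print(f"{nodes} node(s): {intra} intra-node peers, "
          f"{inter} inter-node peers ({share:.0%} off-node)")
```

Under that assumption, on 4 nodes a rank has 24 of its 31 peers off-node, so most of the exchange goes through whichever network path RCCL selected.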

corey-derochie-amd · 2024-09-19T17:28:01Z

Hello, @manver-iitk.
Has this issue been resolved for you?

manver-iitk · 2024-10-07T13:39:39Z

Hello @corey-derochie-amd, my issue still persists.

@edgargabriel I have also installed the aws_ofi_rccl driver for inter-node communication, but the timings are still almost 2x to 3x those of normal MPI. I'm using 4 to 8 nodes.

thananon · 2024-10-29T15:29:00Z

Hi, for alltoall, RCCL uses a fan-out algorithm, which is very crude (every rank sends to and receives from every other rank), whereas MPI does this in a more algorithmic way. This is an area where we acknowledge NCCL/RCCL is lacking. Unfortunately, optimizing alltoall for multi-node is not high on our priority list.
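To make the contrast concrete, here is a minimal Python sketch that models only the message schedule, not real GPU transfers. The thread does not say which algorithm the MPI library on Frontier uses; pairwise exchange is picked here purely as one well-known structured alternative to fan-out. All function names are hypothetical.

```python
def fan_out_schedule(p):
    """Fan-out: all p*(p-1) messages are posted in a single step."""
    return [[(src, dst) for src in range(p) for dst in range(p) if src != dst]]


def pairwise_schedule(p):
    """Pairwise exchange: in step k, rank r sends to (r + k) % p,
    so each rank has exactly one send and one receive per step."""
    return [[(r, (r + k) % p) for r in range(p)] for k in range(1, p)]


def run_alltoall(schedule, send):
    """Apply a schedule to send[src][dst] and return recv[dst][src]."""
    p = len(send)
    recv = [[None] * p for _ in range(p)]
    for r in range(p):
        recv[r][r] = send[r][r]  # the local block needs no message
    for step in schedule:
        for src, dst in step:
            recv[dst][src] = send[src][dst]
    return recv


p = 4
send = [[f"{src}->{dst}" for dst in range(p)] for src in range(p)]
expected = [[send[src][dst] for src in range(p)] for dst in range(p)]

for name, schedule in (("fan-out", fan_out_schedule(p)),
                       ("pairwise", pairwise_schedule(p))):
    # Both schedules must deliver the same alltoall result.
    assert run_alltoall(schedule, send) == expected
    busiest = max(len(step) for step in schedule)
    print(f"{name}: {len(schedule)} step(s), "
          f"up to {busiest} messages in flight per step")
```

Both schedules move the same data; they differ in how many messages compete for the network at once, which is one plausible reason a structured schedule can behave better across nodes.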

haripriya-amd assigned edgargabriel Jul 17, 2024

ppanchad-amd added the Under Investigation label Oct 29, 2024
