[Issue]: RCCL collective call Alltoall is performing way worse than normal MPI Alltoall on Frontier. #1206

Open
manver-iitk opened this issue Jun 8, 2024 · 4 comments
@manver-iitk

Problem Description

I ran my code on Frontier to test scaling on AMD GPUs. It scaled fine with MPI, but as soon as I replaced the MPI_Alltoall call with nccl_Alltoall, performance became much worse than with MPI. Why?

Operating System

SLES (Frontier)

CPU

AMD EPYC 7763 64-Core Processor

GPU

AMD Instinct MI250X

ROCm Version

ROCm 5.7.1

ROCm Component

rccl

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@edgargabriel
Contributor

@manver-iitk a couple of questions:

  • How many nodes did you use (1 node vs. more than 1 node)?
  • If the answer is more than 1 node, how did you configure RCCL on Frontier? Did you use the RCCL libfabric plugin for the inter-node communication? If not, RCCL will end up using TCP sockets as far as I know (since Frontier does not support the verbs API), which might explain why RCCL is so much slower than MPI Alltoall.

@corey-derochie-amd
Collaborator

Hello, @manver-iitk.
Has this issue been resolved for you?

@manver-iitk
Author

Hello @corey-derochie-amd, my issue still persists.

@edgargabriel I have also installed the aws_ofi_rccl plugin for inter-node communication, but the timings are still almost 2x to 3x those of normal MPI. I'm using 4 to 8 nodes.

@thananon
Contributor

Hi, for alltoall, RCCL uses a fan-out algorithm, which is very crude (every rank sends to and receives from every other rank). MPI, by contrast, does this in a more algorithmic way. This is an area where we acknowledge NCCL/RCCL is lacking. Unfortunately, optimizing alltoall for multi-node is not high on our priority list.
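
To make the difference concrete, here is a toy Python model of the two communication patterns. It is only a sketch: it does not call RCCL or MPI and does not measure performance. The fan-out schedule follows the "everyone sends to everyone" description above. The pairwise exchange schedule is an assumption, chosen as one example of the staged algorithms an MPI library may use for alltoall; the thread does not say which algorithm the MPI on Frontier actually picks. All function names are hypothetical.

```python
def fan_out_schedule(nranks):
    """Fan-out: every (src, dst) transfer is posted in a single step."""
    step = [(src, dst)
            for src in range(nranks)
            for dst in range(nranks)
            if src != dst]
    return [step]


def pairwise_schedule(nranks):
    """Pairwise exchange (assumed example of a staged alltoall).

    In step k, rank r sends to (r + k) % nranks and therefore
    receives from (r - k) % nranks, so each rank talks to one
    destination per step.
    """
    return [[(r, (r + k) % nranks) for r in range(nranks)]
            for k in range(1, nranks)]


def covered_pairs(schedule):
    """Set of all (src, dst) transfers performed by a schedule."""
    return {pair for step in schedule for pair in step}


def max_concurrent_peers(schedule):
    """Largest number of destinations one rank sends to within a step."""
    worst = 0
    for step in schedule:
        destinations = {}
        for src, dst in step:
            destinations.setdefault(src, set()).add(dst)
        worst = max(worst,
                    max((len(d) for d in destinations.values()), default=0))
    return worst


if __name__ == "__main__":
    nranks = 8  # hypothetical rank count, for illustration only
    for name, make in (("fan-out", fan_out_schedule),
                       ("pairwise", pairwise_schedule)):
        schedule = make(nranks)
        print(name, "steps:", len(schedule),
              "max concurrent peers per rank:", max_concurrent_peers(schedule))
```

Both schedules move the same set of messages. The model only shows that fan-out posts them all at once, while the staged variant limits each rank to one outgoing peer per step. Whether that accounts for the 2x to 3x gap reported here would need profiling on the actual system.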
