Problem Description
I ran my code on Frontier to test scaling on AMD GPUs. It scaled fine with MPI, but as soon as I replace the MPI_Alltoall call with nccl_Alltoall, it performs much worse than MPI. Why?
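Since no reproducer was posted, the following is only a minimal sketch of the kind of swap being described. It assumes one MPI rank per GCD, device buffers, a GPU-aware MPI build for the baseline, and that nccl_Alltoall refers to RCCL's ncclAllToAll extension; the element count and the ranks-per-node mapping are placeholders, and error checking is omitted.

```c
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>          /* <rccl.h> on some ROCm installs */

#define COUNT (1 << 20)         /* placeholder: doubles exchanged per peer */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Select the local GPU; assumes 8 GCDs per Frontier node and packed ranks. */
    hipSetDevice(rank % 8);

    size_t n = (size_t)COUNT * nranks;
    double *sendbuf, *recvbuf;
    hipMalloc((void **)&sendbuf, n * sizeof(double));
    hipMalloc((void **)&recvbuf, n * sizeof(double));

    /* Baseline: alltoall on device buffers via GPU-aware MPI. */
    MPI_Alltoall(sendbuf, COUNT, MPI_DOUBLE, recvbuf, COUNT, MPI_DOUBLE,
                 MPI_COMM_WORLD);

    /* RCCL variant: bootstrap the communicator over MPI, then call the
       ncclAllToAll extension (count is per-peer, as with MPI_Alltoall). */
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    ncclComm_t comm;
    hipStream_t stream;
    hipStreamCreate(&stream);
    ncclCommInitRank(&comm, nranks, id, rank);

    ncclAllToAll(sendbuf, recvbuf, COUNT, ncclDouble, comm, stream);
    hipStreamSynchronize(stream);

    ncclCommDestroy(comm);
    hipStreamDestroy(stream);
    hipFree(sendbuf);
    hipFree(recvbuf);
    MPI_Finalize();
    return 0;
}
```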
Operating System
SLES (Frontier)
CPU
AMD EPYC 7763 64-Core Processor
GPU
AMD Instinct MI250X
ROCm Version
ROCm 5.7.1
ROCm Component
rccl
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
How many nodes did you use (1 node vs. more than 1 node)?
If the answer is more than 1 node, how did you configure RCCL on Frontier? Did you use the RCCL libfabric plugin for the inter-node communication? If not, RCCL will end up using TCP sockets as far as I know (since Frontier does not support the verbs API), which might explain why RCCL is so much slower than MPI Alltoall.
@edgargabriel I have also installed the aws_ofi_rccl plugin for inter-node communication, but the timings are still almost 2x to 3x those of plain MPI. I'm using 4 to 8 nodes.
Hi, for alltoall RCCL uses a very crude fan-out algorithm (every rank sends to and receives from every other rank), whereas MPI implementations do this in a more algorithmic way. This is an area where we acknowledge NCCL/RCCL is lacking. Unfortunately, optimizing alltoall for multi-node runs is not high on our priority list.
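For context, the fan-out pattern described above corresponds to the grouped ncclSend/ncclRecv alltoall shown in the NCCL/RCCL documentation. A minimal sketch follows; the helper name alltoall_fanout is made up, and whether a given RCCL release implements ncclAllToAll internally in exactly this way is an assumption, so treat it only as an illustration of the pattern.

```c
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>          /* <rccl.h> on some ROCm installs */

/* Linear fan-out alltoall: every rank posts a send to and a receive from
 * every peer inside one group call; there is no pairwise or hierarchical
 * schedule of the kind tuned MPI_Alltoall implementations typically use.
 * `count` is the per-peer element count, as with MPI_Alltoall. */
static ncclResult_t alltoall_fanout(const double *sendbuf, double *recvbuf,
                                    size_t count, int nranks,
                                    ncclComm_t comm, hipStream_t stream)
{
    ncclGroupStart();
    for (int peer = 0; peer < nranks; ++peer) {
        ncclSend(sendbuf + (size_t)peer * count, count, ncclDouble, peer, comm, stream);
        ncclRecv(recvbuf + (size_t)peer * count, count, ncclDouble, peer, comm, stream);
    }
    return ncclGroupEnd();      /* communication completes asynchronously on `stream` */
}
```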