Releases: ROCm/rccl
RCCL 2.22.3 for ROCm 6.4.0
Added
- RCCL_SOCKET_REUSEADDR and RCCL_SOCKET_LINGER environment parameters
- Setting NCCL_DEBUG=TRACE NCCL_DEBUG_SUBSYS=VERBS will generate traces for fifo and data ibv_post_sends
- Added --log-trace flag to enable traces through the install.sh script (e.g. ./install.sh --log-trace)
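As a minimal sketch of how the trace settings above might be applied when launching a job from Python: the helper name with_verbs_trace and the ./my_rccl_app binary are hypothetical placeholders, and only the two variable names and values come from the notes above.

```python
import os


def with_verbs_trace(base_env=None):
    """Return a copy of the environment with the trace settings from the release notes.

    The input mapping is not modified; os.environ is used when none is given.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "TRACE"
    env["NCCL_DEBUG_SUBSYS"] = "VERBS"
    return env


if __name__ == "__main__":
    env = with_verbs_trace()
    # Hypothetical launch of an RCCL application with tracing enabled:
    # import subprocess
    # subprocess.run(["./my_rccl_app"], env=env, check=True)
    print(env["NCCL_DEBUG"], env["NCCL_DEBUG_SUBSYS"])
```

Building a copy of the environment rather than mutating os.environ keeps the verbose tracing confined to the one child process that needs it.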
Changed
- Compatibility with NCCL 2.22.3
- Added support for the rail-optimized tree algorithm for the MI300 series. This feature requires the use of all eight GPUs within each node. It limits NIC traffic to use only GPUs of the same index across nodes and should not impact performance on non-rail-optimized network topologies. The original method of building trees can be enabled by setting the environment variable RCCL_DISABLE_RAIL_TREES=1.
- Additional debug information about how the trees are built can be logged to the GRAPH logging subsys by setting RCCL_OUTPUT_TREES=1.
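The same-index constraint described above can be illustrated with a toy model. This is not RCCL's tree-building code, only a sketch of which cross-node peers a GPU would be allowed to reach if NIC traffic stays on its own "rail". The function names rail_peers and tree_debug_env are hypothetical, and pairing RCCL_OUTPUT_TREES=1 with NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=GRAPH is an assumption about how the GRAPH output becomes visible.

```python
GPUS_PER_NODE = 8  # the notes state the feature requires all eight GPUs in each node


def rail_peers(node, gpu, num_nodes):
    """List the (node, gpu) pairs on other nodes that share this GPU's index."""
    if not 0 <= gpu < GPUS_PER_NODE:
        raise ValueError("gpu index out of range")
    if not 0 <= node < num_nodes:
        raise ValueError("node index out of range")
    return [(other, gpu) for other in range(num_nodes) if other != node]


def tree_debug_env(base_env, disable_rail_trees=False):
    """Return a copy of base_env with the tree-related variables from the notes set."""
    env = dict(base_env)
    env["RCCL_OUTPUT_TREES"] = "1"
    # Assumption: these two settings are needed for the GRAPH messages to be printed.
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "GRAPH"
    if disable_rail_trees:
        # Falls back to the original method of building trees.
        env["RCCL_DISABLE_RAIL_TREES"] = "1"
    return env


if __name__ == "__main__":
    # GPU 3 on node 0 of a 4-node job only pairs with GPU 3 on the other nodes.
    print(rail_peers(0, 3, 4))  # [(1, 3), (2, 3), (3, 3)]
```

Comparing the logged trees with and without disable_rail_trees=True is one way to check whether the rail-optimized layout is actually in use on a given cluster.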
rccl 2.21.5 for ROCm 6.3.3
RCCL code for ROCm 6.3.3 did not change. The library was rebuilt for the updated ROCm 6.3.3 stack.
rccl 2.21.5 for ROCm 6.3.2
RCCL code for ROCm 6.3.2 did not change. The library was rebuilt for the updated ROCm 6.3.2 stack.
RCCL 2.21.5 for ROCm 6.3.1
Added
Changed
- Enhanced user documentation
Resolved issues
- Corrected user help strings in install.sh
RCCL 2.21.5 for ROCm 6.3.0
Added
- MSCCL++ integration for specific contexts
- Performance collection to rccl_replayer
- Tuner Plugin example for MI300
- Tuning table for large number of nodes
- Support for amdclang++
- New Rome model
Changed
- Compatibility with NCCL 2.21.5
- Increased channel count for MI300X multi-node
- Enabled MSCCL for single-process multi-threaded contexts
- Enabled gfx12
- Enabled CPX mode for MI300X
- Enabled tracing with rocprof
- Improved version reporting
- Enabled GDRDMA for Linux kernel 6.4.0+
Resolved issues
- Fixed model matching with PXN enabled
Known issues
- MSCCL is temporarily disabled for AllGather collectives.
  - This can roughly double latency (~2x) for in-place messages smaller than 2 MB.
  - Older RCCL versions are not impacted.
  - This issue will be addressed in a future ROCm release.
- Unit tests do not exit gracefully when running on a single GPU.
  - This issue will be addressed in a future ROCm release.
rccl 2.20.5 for ROCm 6.2.4
RCCL code for ROCm 6.2.4 did not change. The library was rebuilt for the updated ROCm 6.2.4 stack.
rccl 2.20.5 for ROCm 6.2.2
RCCL code for ROCm 6.2.2 did not change. The library was rebuilt for the updated ROCm 6.2.2 stack.
rccl 2.20.5 for ROCm 6.2.1
RCCL code for ROCm 6.2.1 did not change. The library was rebuilt for the updated ROCm 6.2.1 stack.
RCCL 2.20.5 for ROCm 6.2.0
Changed
- Compatibility with NCCL 2.20.5
- Compatibility with NCCL 2.19.4
- Performance tuning for some collective operations on MI300
- Enabled NVTX code in RCCL
- Replaced rccl_bfloat16 with hip_bfloat16
- NPKit updates:
  - Warm-up iterations are no longer removed by default; removal is now opt-in
  - Doubled the size of buffers to accommodate more channels
- Modified rings to be rail-optimized topology friendly
- Replaced ROCmSoftwarePlatform links with ROCm links
Added
- Support for fp8 and rccl_bfloat8
- Support for using HIP contiguous memory
- Implemented ROC-TX for host-side profiling
- Enabled static build
- New Rome model
- Added fp16 and fp8 cases to unit tests
- New unit test for main kernel stack size
- New -n option for topo_expl to override the number of nodes
- Improved debug messages of memory allocations
- Channel shuffling for IB systems
Fixed
- Bug when configuring RCCL for only LL128 protocol
- Scratch memory allocation after API change for MSCCL
- Incorrect minNchannels in multi-node
rccl 2.18.6 for ROCm 6.1.5
RCCL code for ROCm 6.1.5 did not change. The library was rebuilt for the updated ROCm 6.1.5 stack.