v1.10.0-rc2
Pre-release
1.10.0-rc2 (February 2, 2021)
Features:
Core
- Added support for Nvidia HPC SDK
- Added support for latest PGI and Clang
- Added support for ROCm 3.7+ (a warning is generated if an older version is detected)
Architecture
- Added Arm SVE memcpy()
- Redesigned Arm WFE support
- Improved clear_cache performance for Arm
- Added architecture detection for Zhaoxin CPU
CI
- Added release builds on CUDA 11
- Enabled performance validation in gtest
UCP
- Added locality awareness to the transport selection logic for GPU devices
- Added put/offload/short and put/offload/zcopy protocols
- Added receive message nbx routine
- Reworked the AM implementation and API, adding support for RNDV semantics
- Added support for multi-lane connection manager over TCP
- Added support for printing AM tls with info log level
- Implemented flush and destroy for UCT EPs on UCP worker
- Reduced UCP request size
- Added support for keepalive protocol
- Added support for multi-fragment protocol
- Added implementation for protocol progress for eager, bcopy, and multicopy
- Improved protocol selection logic
- Added new protocols for UCP get operation
- Added bcopy protocols with support for GPU memory
- Added RNDV protocol implementation for GPU devices (CUDA, ROCm)
- Set SOCKADDR_CM_ENABLE=y by default
- Added support for fast-path short with new tag protocols
- Added a new parameter to control the CM listener's backlog
- Added support for sending AM RTS over the short message protocol
- Added support for shared memory multi-lane when CM is used
UCT
- Added API for keepalive_timeout value
- Added uct_completion.status
- Allowed transports to access multiple mem_types
- Removed status arg from uct_completion_callback_t
- Restructured uct_mem_alloc/uct_md_mem_alloc to use mem_type
- Updated documentation for uct_listener_params
- Lowered the log level for certain network errors
- Added cuda_copy wakeup feature
- Added wakeup support for shared memory
UCS
- Added "inf" and "auto" values to time units
- Added on-stack constructors for array and string buffer
- Added ucs_ptr_map_t data structure
- Added bool CSWAP
- Improved logging
- Added optimization for namespace processing
- Fixes for connection matching functionality
RDMA CORE (IB, ROCE, etc.)
- Added support for auto detection of adaptive routing settings
- Added an option to poll TX CQ every progress iteration
- Added local and remote addresses to the reject error message
- Added support for UAR allocation with non-cacheable memory type
- Added support for multiple flush cancel without completion
- Added async events callback support
- Added detection for ConnectX-6, ConnectX-7 and BlueField-1/2 devices
- Added support for connection matching for UD
- Added a check for AM ordering
Java (preview)
- Added support for a different javadoc executable path for different java versions
- Added UCS memory type constants
- Added support for building on Java 10+
- Added support for io-vector datatype
Tests
- Added CI for CUDA 11
- Added test_ucp_sockaddr_protocols.stream_short
- Reimplemented tests using NBX API
- Added flush(cancel) test
- Added memory_wait mode to perftest
- Added support for clang 10
- Refactored RMA and atomic tests and added memtype support
- Added test for uct_md_mem_query()
- Added request interrupt support
- Added support for connection manager fallbacks
- Added new ucp request test checking for leaks from the ptr_map
Documentation
- Added glossaries
Bugfixes:
Portability
- Fixes in print functions to use format string like PRIx64, etc.
Continuous Integration
- Fixes in Github release flow
- Fixes in docker image
Packaging
- Removed deb package dependencies
- Fixes in SPEC to make the RPM relocatable
Documentation
- Fixes in documentation for ucp_am_recv_data_nbx
- Fixes in quick start example
- Fixes in installation instructions
Tests
- Fixes for failures under valgrind runtime
- Fixes in mmap tests for 0-length RMA
- Fixes in definition of LAST_WQE wait timeout
- Fixes in ROCm for mem_buffer test
- Fixes in test name printing format
- Fixes in tcp_sockcm test
UCP
- Fixes in worker cleanup flow
CUDA
- Fixes in managed memory support
RDMA CORE (IB, ROCE, etc.)
- Fixes in assert definitions
- Fixes in printing an error about invalid AM Bcopy length for UD
- Fixes for thread safety support
- Fixes to get ROCE device name according to GID
- Fixes for SL selection
- Fixes in creating the STRICT_ORDER key
- Fixes addressing performance degradation in UD transport due to excess async events
UGNI
- Fixes in disable logic in config
- Fixes for clang 11 warnings
Java
- Fixes in build dependencies
- Fixes in constructing UcpRequest object on error
- Fixes in exception handling on endpoint closure request
- Fixes for segfault in UcpErrorHandler
UCP
- Fixes in datatype support for get_zcopy RNDV
- Fixes in connection manager disconnect
- Fixes in assert definitions
- Fixes in completion flow for failed EP
- Fixes in flush error handling flow
- Fixes in latency calculations for wireup protocol
- Fixes in offload completion with inlined data
- Fixes in unpacking flow
- Fixes in error handling for various protocols
UCT
- Fixes in flush TX
- Fixes in checks for enabling GPU Direct RDMA
UCS
- Fixes for crashes on incorrect value set in config
- Fixes in ptr_array
- Fixes in maximal size for ucs_snprintf_safe()
- Fixes for a compilation warning
- Fixes in ucs_aarch64_dsb(_op) definition
TCP
- Fixes in default route interface confirmation flow
- Fixes in PUT protocol
- Fixes in max connection limit and improved error reporting
UCM
- Fixes for crash on prevent unload
- Fixes in libucm_rocm
- Fixes for a few race conditions