This repository was archived by the owner on Mar 21, 2024. It is now read-only.
CUB 1.15.0 (NVIDIA HPC SDK 21.11) #390
alliepiper announced in Announcements
Summary
CUB 1.15.0 accompanies the NVIDIA HPC SDK 21.11 release. It includes a new cub::DeviceSegmentedSort algorithm, which demonstrates up to 5000x speedup compared to cub::DeviceSegmentedRadixSort when sorting a large number of small segments. A new cub::FutureValue<T> helper allows the cub::DeviceScan algorithms to lazily load the initial_value from a pointer. cub::DeviceScan also added ScanByKey functionality.

The new DeviceSegmentedSort algorithm partitions segments into size groups. Each group is processed with specialized kernels using a variety of sorting algorithms. This approach varies the number of threads allocated for sorting each segment and utilizes the GPU more efficiently.

cub::FutureValue<T> provides the ability to use the result of a previous kernel as a scalar input to a CUB device-scope algorithm without unnecessary synchronization. Previously, an explicit synchronization would have been necessary to obtain the intermediate result, which was passed by value into ExclusiveScan. This new feature enables better performance in workflows that use cub::DeviceScan.
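The deferred read behind cub::FutureValue<T> can be modeled on the host. The sketch below is not CUB code and does not use the CUB API. DeferredValue, FakeStream, enqueue_exclusive_sum, and scan_with_deferred_init are hypothetical names introduced only for illustration, and the "stream" is just a queue of tasks run in order. It shows why wrapping a pointer avoids the synchronization that passing the initial value by value would require: the value is read when the scan task runs, after the producer task has written it.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for cub::FutureValue<T>: it stores only a
// pointer and reads the value when asked, not when it is constructed.
template <typename T>
struct DeferredValue {
    const T* ptr;
    explicit DeferredValue(const T* p) : ptr(p) {}
    T get() const { return *ptr; }
};

// Hypothetical stand-in for a CUDA stream: tasks run later, in enqueue order.
struct FakeStream {
    std::vector<std::function<void()>> tasks;
    void enqueue(std::function<void()> task) { tasks.push_back(std::move(task)); }
    void run() {
        for (auto& task : tasks) task();
        tasks.clear();
    }
};

// Enqueue an exclusive prefix sum whose initial value is read only when the
// task actually executes.
void enqueue_exclusive_sum(FakeStream& stream,
                           const std::vector<int>& in,
                           std::vector<int>& out,
                           DeferredValue<int> init) {
    stream.enqueue([&in, &out, init] {
        int running = init.get();  // the deferred read happens here
        out.resize(in.size());
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[i] = running;
            running += in[i];
        }
    });
}

// Enqueue a producer and a scan back to back, with no "sync" in between.
std::vector<int> scan_with_deferred_init() {
    FakeStream stream;
    int intermediate = 0;  // not yet computed at enqueue time
    std::vector<int> in{1, 2, 3, 4};
    std::vector<int> out;

    stream.enqueue([&intermediate] { intermediate = 10; });  // "previous kernel"
    enqueue_exclusive_sum(stream, in, out, DeferredValue<int>(&intermediate));

    stream.run();  // both tasks execute in enqueue order
    return out;
}
```

Calling scan_with_deferred_init() returns {10, 11, 13, 16}. Had the initial value been captured by value at enqueue time, the scan would have started from 0 unless the producer had been run (synchronized) first.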
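The size-group partitioning described for DeviceSegmentedSort can also be sketched on the host. This is an illustration of the idea only, not the CUB implementation. The thresholds kSmallMax and kMediumMax, the three-group split, and the per-group routines are all assumptions chosen for the example. On the GPU the groups would differ in how many threads cooperate on a segment, whereas here they differ only in which sequential sort is called.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical size thresholds, picked only for this example.
constexpr std::size_t kSmallMax = 16;
constexpr std::size_t kMediumMax = 256;

struct SegmentGroups {
    std::vector<std::size_t> small, medium, large;  // segment indices per group
};

// offsets has num_segments + 1 entries; segment i spans
// [offsets[i], offsets[i + 1]).
SegmentGroups partition_segments(const std::vector<std::size_t>& offsets) {
    SegmentGroups groups;
    for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
        const std::size_t size = offsets[i + 1] - offsets[i];
        if (size <= kSmallMax) groups.small.push_back(i);
        else if (size <= kMediumMax) groups.medium.push_back(i);
        else groups.large.push_back(i);
    }
    return groups;
}

// Simple insertion sort, a reasonable choice for very short ranges.
template <typename It>
void insertion_sort(It first, It last) {
    for (It i = first; i != last; ++i)
        std::rotate(std::upper_bound(first, i, *i), i, std::next(i));
}

// Sort each segment in place, dispatching on the segment's size group.
void segmented_sort(std::vector<int>& keys,
                    const std::vector<std::size_t>& offsets) {
    const SegmentGroups groups = partition_segments(offsets);
    auto first = [&](std::size_t s) {
        return keys.begin() + static_cast<std::ptrdiff_t>(offsets[s]);
    };
    auto last = [&](std::size_t s) {
        return keys.begin() + static_cast<std::ptrdiff_t>(offsets[s + 1]);
    };

    for (std::size_t s : groups.small) insertion_sort(first(s), last(s));
    for (std::size_t s : groups.medium) std::sort(first(s), last(s));
    // Placeholder for a kernel specialized for large segments.
    for (std::size_t s : groups.large) std::stable_sort(first(s), last(s));
}
```

For example, sorting the keys {3, 1, 2, 9, 8, 5} with offsets {0, 3, 5, 6} leaves {1, 2, 3, 8, 9, 5}: each segment is sorted independently, and elements never cross segment boundaries.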
Deprecation Notices
A future version of CUB will change the debug_synchronous behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).

This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true. These algorithms will continue to print extra debugging information, but they will no longer synchronize after kernel launches.

Breaking Changes
- Template parameters of cub::DispatchScan have changed to support the new cub::FutureValue helper. More details under "New Features".
- #377: Remove broken operator->() from cub::TransformInputIterator, since this cannot be implemented without returning a temporary object's address. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

New Features
- Add cub::DeviceScan algorithms that allow the output of a previous kernel to be used as initial_value without explicit synchronization. See the new cub::FutureValue helper for details. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
- Add the cub::BlockRunLengthDecode algorithm. Thanks to Elias Stehle (@elstehle) for this contribution.
- Add cub::DeviceSegmentedSort, an optimized alternative to cub::DeviceSegmentedRadixSort with improved load balancing and small array performance.
- Add ScanByKey functionality to cub::DeviceScan. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

Bug Fixes
- Fixes for the cub::DeviceMergeSort algorithms.
- Fix -Wconversion warnings. Thanks to Matt Stack (@matt-stack) for this contribution.
- Fix an issue in cub::CachingDeviceAllocator.

This discussion was created from the release CUB 1.15.0 (NVIDIA HPC SDK 21.11).