Feed DeviceReduce/DeviceRadixSort num_items from data on device #1347

stndbye · 2024-02-07T11:09:32Z

I need to run a series of ReduceByKey calls, each operating on the output of the previous one.
Is it possible to feed the key count written to d_num_runs_out into the num_items parameter of the next call without copying it to host memory?

I'm trying to write an n-body/SPH solver and need to build my octree via a series of ReduceByKey calls, so these reductions have to happen between rendered frames.
A cudaMemcpy for that single counter sometimes takes several milliseconds (I'm starting to doubt whether I'm measuring the timing correctly, but it seems very expensive).
Sorry, my question probably has more to do with my inexperience with CUDA and its memory handling, but it seemed too CUB-specific for a generic forum.

elstehle (Collaborator) · 2024-02-07T12:18:40Z

Passing the number of items as a value that is only accessible on the device is currently not supported. A common workaround is to either (a) invoke the algorithm on an upper bound of the actual number of input items and ignore any output past the actual results, or (b) copy the device value to the host, as you suggested.

I'd say (a) is only applicable if your actual number of items isn't much smaller than the upper bound, or if your total problem size isn't too large (say, around a million items or so).
To optimize (b), I would allocate page-locked host memory with cudaMallocHost and use it as the destination of a cudaMemcpyAsync (rather than cudaMemcpy) issued on the same stream as your ReduceByKey. Then call cudaStreamSynchronize on that stream before reading the copied value on the host and passing it to the subsequent ReduceByKey invocation.
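
Workaround (b) could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from CUB or from the original poster's solver: `reduce_levels`, `coarsen_keys`, `SumOp`, and `check` are hypothetical names, every value is set to 1.0f, and the "next level" keys are derived by simply dropping the low bit. It also assumes that temporary storage sized for the first (largest) level is sufficient for the later, smaller levels.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <utility>
#include <vector>

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char* what) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    std::exit(EXIT_FAILURE);
  }
}

// Locally defined sum functor (hypothetical), so the sketch does not rely on
// the built-in operators of a particular CUB version.
struct SumOp {
  template <typename T>
  __host__ __device__ T operator()(const T& a, const T& b) const { return a + b; }
};

// Hypothetical coarsening step: sibling keys collapse by dropping the low bit.
__global__ void coarsen_keys(unsigned* keys, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) keys[i] >>= 1;
}

// Runs ReduceByKey level by level on sorted keys and returns the number of
// runs found at each level. The count travels device -> pinned host memory.
std::vector<int> reduce_levels(const std::vector<unsigned>& h_keys) {
  std::vector<int> counts;
  int num_items = static_cast<int>(h_keys.size());
  if (num_items == 0) return counts;
  std::vector<float> h_vals(num_items, 1.0f);

  cudaStream_t stream;
  check(cudaStreamCreate(&stream), "cudaStreamCreate");

  unsigned *d_keys_in, *d_keys_out;
  float *d_vals_in, *d_vals_out;
  int *d_num_runs, *h_num_runs;
  check(cudaMalloc(&d_keys_in, num_items * sizeof(unsigned)), "cudaMalloc");
  check(cudaMalloc(&d_keys_out, num_items * sizeof(unsigned)), "cudaMalloc");
  check(cudaMalloc(&d_vals_in, num_items * sizeof(float)), "cudaMalloc");
  check(cudaMalloc(&d_vals_out, num_items * sizeof(float)), "cudaMalloc");
  check(cudaMalloc(&d_num_runs, sizeof(int)), "cudaMalloc");
  check(cudaMallocHost(&h_num_runs, sizeof(int)), "cudaMallocHost");  // page-locked
  check(cudaMemcpy(d_keys_in, h_keys.data(), num_items * sizeof(unsigned),
                   cudaMemcpyHostToDevice), "cudaMemcpy keys");
  check(cudaMemcpy(d_vals_in, h_vals.data(), num_items * sizeof(float),
                   cudaMemcpyHostToDevice), "cudaMemcpy values");

  // Query the temp storage size once, for the largest level (assumption: the
  // same allocation is large enough for the smaller levels that follow).
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  check(cub::DeviceReduce::ReduceByKey(d_temp, temp_bytes, d_keys_in, d_keys_out,
                                       d_vals_in, d_vals_out, d_num_runs, SumOp{},
                                       num_items, stream), "size query");
  check(cudaMalloc(&d_temp, temp_bytes), "cudaMalloc temp");

  while (num_items > 1) {
    check(cub::DeviceReduce::ReduceByKey(d_temp, temp_bytes, d_keys_in, d_keys_out,
                                         d_vals_in, d_vals_out, d_num_runs, SumOp{},
                                         num_items, stream), "ReduceByKey");
    // Copy the single counter on the same stream, then wait for that stream.
    check(cudaMemcpyAsync(h_num_runs, d_num_runs, sizeof(int),
                          cudaMemcpyDeviceToHost, stream), "cudaMemcpyAsync");
    check(cudaStreamSynchronize(stream), "cudaStreamSynchronize");
    num_items = *h_num_runs;  // now valid on the host
    counts.push_back(num_items);

    // This level's output becomes the next level's input.
    std::swap(d_keys_in, d_keys_out);
    std::swap(d_vals_in, d_vals_out);
    coarsen_keys<<<(num_items + 255) / 256, 256, 0, stream>>>(d_keys_in, num_items);
  }

  cudaFree(d_temp); cudaFree(d_num_runs); cudaFreeHost(h_num_runs);
  cudaFree(d_keys_in); cudaFree(d_keys_out);
  cudaFree(d_vals_in); cudaFree(d_vals_out);
  cudaStreamDestroy(stream);
  return counts;
}
```

A caller would pass keys that are already grouped into consecutive runs, for example `reduce_levels({0, 0, 1, 1, 1, 2, 3, 3})`. Each level still costs one stream synchronization, but the copy itself lands in pinned memory and stays ordered behind the reduction on the same stream.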

jrhemstad (Maintainer) · Feb 7, 2024

@elstehle's answer is a correct summary of the best you can do today, but @stndbye, you are right to feel that there should be a better way. This is part of a larger open design question of how to support "deferred" arguments to CUB algorithms.

There were similar requests to let the initial value used by DeviceScan and DeviceReduce come from device memory; see #884 and NVIDIA/cub#294.

It looks like we added an overload for DeviceScan that takes a FutureValue type in NVIDIA/cub#305, and there was a PR that was never merged to do the same for DeviceReduce.

A key difference between making num_items a deferred value and making init_val one is that num_items affects the kernel launch configuration: how many blocks do you launch if you don't know num_items at launch time?
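
To make the launch-configuration point concrete, here is a minimal host-side sketch. It assumes a kernel that processes one fixed-size tile of items per thread block; `kTileSize` and `num_blocks` are illustrative names and values, not CUB's actual internals.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tile size: items processed by one thread block
// (for example 256 threads x 4 items per thread).
constexpr std::size_t kTileSize = 256 * 4;

// Number of thread blocks needed to cover num_items, rounding up.
// The host has to evaluate this before it can launch the kernel.
inline std::size_t num_blocks(std::size_t num_items,
                              std::size_t tile_size = kTileSize) {
  assert(tile_size > 0);
  return (num_items + tile_size - 1) / tile_size;
}
```

Because the grid size is an argument of the launch itself, a count that exists only in device memory cannot feed this computation without either a copy to the host or launching for an upper bound and letting surplus blocks exit early.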
