Passing the number of items as a value that is accessible only on the device is currently not supported. There are two common workarounds: (a) invoke the algorithm on an upper bound of the actual number of input items and ignore the results past the actual ones, or (b) copy the device value to the host, as you suggested. I'd say (a) is only worthwhile if your actual number of items isn't much smaller than the upper bound, or if your total problem size isn't too large (say, around a million items or so).
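For workaround (b), here is a minimal sketch of how the round trip could look. It rests on assumptions that are not from the question: the buffer names, the `SumOp` functor, and the toy input are hypothetical, error checking is omitted, and a real octree build would coarsen the keys between passes. The counter is read back into pinned host memory with `cudaMemcpyAsync` on the same stream. CUDA events bracket only the copy, so its cost can be told apart from the time spent waiting for the preceding kernels, which a blocking `cudaMemcpy` would include in the measurement.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>
#include <utility>

// Hypothetical reduction functor; any binary op with this shape works.
struct SumOp {
  __host__ __device__ int operator()(int a, int b) const { return a + b; }
};

int main() {
  const int max_items = 8;  // toy upper bound on the input size
  const int h_keys[max_items] = {0, 0, 1, 1, 1, 2, 3, 3};
  const int h_vals[max_items] = {1, 1, 1, 1, 1, 1, 1, 1};

  int *d_keys_in, *d_vals_in, *d_keys_out, *d_vals_out, *d_num_runs_out;
  cudaMalloc(&d_keys_in, max_items * sizeof(int));
  cudaMalloc(&d_vals_in, max_items * sizeof(int));
  cudaMalloc(&d_keys_out, max_items * sizeof(int));
  cudaMalloc(&d_vals_out, max_items * sizeof(int));
  cudaMalloc(&d_num_runs_out, sizeof(int));
  cudaMemcpy(d_keys_in, h_keys, sizeof(h_keys), cudaMemcpyHostToDevice);
  cudaMemcpy(d_vals_in, h_vals, sizeof(h_vals), cudaMemcpyHostToDevice);

  // Pinned host memory, so the device-to-host copy can be truly asynchronous.
  int* h_num_runs = nullptr;
  cudaMallocHost(&h_num_runs, sizeof(int));

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaEvent_t copy_start, copy_stop;
  cudaEventCreate(&copy_start);
  cudaEventCreate(&copy_stop);

  // Query temporary storage once, for the largest input size.
  // Assumption: the requirement does not grow when num_items shrinks.
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceReduce::ReduceByKey(d_temp, temp_bytes, d_keys_in, d_keys_out,
                                 d_vals_in, d_vals_out, d_num_runs_out,
                                 SumOp(), max_items, stream);
  cudaMalloc(&d_temp, temp_bytes);

  int num_items = max_items;
  for (int pass = 0; pass < 2; ++pass) {
    size_t bytes = temp_bytes;  // pass a copy; the argument is a reference
    cub::DeviceReduce::ReduceByKey(d_temp, bytes, d_keys_in, d_keys_out,
                                   d_vals_in, d_vals_out, d_num_runs_out,
                                   SumOp(), num_items, stream);

    // Time only the 4-byte copy, not the reduction queued before it.
    cudaEventRecord(copy_start, stream);
    cudaMemcpyAsync(h_num_runs, d_num_runs_out, sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(copy_stop, stream);
    cudaEventSynchronize(copy_stop);  // wait until the counter has arrived

    float copy_ms = 0.0f;
    cudaEventElapsedTime(&copy_ms, copy_start, copy_stop);
    std::printf("pass %d: %d items in, %d runs out, copy took %.3f ms\n",
                pass, num_items, *h_num_runs, copy_ms);

    // The output of this pass becomes the input of the next one.
    num_items = *h_num_runs;
    std::swap(d_keys_in, d_keys_out);
    std::swap(d_vals_in, d_vals_out);
  }

  cudaFree(d_temp);
  cudaFree(d_keys_in);
  cudaFree(d_vals_in);
  cudaFree(d_keys_out);
  cudaFree(d_vals_out);
  cudaFree(d_num_runs_out);
  cudaFreeHost(h_num_runs);
  cudaEventDestroy(copy_start);
  cudaEventDestroy(copy_stop);
  cudaStreamDestroy(stream);
  return 0;
}
```

If the copy itself turns out to be cheap when timed this way, the milliseconds seen with a plain `cudaMemcpy` are most likely the wait for earlier work on the device rather than the transfer.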
-
I have to do a series of ReduceByKeys, which all work on the output of the previous ReduceByKey call.
Is it possible to feed the count of keys in d_num_runs_out into the num_items parameter of the next call without moving it into host memory?
I'm trying to write an N-body/SPH solver and need to build my octree via a series of ReduceByKeys, so these reductions need to happen between each rendered frame.
Doing a cudaMemcpy for that single counter sometimes takes several milliseconds (I'm starting to doubt whether I'm measuring the timing right, but it seems to be very expensive).
Sorry, my question probably relates more to my inexperience with CUDA and its memory handling, but it seems too specific to CUB for a generic forum.