Closed
Description
I don't have a concrete high-priority issue that needs solving, but it may surprise users that avoiding the CUB segmented sum is much faster here.
The following CuPy code uses CUB by default on newer versions (to make sure, set CUPY_ACCELERATORS=cub):
import cupy as cp
from cupyx.profiler import benchmark

x = cp.ones((1000000, 2))
# Sum over the last axis, which has only two elements:
benchmark(lambda: x.sum(-1), n_repeat=100)
# GPU time spent: 1927.055 us
# Manually do the sum:
benchmark(lambda: x[..., 0] + x[..., 1], n_repeat=100)
# GPU time spent: 56.361 us
That makes x.sum(-1) slower by a factor of about 34 (1927.055 us / 56.361 us) than the manual sum, which should be close to optimal.
Now, as a NumPy dev, I accept that NumPy is also still bad at this, by about a factor of 10. CuPy without CUB was good at it, though.
Still, maybe there is an easy win here that would spare users the surprise of having to rewrite their code.
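To make the rewrite concrete, here is a minimal user-side sketch of the workaround, generalized from the two-element case above. It is not CuPy's implementation or a proposed patch. The helper name `sum_short_last_axis` and the cutoff `max_len=8` are hypothetical, and the cutoff has not been measured. The helper only uses indexing, `+`, and `.sum`, so it should accept both CuPy and NumPy arrays.

```python
import functools
import operator


def sum_short_last_axis(x, max_len=8):
    """Sum over the last axis, unrolling the reduction when that axis is short.

    `max_len` is a hypothetical, unmeasured cutoff; a sensible value would
    have to come from benchmarking on the target device.

    Caveat: for integer dtypes narrower than the default integer, `x.sum`
    upcasts the result while `+` keeps the input dtype, so the two paths
    can return different dtypes. For floating-point inputs they agree.
    """
    n = x.shape[-1]
    if 1 < n <= max_len:
        # n - 1 elementwise additions over strided views of the input,
        # the same thing as x[..., 0] + x[..., 1] for n == 2.
        return functools.reduce(operator.add, (x[..., i] for i in range(n)))
    # Empty, length-1, or long axes: use the library reduction.
    return x.sum(-1)
```

With the array from the report, `sum_short_last_axis(x)` would take the unrolled path and could be timed with `benchmark` in the same way as the two variants above.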