Performance of small sums could be improved #921

I don't have a concrete, high-priority problem that needs solving, but users may be surprised that avoiding the CUB segmented sum is much faster here.

The following CuPy code uses CUB by default on newer versions (to make sure, set `CUPY_ACCELERATORS=cub`):
```
import cupy as cp
from cupyx.profiler import benchmark

x = cp.ones((1000000, 2))

# Sum over the last axis, which has only two elements:
benchmark(lambda: x.sum(-1), n_repeat=100)
# GPU time spent: 1927.055 us

# Do the sum manually:
benchmark(lambda: x[..., 0] + x[..., 1], n_repeat=100)
# GPU time spent: 56.361 us
```
That makes `x.sum(-1)` roughly 34 times slower than the manual version, which should be close to optimal.

Now, as a NumPy dev, I accept that NumPy is _also_ still bad at this, by about a factor of 10! CuPy without CUB was good at it, though.

But maybe there is an easy win here that would spare users the surprise of having to rewrite their code.
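
For reference, here is a minimal sketch of what that rewrite looks like when generalized beyond two elements. The helper name `small_axis_sum` is hypothetical and not part of CuPy or NumPy. It only uses indexing and addition, so it should work on a `cupy.ndarray` as well; the check below uses NumPy only so that it runs without a GPU. I have not benchmarked it, and it is only likely to make sense when the reduced axis is very short:
```python
import numpy as np  # stand-in for the check; a cupy.ndarray should behave the same way


def small_axis_sum(x, axis=-1):
    """Sum over a short axis by adding its slices one by one (hypothetical helper)."""
    n = x.shape[axis]
    if n == 0:
        # Nothing to add up; fall back to the regular reduction.
        return x.sum(axis=axis)
    axis = axis % x.ndim  # normalize negative axes

    def take(i):
        # Select position i along `axis` and keep all other axes.
        return x[(slice(None),) * axis + (i,)]

    out = take(0).copy()  # copy so the input is not modified in place
    for i in range(1, n):
        out += take(i)
    return out


x = np.ones((1000, 2))
result = small_axis_sum(x)
assert result.shape == (1000,)
assert np.array_equal(result, x.sum(-1))
```
Note that the accumulation keeps the input dtype, whereas `sum` may upcast small integer types, so the results can differ for those inputs.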
