Performance of small sums could be improved #921

@seberg

I don't have a concrete high-priority issue that needs solving, but users may be surprised that avoiding the CUB segmented sum is much faster here.

The following CuPy code uses CUB by default on newer versions (set CUPY_ACCELERATORS=cub to make sure):

import cupy as cp
from cupyx.profiler import benchmark

x = cp.ones((1000000, 2))

# Sum over the last axis, which has only two elements:
benchmark(lambda: x.sum(-1), n_repeat=100)
# GPU time spent: 1927.055 us

# Do the sum manually:
benchmark(lambda: x[..., 0] + x[..., 1], n_repeat=100)
# GPU time spent: 56.361 us

That makes the reduction roughly 34 times slower than the manual sum, which should be close to optimal.

Now, as a NumPy dev, I accept that NumPy is also still bad at this: by about a factor of 10! CuPy without CUB was good at it, though.

But, maybe there is an easy win here that would remove the surprise of having to rewrite the code.
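For anyone hitting this in the meantime, the manual rewrite above can be generalized into a small helper. This is only a sketch: `small_axis_sum` and its `max_len` cutoff are hypothetical names made up for illustration, the cutoff value is a guess rather than a measured crossover, and NumPy stands in for CuPy so the snippet runs without a GPU. The helper only uses indexing and `+`, so it should behave the same on CuPy arrays.

```python
import numpy as np  # stand-in for CuPy so the sketch runs without a GPU


def _take(x, axis, i):
    """Return the slice of `x` at position `i` along `axis`."""
    idx = [slice(None)] * x.ndim
    idx[axis] = i
    return x[tuple(idx)]


def small_axis_sum(x, axis=-1, max_len=8):
    """Sum `x` over `axis`, unrolling into elementwise adds for short axes.

    Hypothetical helper: `max_len=8` is an arbitrary cutoff, not a tuned value.
    Note that the unrolled path keeps the input dtype, whereas `sum` may
    promote narrow integer types.
    """
    n = x.shape[axis]
    if n < 2 or n > max_len:
        # Empty, length-1, or long axis: defer to the library's own reduction.
        return x.sum(axis=axis)
    # Short axis: add the slices one by one instead of launching a reduction.
    total = _take(x, axis, 0) + _take(x, axis, 1)
    for i in range(2, n):
        total = total + _take(x, axis, i)
    return total


x = np.ones((1000, 2))
result = small_axis_sum(x)  # same values as x.sum(-1)
```

Whether the unrolled path actually wins, and up to which axis length, depends on the hardware and the CuPy version, so it is worth checking with `cupyx.profiler.benchmark` before relying on it.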
