Improve base step of scan algorithm #1942
Conversation
Force-pushed from 60f60c6 to 18c5ed7.
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_352 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_351 ran successfully.
Using

```
import dpctl.tensor as dpt
import dpctl

x = dpt.ones(2048000, dtype="f4")
q_prof = dpctl.SyclQueue(x.sycl_context, x.sycl_device, property="enable_profiling")
xx = x.to_device(q_prof)
mm = dpt.cumulative_logsumexp(xx)

timer = dpctl.SyclTimer(device_timer="order_manager", time_scale=1e9)
with timer(q_prof):
    for _ in range(250):
        dpt.cumulative_logsumexp(xx, out=mm)

print(f"dpctl.__version__ = {dpctl.__version__}")
print(f"Device: {x.sycl_device}")
print(f"host_dt={timer.dt.host_dt/250}, device_dt={timer.dt.device_dt/250}")
```

Testing on Iris Xe from WSL. This branch:

```
$ python ~/cumlogsumexp.py
dpctl.__version__ = 0.19.0dev0+351.gffd26092a0.dirty
Device: <dpctl.SyclDevice [backend_type.level_zero, device_type.gpu, Intel(R) Graphics [0x9a49]] at 0x7f37a8f995f0>
host_dt=1059589.7079911083, device_dt=1154782.72
```

vs. main branch:

```
$ python cumlogsumexp.py
dpctl.__version__ = 0.19.0dev0+307.g04a8228748
Device: <dpctl.SyclDevice [backend_type.level_zero, device_type.gpu, Intel(R) Graphics [0x9a49]] at 0x7ff6147d3cf0>
host_dt=2721938.803792, device_dt=10048323.168
```

So this is about an 8x speed-up.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_353 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_354 ran successfully.
The latest addition to this PR also modified the custom group function implementing the inclusive scan algorithm for a suitable user-defined binary operation (such as …).

Using the following snippet to micro-benchmark performance:

```
import dpctl.tensor as dpt
import dpctl

x = dpt.ones(2048000, dtype="f4")
q_prof = dpctl.SyclQueue(x.sycl_context, x.sycl_device, property="enable_profiling")
xx = x.to_device(q_prof)
mm = dpt.cumulative_logsumexp(xx)

timer = dpctl.SyclTimer(device_timer="order_manager", time_scale=1e9)
with timer(q_prof):
    for _ in range(250):
        dpt.cumulative_logsumexp(xx, out=mm)

print(f"dpctl.__version__ = {dpctl.__version__}")
print(f"Device: {x.sycl_device}")
print(f"host_dt={timer.dt.host_dt/250}, device_dt={timer.dt.device_dt/250}")
```

Execution on the Iris Xe integrated GPU went from 10 ms to 1.2 ms with this change. The test suite passes on cuda:gpu (RTX 3050 for laptop), opencl:gpu (Iris Xe), level_zero:gpu (Iris Xe, Arc 140V, PVC), and opencl:cpu (Intel Core 7 11th gen, 14th gen, and Xeon Platinum).
`single_step_scan_striped` does not produce correct results for `wg_size > 64`, and tests fail.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_357 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_362 ran successfully.
This LGTM and brings a welcome improvement to the tensor accumulation functions.
Change to the base step of the iterative scan algorithm. When working on contiguous memory, for small values of the `s1` stride, it is advantageous to load memory using striped data placement and reorder the elements in the kernel via shared local memory. This provides a 40-50% speed-up for `tensor.cumulative_sum`. Execution time measured by `timer.device_dt / 120` drops from 20.9 ms on master to 14.3 ms with this change on a Max GPU.
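To illustrate the data-placement change described above: with several elements per work-item, a "blocked" load has each work-item read its own contiguous chunk, while a "striped" load has adjacent work-items read adjacent addresses (which coalesces global-memory accesses), after which the elements are reordered through shared local memory. The sketch below only shows the two index mappings on the host; the names `blocked_indices`, `striped_indices`, `wg_size`, and `n_wi` are illustrative assumptions, not identifiers from the PR:

```
import numpy as np

def blocked_indices(wg_size, n_wi):
    # work-item `lid` reads elements [lid*n_wi, lid*n_wi + n_wi): contiguous per work-item
    return np.array([[lid * n_wi + i for i in range(n_wi)] for lid in range(wg_size)])

def striped_indices(wg_size, n_wi):
    # work-item `lid` reads elements lid, lid + wg_size, lid + 2*wg_size, ...:
    # adjacent work-items touch adjacent addresses, so loads coalesce
    return np.array([[i * wg_size + lid for i in range(n_wi)] for lid in range(wg_size)])

wg_size, n_wi = 4, 3
b = blocked_indices(wg_size, n_wi)
s = striped_indices(wg_size, n_wi)
# Both orderings cover the same wg_size * n_wi elements; only the per-work-item
# assignment differs, which is why a reorder through local memory is needed.
assert sorted(b.ravel()) == sorted(s.ravel())
print("blocked:\n", b)
print("striped:\n", s)
```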