
Improve base step of scan algorithm #1942

Merged — 6 commits merged into master on Dec 27, 2024

Conversation

oleksandr-pavlyk
Collaborator

This PR changes the base step of the iterative scan algorithm. When working on contiguous memory with a small s1 stride, it is advantageous to load data using a striped placement and reorder it in the kernel via shared local memory. This provides a 40-50% speed-up for tensor.cumulative_sum. Measured using the following IPython session:

```
In [2]: import dpctl.tensor as dpt

In [3]: x = dpt.ones(512*1302024, dtype='i4')

In [4]: m = dpt.cumulative_sum(x, dtype="i4")

In [5]: import dpctl

In [6]: q = dpctl.SyclQueue(x.sycl_context, x.sycl_device, property="enable_profiling")

In [7]: xx = x.to_device(q)

In [8]: mm = m.to_device(q)

In [9]: timer = dpctl.SyclTimer(time_scale=1e9, device_timer="order_manager")

In [10]: with timer(q):
    ...:     for _ in range(120):
    ...:         dpt.cumulative_sum(xx, dtype="i4", out=mm)
    ...:
```

Execution time, measured as timer.device_dt / 120, drops from 20.9 ms on master to 14.3 ms with this change on a Max series GPU.
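The idea behind the striped placement can be illustrated with a NumPy sketch (dpctl's actual kernel does this in SYCL with shared local memory; the names `wg_size` and `n_wi` below are illustrative, not the implementation's identifiers). With a blocked load, each work-item reads a contiguous chunk, so adjacent work-items touch addresses far apart; with a striped load, adjacent work-items read adjacent addresses (coalesced access), and a pass through local memory restores the blocked per-work-item layout that the scan step expects:

```python
import numpy as np

wg_size, n_wi = 8, 4  # work-items per group, elements per work-item
data = np.arange(wg_size * n_wi, dtype=np.int32)

# blocked load: work-item `wi` reads the contiguous chunk
# data[wi*n_wi : (wi+1)*n_wi]
blocked = data.reshape(wg_size, n_wi)

# striped load: work-item `wi` reads data[wi], data[wi + wg_size], ...
# so consecutive work-items access consecutive addresses
striped = data.reshape(n_wi, wg_size).T.copy()

# before reordering, register contents differ from the blocked layout
assert not np.array_equal(striped, blocked)

# reorder via shared local memory: each work-item writes its striped
# registers back at their original offsets, then reads a contiguous
# chunk, recovering the blocked layout for the scan step
slm = np.empty_like(data)
for wi in range(wg_size):
    for k in range(n_wi):
        slm[wi + k * wg_size] = striped[wi, k]
reordered = slm.reshape(wg_size, n_wi)
assert np.array_equal(reordered, blocked)
```

The global memory traffic is identical in both schemes; the win comes purely from the coalesced access pattern of the striped reads, while the (cheap) reordering happens in fast local memory.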

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • Have you added documentation for your changes, if necessary?
  • Have you added your changes to the changelog?
  • If this PR is a work in progress, are you opening the PR as a draft?

@oleksandr-pavlyk oleksandr-pavlyk force-pushed the improve-base-step-of-scan-algorithm branch from 60f60c6 to 18c5ed7 Compare December 18, 2024 21:27

github-actions bot commented Dec 18, 2024

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞


Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_352 ran successfully.
Passed: 894
Failed: 2
Skipped: 118

@coveralls
Collaborator

coveralls commented Dec 18, 2024

Coverage Status

Coverage remained the same at 87.659% when pulling 93eba50 on improve-base-step-of-scan-algorithm into 25a961f on master.


Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_351 ran successfully.
Passed: 894
Failed: 2
Skipped: 118

Using the following script to benchmark cumulative_logsumexp:

```
import dpctl.tensor as dpt
import dpctl

x = dpt.ones(2048000, dtype="f4")

q_prof = dpctl.SyclQueue(x.sycl_context, x.sycl_device, property="enable_profiling")
xx = x.to_device(q_prof)
mm = dpt.cumulative_logsumexp(xx)

timer = dpctl.SyclTimer(device_timer="order_manager", time_scale=1e9)
with timer(q_prof):
    for _ in range(250):
        dpt.cumulative_logsumexp(xx, out=mm)

print(f"dpctl.__version__ = {dpctl.__version__}")
print(f"Device: {x.sycl_device}")
print(f"host_dt={timer.dt.host_dt/250}, device_dt={timer.dt.device_dt/250}")
```

Testing on Iris Xe from WSL.

This branch:

```
$ python ~/cumlogsumexp.py
dpctl.__version__ = 0.19.0dev0+351.gffd26092a0.dirty
Device: <dpctl.SyclDevice [backend_type.level_zero, device_type.gpu,  Intel(R) Graphics [0x9a49]] at 0x7f37a8f995f0>
host_dt=1059589.7079911083, device_dt=1154782.72
```

vs. main branch:

```
$ python cumlogsumexp.py
dpctl.__version__ = 0.19.0dev0+307.g04a8228748
Device: <dpctl.SyclDevice [backend_type.level_zero, device_type.gpu,  Intel(R) Graphics [0x9a49]] at 0x7ff6147d3cf0>
host_dt=2721938.803792, device_dt=10048323.168
```

So this is about 8x speed-up.

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_353 ran successfully.
Passed: 895
Failed: 1
Skipped: 118


Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_354 ran successfully.
Passed: 894
Failed: 2
Skipped: 118

@oleksandr-pavlyk
Collaborator Author

The latest addition to this PR also modifies the custom group function implementing the inclusive scan algorithm for a suitable user-defined binary operation (such as logaddexp).

Using the following snippet to micro-benchmark performance:

```
import dpctl.tensor as dpt
import dpctl

x = dpt.ones(2048000, dtype="f4")

q_prof = dpctl.SyclQueue(x.sycl_context, x.sycl_device, property="enable_profiling")
xx = x.to_device(q_prof)
mm = dpt.cumulative_logsumexp(xx)

timer = dpctl.SyclTimer(device_timer="order_manager", time_scale=1e9)
with timer(q_prof):
    for _ in range(250):
        dpt.cumulative_logsumexp(xx, out=mm)

print(f"dpctl.__version__ = {dpctl.__version__}")
print(f"Device: {x.sycl_device}")
print(f"host_dt={timer.dt.host_dt/250}, device_dt={timer.dt.device_dt/250}")
```

Execution on Iris Xe integrated GPU went from 10ms to 1.2ms with this change.
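The behavior of such a group scan can be checked against a sequential reference in plain NumPy (a sketch of the algorithm only; dpctl's group function is implemented in SYCL, and the function names below are illustrative). The doubling-stride pattern is the classic Hillis–Steele inclusive scan, which works for any associative binary operation, including logaddexp:

```python
import numpy as np

def scan_reference(x, op):
    # sequential inclusive scan: out[i] = op(out[i-1], x[i])
    out = np.array(x, copy=True)
    for i in range(1, len(out)):
        out[i] = op(out[i - 1], out[i])
    return out

def scan_hillis_steele(x, op):
    # doubling-stride inclusive scan, as a work-group would perform it:
    # at stride d, element i (for i >= d) combines with element i - d
    out = np.array(x, copy=True)
    d = 1
    while d < len(out):
        prev = out.copy()  # models the barrier between rounds
        out[d:] = op(prev[:-d], prev[d:])
        d *= 2
    return out

x = np.ones(16, dtype=np.float64)
assert np.allclose(scan_hillis_steele(x, np.logaddexp),
                   scan_reference(x, np.logaddexp))
# cumulative logsumexp of ones: log(k * e) = 1 + log(k)
assert np.allclose(scan_reference(x, np.logaddexp),
                   1.0 + np.log(np.arange(1, 17)))
```

A sequential reference like this is also a convenient oracle for unit-testing the device implementation across work-group sizes.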

Test suite passes on cuda:gpu (RTX 3050 for laptop), opencl:gpu (Iris Xe), level_zero:gpu (Iris Xe, Arc 140V, PVC), and opencl:cpu (Intel Core 7 11th gen, 14th gen, and Xeon Platinum).

single_step_scan_striped does not produce correct results for
wg_size > 64, and tests fail.

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_357 ran successfully.
Passed: 894
Failed: 2
Skipped: 118


Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_362 ran successfully.
Passed: 894
Failed: 2
Skipped: 118

Collaborator

@ndgrigorian ndgrigorian left a comment


This LGTM and brings a welcome improvement to the tensor accumulation functions.

@oleksandr-pavlyk oleksandr-pavlyk merged commit 39a19c1 into master Dec 27, 2024
61 of 63 checks passed
@oleksandr-pavlyk oleksandr-pavlyk deleted the improve-base-step-of-scan-algorithm branch December 27, 2024 18:16