Improve base step of scan algorithm #1942
Conversation
Force-pushed from 60f60c6 to 18c5ed7.
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_352 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_351 ran successfully.
Using

```
import dpctl.tensor as dpt
import dpctl

x = dpt.ones(2048000, dtype="f4")
q_prof = dpctl.SyclQueue(x.sycl_context, x.sycl_device, property="enable_profiling")
xx = x.to_device(q_prof)
mm = dpt.cumulative_logsumexp(xx)

timer = dpctl.SyclTimer(device_timer="order_manager", time_scale=1e9)
with timer(q_prof):
    for _ in range(250):
        dpt.cumulative_logsumexp(xx, out=mm)

print(f"dpctl.__version__ = {dpctl.__version__}")
print(f"Device: {x.sycl_device}")
print(f"host_dt={timer.dt.host_dt/250}, device_dt={timer.dt.device_dt/250}")
```

Testing on Iris Xe from WSL. This branch:

```
$ python ~/cumlogsumexp.py
dpctl.__version__ = 0.19.0dev0+351.gffd26092a0.dirty
Device: <dpctl.SyclDevice [backend_type.level_zero, device_type.gpu, Intel(R) Graphics [0x9a49]] at 0x7f37a8f995f0>
host_dt=1059589.7079911083, device_dt=1154782.72
```

vs. main branch:

```
$ python cumlogsumexp.py
dpctl.__version__ = 0.19.0dev0+307.g04a8228748
Device: <dpctl.SyclDevice [backend_type.level_zero, device_type.gpu, Intel(R) Graphics [0x9a49]] at 0x7ff6147d3cf0>
host_dt=2721938.803792, device_dt=10048323.168
```

So this is about an 8x speed-up.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_353 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_354 ran successfully.
The latest addition to this PR also modified the custom group function implementing the inclusive scan algorithm for a suitable user-defined binary operation (such as …).

Using the following snippet to micro-benchmark performance:

```
import dpctl.tensor as dpt
import dpctl

x = dpt.ones(2048000, dtype="f4")
q_prof = dpctl.SyclQueue(x.sycl_context, x.sycl_device, property="enable_profiling")
xx = x.to_device(q_prof)
mm = dpt.cumulative_logsumexp(xx)

timer = dpctl.SyclTimer(device_timer="order_manager", time_scale=1e9)
with timer(q_prof):
    for _ in range(250):
        dpt.cumulative_logsumexp(xx, out=mm)

print(f"dpctl.__version__ = {dpctl.__version__}")
print(f"Device: {x.sycl_device}")
print(f"host_dt={timer.dt.host_dt/250}, device_dt={timer.dt.device_dt/250}")
```

Execution on the Iris Xe integrated GPU went from 10 ms to 1.2 ms with this change. The test suite passes on cuda:gpu (RTX 3050 for laptop), opencl:gpu (Iris Xe), level_zero:gpu (Iris Xe, Arc 140V, PVC), and opencl:cpu (Intel Core 7 11th gen, 14th gen, and Xeon Platinum).
`single_step_scan_striped` does not produce correct results for `wg_size > 64`, and tests fail.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_357 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_362 ran successfully.
This LGTM and brings a welcome improvement to the tensor accumulation functions.
Change to the base step of the iterative scan algorithm. When working on contiguous memory, for small values of the `s1` stride, it is advantageous to load memory using striped data placement and reorder the elements in the kernel via shared local memory. This provides a 40-50% speed-up for `tensor.cumulative_sum`. Execution time measured by `timer.device_dt / 120` drops from 20.9 ms on master to 14.3 ms with this change on a Max GPU.
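To illustrate the data-placement change described above: with several elements per work-item, a "blocked" load has each work-item read its own contiguous chunk, while a "striped" load has adjacent work-items read adjacent addresses (which coalesces global-memory accesses), after which the elements are reordered through shared local memory. The sketch below only shows the two index mappings on the host; the names `blocked_indices`, `striped_indices`, `wg_size`, and `n_wi` are illustrative assumptions, not identifiers from the PR:

```
import numpy as np

def blocked_indices(wg_size, n_wi):
    # work-item `lid` reads elements [lid*n_wi, lid*n_wi + n_wi): contiguous per work-item
    return np.array([[lid * n_wi + i for i in range(n_wi)] for lid in range(wg_size)])

def striped_indices(wg_size, n_wi):
    # work-item `lid` reads elements lid, lid + wg_size, lid + 2*wg_size, ...:
    # adjacent work-items touch adjacent addresses, so loads coalesce
    return np.array([[i * wg_size + lid for i in range(n_wi)] for lid in range(wg_size)])

wg_size, n_wi = 4, 3
b = blocked_indices(wg_size, n_wi)
s = striped_indices(wg_size, n_wi)
# Both orderings cover the same wg_size * n_wi elements; only the per-work-item
# assignment differs, which is why a reorder through local memory is needed.
assert sorted(b.ravel()) == sorted(s.ravel())
print("blocked:\n", b)
print("striped:\n", s)
```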