Expose load/store optimization hints #32

AntonOresten · 2026-01-13T23:22:58Z

Adds two keyword arguments to load/store functions:

latency: an integer from 1 to 10 hinting at how heavy the operation's DRAM traffic is expected to be (also accepted by gather/scatter)
allow_tma: whether the operation may use the Tensor Memory Accelerator (TMA); accepted by load/store only

Read more: https://docs.nvidia.com/cuda/cutile-python/performance.html
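
For illustration only, here is a rough sketch of the constraints described above: `latency` is an integer from 1 to 10 that applies to all four operations, and `allow_tma` applies to load/store only. It is written in plain Python so it runs without a GPU. The helper `hint_kwargs` and its error messages are hypothetical and are not part of this package or of cuTile Python. How the real functions validate or forward these keywords is not shown in this PR description.

```python
# Hypothetical helper: collects the two hints into keyword arguments
# while enforcing the constraints stated in the PR description.
# This is NOT the package's API, only a stand-in for illustration.

TMA_OPS = ("load", "store")                       # allow_tma: load/store only
ALL_OPS = ("load", "store", "gather", "scatter")  # latency: all four


def hint_kwargs(op, latency=None, allow_tma=None):
    """Return a dict of the hints that were explicitly provided for `op`."""
    if op not in ALL_OPS:
        raise ValueError(f"unknown operation: {op!r}")

    kwargs = {}

    if latency is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(latency, bool) or not isinstance(latency, int):
            raise TypeError("latency must be an integer")
        if not 1 <= latency <= 10:
            raise ValueError("latency must be between 1 and 10")
        kwargs["latency"] = latency

    if allow_tma is not None:
        if op not in TMA_OPS:
            raise ValueError("allow_tma is only accepted by load/store")
        if not isinstance(allow_tma, bool):
            raise TypeError("allow_tma must be a bool")
        kwargs["allow_tma"] = allow_tma

    return kwargs
```

Under these assumptions, `hint_kwargs("load", latency=3, allow_tma=False)` yields both hints, `hint_kwargs("gather", latency=10)` yields only `latency`, and passing `allow_tma` for a gather or scatter is rejected. Omitted hints are left out of the result, so the caller's own defaults apply.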

This should allow hand-tuning to make some kernels even faster, including #16.

Closes #25

maleadt

LGTM, thanks!

AntonOresten added 2 commits January 14, 2026 01:11

expose load/store/gather/scatter optimization hints

5a603b2

docstring and comment fixes

24a25bb

maleadt approved these changes Jan 14, 2026

maleadt merged commit e033d5e into JuliaGPU:main Jan 14, 2026
8 checks passed
