Labels
feature request · priority: P1 · status: triaged (reviewed by maintainers and assigned)
Description
Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request?
High
Please provide a clear description of the problem this feature solves
In CUDA programming, we use atomic operations or cooperative groups to synchronize execution across thread blocks.
cuTile could provide a similar mechanism to help developers write complex multi-stage kernels more simply.
Feature Description
Example:
import torch
import cuda.tile as ct

@ct.kernel
def device_norm(
        x: ct.Array, y: ct.Array, workspace: ct.Array,
        tile_size: ct.Constant, p: ct.Constant):
    # Create a barrier in global memory, expecting p blocks to reach it.
    barrier = ct.barrier(p=p)
    block_id = ct.bid(0)
    tile = ct.load(x, index=(block_id, 0), shape=(1, tile_size))
    # Each block adds its local mean to the shared accumulator.
    mean = ct.sum(tile) / tile_size
    ct.atomic_add(workspace, (0,), mean)
    # Wait until all p blocks have reached this point.
    barrier.wait()
    global_mean = ct.load(workspace, index=(0,), shape=(1,))
    global_mean = global_mean / p
    # Center the tile by the global mean and write it back.
    tile = tile - global_mean
    ct.store(y, index=(block_id, 0), tile=tile)
Describe your ideal solution
Provide ct.barrier, or a similar feature, to make it easier for developers to write applications that require block-level synchronization.
There are multiple ways to implement ct.barrier:
- Allocate a region in global memory for synchronization, and let each block atomically increment a counter when it reaches the barrier.
- Use cooperative groups.
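The first option is essentially a counter barrier kept in global memory. As a rough host-side sketch of that protocol (not cuTile code), the following uses Python threads to stand in for thread blocks and a lock-protected counter to stand in for the atomic counter; the class name CounterBarrier and its layout are made up for illustration:

```python
import threading

class CounterBarrier:
    """Sketch of the counter-in-global-memory protocol: each arrival
    atomically increments a counter; the last of p arrivals resets it
    and releases the waiters (generation flag makes it reusable)."""

    def __init__(self, p):
        self.p = p                # number of blocks expected
        self.count = 0            # stands in for the global-memory counter
        self.generation = 0       # flips each time the barrier opens
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            gen = self.generation
            self.count += 1       # stands in for atomicAdd on the counter
            if self.count == self.p:
                # Last arrival: reset the counter and release everyone.
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                # Wait for the generation to advance (a real kernel
                # would spin on a flag in global memory instead).
                while gen == self.generation:
                    self.cond.wait()
```

On a GPU, the same scheme needs the increment done with an atomic RMW and the release flag read with acquire semantics, and it is only safe when all p blocks are co-resident (otherwise a waiting block can deadlock blocks that were never scheduled) — which is exactly the guarantee cooperative launch provides.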
Describe any alternatives you have considered
No response
Additional context
No response
Contributing Guidelines
- I agree to follow cuTile Python's contributing guidelines
- I have searched the open feature requests and have found no duplicates for this feature request