Labels
feature request · priority: P1 · status: triaged (reviewed by maintainers and assigned)
Description
Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request?
High
Please provide a clear description of the problem this feature solves
In CUDA programming, we use atomic operations or cooperative groups to synchronize execution across thread blocks.
cuTile could provide a similar mechanism to help developers write complex multi-stage kernels more simply.
Feature Description
Example:
import torch
import cuda.tile as ct

@ct.kernel
def device_norm(
        x: ct.Array, y: ct.Array, workspace: ct.Array,
        tile_size: ct.Constant, p: ct.Constant):
    # Create a barrier in global memory, expecting p blocks to reach it.
    barrier = ct.barrier(p=p)
    block_id = ct.bid(0)
    tile = ct.load(x, index=(block_id, 0), shape=(1, tile_size))
    # Each block adds its local mean to the shared accumulator.
    mean = ct.sum(tile) / tile_size
    ct.atomic_add(workspace, (0,), mean)
    # Wait until all p blocks have reached this point.
    barrier.wait()
    global_mean = ct.load(workspace, index=(0,), shape=(1,))
    global_mean = global_mean / p
    # Center the tile by the global mean and write it back.
    tile = tile - global_mean
    ct.store(y, index=(block_id, 0), tile=tile)
Describe your ideal solution
Provide ct.barrier, or a similar feature, to make it easier for developers to write applications that require block-level synchronization.
There are multiple ways to implement ct.barrier:
- Allocate a region in global memory for synchronization, and let each block atomically increment a counter when it reaches the barrier.
- Use cooperative groups.
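The first option is essentially a counter barrier kept in global memory. As a rough host-side sketch of that protocol (not cuTile code), the following uses Python threads to stand in for thread blocks and a lock-protected counter to stand in for the atomic counter; the class name CounterBarrier and its layout are made up for illustration:

```python
import threading

class CounterBarrier:
    """Sketch of the counter-in-global-memory protocol: each arrival
    atomically increments a counter; the last of p arrivals resets it
    and releases the waiters (generation flag makes it reusable)."""

    def __init__(self, p):
        self.p = p                # number of blocks expected
        self.count = 0            # stands in for the global-memory counter
        self.generation = 0       # flips each time the barrier opens
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            gen = self.generation
            self.count += 1       # stands in for atomicAdd on the counter
            if self.count == self.p:
                # Last arrival: reset the counter and release everyone.
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                # Wait for the generation to advance (a real kernel
                # would spin on a flag in global memory instead).
                while gen == self.generation:
                    self.cond.wait()
```

On a GPU, the same scheme needs the increment done with an atomic RMW and the release flag read with acquire semantics, and it is only safe when all p blocks are co-resident (otherwise a waiting block can deadlock blocks that were never scheduled) — which is exactly the guarantee cooperative launch provides.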
Describe any alternatives you have considered
No response
Additional context
No response
Contributing Guidelines
- I agree to follow cuTile Python's contributing guidelines
- I have searched the open feature requests and have found no duplicates for this feature request