[Feature Request] [Pass] DMA-aware pipeline generation in OptimizeForTarget phase. #33

@firefrogliu666

Description

Required prerequisites

  • I have searched the Issue Tracker and confirmed this hasn't already been reported. (Comment there if it has.)

Motivation

The SUNMMIO ZPU has an explicit DMA engine and a single control stream (no thread/warp scheduling). It depends on the compiler-generated pipeline to overlap DMA operations with compute, rather than on implicit warp scheduling. TileLang already provides a series of passes that generate such pipelines for NVIDIA's Hopper architecture (which supports TMA). We can alter/reuse some of these passes to make them generate an optimal pipeline for the SUNMMIO ZPU architecture, along the lines of the following:

    if allow_dma_and_async_copy(pass_ctx=pass_ctx, target=target):
        mod = tilelang.transform.MultiVersionBuffer()(mod)
        mod = tilelang.transform.InjectDmaBarrier()(mod)
        mod = tilelang.transform.PipelinePlanning()(mod)
        mod = tilelang.transform.InjectSoftwarePipeline()(mod)
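
For concreteness, below is a minimal, self-contained sketch of the kind of loop this pipeline should ultimately produce on the ZPU: the DMA for tile i+1 is started while tile i is being computed, with explicit start/wait synchronization instead of warp scheduling. The dma_copy_start/dma_copy_wait helpers and the double-buffering scheme are illustrative assumptions, not existing tilelang intrinsics.

    # A toy Python model of the desired 2-stage pipeline: while tile i is being
    # computed, the DMA for tile i+1 is already in flight.
    # dma_copy_start/dma_copy_wait stand in for the ZPU's DMA intrinsics and are
    # hypothetical names, not existing tilelang ops.
    def pipelined_loop(num_tiles, load_tile, compute_tile):
        buffers = [None, None]        # double-buffered on-chip storage
        inflight = [None, None]       # tile index of the DMA pending per slot

        def dma_copy_start(slot, i):  # issue an asynchronous DMA for tile i
            inflight[slot] = i

        def dma_copy_wait(slot):      # block until the DMA in this slot lands
            buffers[slot] = load_tile(inflight[slot])

        dma_copy_start(0, 0)          # prologue: prefetch the first tile
        for i in range(num_tiles):
            cur, nxt = i % 2, (i + 1) % 2
            if i + 1 < num_tiles:
                dma_copy_start(nxt, i + 1)   # overlap: DMA of tile i+1 ...
            dma_copy_wait(cur)
            compute_tile(buffers[cur])       # ... with compute of tile i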

Functionality/passes that need to be added or altered:
  1. allow_dma_and_async_copy(pass_ctx=pass_ctx, target=target):
    Check for DMA capability based on the target type (see the first sketch after this list).

  2. MultiVersionBuffer:
    Remove thread/warp logic from MultiVersionBuffer. The pass itself has minimal thread-specific code, but we need to ensure that:
  • The buffer versioning logic in MultiVersionBufferRewriter works without thread assumptions
  • The producer/consumer role detection is still valid for our NPU's DMA operations
  3. InjectDmaBarrier:
    We can adapt the existing InjectTmaBarrier pass structure, but:
  • Remove the thread logic: drop thread extent tracking and warp specialization checks
  • Collect the DMA operations that need synchronization
  • Map them to barrier IDs (see the second sketch after this list)
  4. Either alter the PipelinePlanning/InjectSoftwarePipeline passes or create a new pass ScheduleDmaComputeOverlap:
  • Replace thread/warp concepts with DMA channel scheduling (a bounded number of in-flight DMA transfers)
  • Model DMA transfer latency and compute overlap (see the third sketch after this list)
  • Use DMA start/wait operations for synchronization, and remove GPU-specific thread synchronization
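
A minimal sketch of the capability check from item 1, assuming the ZPU either registers its own target kind or advertises its DMA engine through a target attribute; the "zpu" kind name, the "supports_dma" attribute, and the "tl.disable_dma_lowering" config key are all hypothetical, not existing tilelang/TVM definitions:

    from tvm.target import Target

    def allow_dma_and_async_copy(pass_ctx, target: Target) -> bool:
        """Decide whether the DMA-aware pipeline passes should run for this target."""
        # Let users opt out through the pass context (hypothetical config key).
        if pass_ctx.config.get("tl.disable_dma_lowering", False):
            return False
        # Hypothetical: the ZPU is registered as its own target kind ...
        if target.kind.name == "zpu":
            return True
        # ... or exposes its DMA engine via a boolean target attribute.
        return bool(target.attrs.get("supports_dma", False))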
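
For the collection step in item 3 (the second sketch), TVM's post-order visitor can be used to find DMA copy calls and assign them incrementing barrier IDs; the "tl.dma_copy" intrinsic name is an assumption standing in for whatever op the ZPU DMA lowering actually emits:

    import tvm
    from tvm import tir

    # Hypothetical intrinsic emitted for ZPU DMA copies; the real op name depends
    # on how the DMA lowering is implemented.
    DMA_COPY_OP = "tl.dma_copy"

    def collect_dma_barriers(func: tir.PrimFunc):
        """Assign an incrementing barrier id to every DMA copy call in the body."""
        barrier_of_call = {}

        def visit(node):
            if (isinstance(node, tir.Call)
                    and isinstance(node.op, tvm.ir.Op)
                    and node.op.name == DMA_COPY_OP):
                barrier_of_call[node] = len(barrier_of_call)

        tir.stmt_functor.post_order_visit(func.body, visit)
        return barrier_of_call

An InjectDmaBarrier pass would then rewrite each collected call to arrive on and wait for its barrier ID, roughly analogous to what InjectTmaBarrier does for Hopper.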
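
Finally, the third sketch: a toy latency model for item 4 that picks the number of pipeline stages (i.e. how many DMA transfers may be in flight), bounded by the DMA engine's in-flight limit and by how many buffer versions MultiVersionBuffer created; all names and the formula are illustrative assumptions:

    import math

    def choose_num_stages(dma_latency_cycles: int,
                          compute_cycles_per_tile: int,
                          max_inflight_dma: int,
                          buffer_versions: int) -> int:
        """Pick a stage count that hides DMA latency behind per-tile compute."""
        # Stages needed so a tile's DMA completes before its compute starts.
        needed = math.ceil(dma_latency_cycles / max(compute_cycles_per_tile, 1)) + 1
        # Capped by how many transfers the DMA engine can keep in flight (+1 for
        # the tile currently being computed) and by the available buffer versions.
        return max(2, min(needed, max_inflight_dma + 1, buffer_versions))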

Solution

No response

Alternatives

No response

Additional context

No response

Metadata

    Labels: enhancement (New feature or request)
    Assignees: none
    Projects: none
    Milestone: none