forked from tile-ai/tilelang
-
Notifications
You must be signed in to change notification settings - Fork 6
Open
0 / 40 of 4 issues completedOpen
0 / 40 of 4 issues completed
Copy link
Labels
enhancementNew feature or requestNew feature or request
Description
Required prerequisites
- I have searched the Issue Tracker that this hasn't already been reported. (comment there if it has.)
Motivation
The SUNMMIO ZPU has an explicit DMA engine and a single control stream (No thread/warp schedualing). It depends on the compiler-generated pipeline to overlap DMA operations and compute rather than implicit warp scheduling. Tilelang has provided a seriers of passes to generate pipeline for nvidia's Hopper architecture (TMA supported). We can alter/reuse some of these passes to make them generate the optimal pipline for the SUNMMIO ZPU architecture like the following:
if allow_dma_and_async_copy(pass_ctx=pass_ctx, target=target):
mod = tilelang.transform.MultiVersionBuffer()(mod)
mod = tilelang.transform.InjectDmaBarrier()(mod)
mod = tilelang.transform.PipelinePlanning()(mod)
mod = tilelang.transform.InjectSoftwarePipeline()(mod)Functionality/Pass needed to add/alter:
1.allow_dma_and_async_copy(pass_ctx=pass_ctx, target=target):
Check for DMA capability base on the target type.
MultiVersionBuffer:
Remove Thread/Warp Logic from MultiVersionBuffer. The MultiVersionBuffer pass itself has minimal thread-specific code, but we need to ensure:
- The buffer versioning logic in MultiVersionBufferRewriter works without thread assumptions
- The producer/consumer role detection is still valid for our NPU's DMA operations
InjectDmaBarrier:
We can adapt the existingInjectTmaBarrierpass structure but:
- Remove the thread logic. Remove thread extent tracking and warp specialization checks
- Collect DMA operations that need synchronization
- Map them to barrier IDs
- Either alter the
PipelinePlanning/InjectSoftwarePipelinepasses or create a new passScheduleDmaComputeOverlap:
- Replace thread/warp concepts with DMA channel scheduling (DMA inflight)
- Model DMA transfer latency and compute overlap
- Use DMA start/wait operations to sync, remove GPU-specific thread synchronization
Solution
No response
Alternatives
No response
Additional context
No response
Reactions are currently unavailable
Sub-issues
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request