[Feature Request] [Pass] DMA-aware pipeline generation in OptimizeForTarget phase. #33

@firefrogliu666

Description

Required prerequisites

  • I have searched the Issue Tracker and confirmed this hasn't already been reported. (Comment there if it has.)

Motivation

The SUNMMIO ZPU has an explicit DMA engine and a single control stream (no thread/warp scheduling). It depends on the compiler-generated pipeline to overlap DMA operations with compute, rather than on implicit warp scheduling. TileLang already provides a series of passes that generate such pipelines for NVIDIA's Hopper architecture (which supports TMA). We can alter/reuse some of these passes to make them generate an optimal pipeline for the SUNMMIO ZPU architecture, along the lines of the following:

    if allow_dma_and_async_copy(pass_ctx=pass_ctx, target=target):
        mod = tilelang.transform.MultiVersionBuffer()(mod)
        mod = tilelang.transform.InjectDmaBarrier()(mod)
        mod = tilelang.transform.PipelinePlanning()(mod)
        mod = tilelang.transform.InjectSoftwarePipeline()(mod)
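
For concreteness, below is a minimal, self-contained sketch of the kind of loop this pipeline should ultimately produce on the ZPU: the DMA for tile i+1 is started while tile i is being computed, with explicit start/wait synchronization instead of warp scheduling. The dma_copy_start/dma_copy_wait helpers and the double-buffering scheme are illustrative assumptions, not existing tilelang intrinsics.

    # A toy Python model of the desired 2-stage pipeline: while tile i is being
    # computed, the DMA for tile i+1 is already in flight.
    # dma_copy_start/dma_copy_wait stand in for the ZPU's DMA intrinsics and are
    # hypothetical names, not existing tilelang ops.
    def pipelined_loop(num_tiles, load_tile, compute_tile):
        buffers = [None, None]        # double-buffered on-chip storage
        inflight = [None, None]       # tile index of the DMA pending per slot

        def dma_copy_start(slot, i):  # issue an asynchronous DMA for tile i
            inflight[slot] = i

        def dma_copy_wait(slot):      # block until the DMA in this slot lands
            buffers[slot] = load_tile(inflight[slot])

        dma_copy_start(0, 0)          # prologue: prefetch the first tile
        for i in range(num_tiles):
            cur, nxt = i % 2, (i + 1) % 2
            if i + 1 < num_tiles:
                dma_copy_start(nxt, i + 1)   # overlap: DMA of tile i+1 ...
            dma_copy_wait(cur)
            compute_tile(buffers[cur])       # ... with compute of tile i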

Functionality/passes that need to be added or altered:
  1. allow_dma_and_async_copy(pass_ctx=pass_ctx, target=target):
    Check for DMA capability based on the target type (see the first sketch after this list).

  2. MultiVersionBuffer:
    Remove thread/warp logic from MultiVersionBuffer. The pass itself has minimal thread-specific code, but we need to ensure that:
  • The buffer versioning logic in MultiVersionBufferRewriter works without thread assumptions
  • The producer/consumer role detection is still valid for our NPU's DMA operations
  3. InjectDmaBarrier:
    We can adapt the existing InjectTmaBarrier pass structure, but:
  • Remove the thread logic: drop thread extent tracking and warp specialization checks
  • Collect the DMA operations that need synchronization
  • Map them to barrier IDs (see the second sketch after this list)
  4. Either alter the PipelinePlanning/InjectSoftwarePipeline passes or create a new pass ScheduleDmaComputeOverlap:
  • Replace thread/warp concepts with DMA channel scheduling (a bounded number of in-flight DMA transfers)
  • Model DMA transfer latency and compute overlap (see the third sketch after this list)
  • Use DMA start/wait operations for synchronization, and remove GPU-specific thread synchronization
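
A minimal sketch of the capability check from item 1, assuming the ZPU either registers its own target kind or advertises its DMA engine through a target attribute; the "zpu" kind name, the "supports_dma" attribute, and the "tl.disable_dma_lowering" config key are all hypothetical, not existing tilelang/TVM definitions:

    from tvm.target import Target

    def allow_dma_and_async_copy(pass_ctx, target: Target) -> bool:
        """Decide whether the DMA-aware pipeline passes should run for this target."""
        # Let users opt out through the pass context (hypothetical config key).
        if pass_ctx.config.get("tl.disable_dma_lowering", False):
            return False
        # Hypothetical: the ZPU is registered as its own target kind ...
        if target.kind.name == "zpu":
            return True
        # ... or exposes its DMA engine via a boolean target attribute.
        return bool(target.attrs.get("supports_dma", False))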
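
For the collection step in item 3 (the second sketch), TVM's post-order visitor can be used to find DMA copy calls and assign them incrementing barrier IDs; the "tl.dma_copy" intrinsic name is an assumption standing in for whatever op the ZPU DMA lowering actually emits:

    import tvm
    from tvm import tir

    # Hypothetical intrinsic emitted for ZPU DMA copies; the real op name depends
    # on how the DMA lowering is implemented.
    DMA_COPY_OP = "tl.dma_copy"

    def collect_dma_barriers(func: tir.PrimFunc):
        """Assign an incrementing barrier id to every DMA copy call in the body."""
        barrier_of_call = {}

        def visit(node):
            if (isinstance(node, tir.Call)
                    and isinstance(node.op, tvm.ir.Op)
                    and node.op.name == DMA_COPY_OP):
                barrier_of_call[node] = len(barrier_of_call)

        tir.stmt_functor.post_order_visit(func.body, visit)
        return barrier_of_call

An InjectDmaBarrier pass would then rewrite each collected call to arrive on and wait for its barrier ID, roughly analogous to what InjectTmaBarrier does for Hopper.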
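
Finally, the third sketch: a toy latency model for item 4 that picks the number of pipeline stages (i.e. how many DMA transfers may be in flight), bounded by the DMA engine's in-flight limit and by how many buffer versions MultiVersionBuffer created; all names and the formula are illustrative assumptions:

    import math

    def choose_num_stages(dma_latency_cycles: int,
                          compute_cycles_per_tile: int,
                          max_inflight_dma: int,
                          buffer_versions: int) -> int:
        """Pick a stage count that hides DMA latency behind per-tile compute."""
        # Stages needed so a tile's DMA completes before its compute starts.
        needed = math.ceil(dma_latency_cycles / max(compute_cycles_per_tile, 1)) + 1
        # Capped by how many transfers the DMA engine can keep in flight (+1 for
        # the tile currently being computed) and by the available buffer versions.
        return max(2, min(needed, max_inflight_dma + 1, buffer_versions))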

Solution

No response

Alternatives

No response

Additional context

No response

Metadata

    Labels: enhancement (New feature or request)
    Assignees: none
    Projects: none
    Milestone: none