
Naive register <-> tmem load/store support #3786

Merged: zasdfgbnm merged 7 commits from tmem-no-alloc into main on Jan 30, 2025
Conversation

@zasdfgbnm (Collaborator) commented Jan 29, 2025

Extracted from #3755 to make code review easier.

This PR adds a new unit test, TMemTest.GmemRegTMemRegGmemCopy, that schedules a copy kernel gmem -> register -> tmem -> register -> gmem, and updates our system with the minimum changes required to make this test pass.

The purpose of this PR is not to provide a good implementation of TMem support, but just to provide the absolute minimum required for us to get started. Limitations are:

  1. The index is hard-coded to zero, so this PR does not touch the interesting topic of "how to schedule a TMem tensor?"
  2. TMem is used without allocation. Using memory that has not been allocated is clearly a wrong way to program, but, as described in the code comment, if a fusion has only one TMem TensorView, it is guaranteed to work.

Generated code:

__global__ void nvfuser_none_f0_c0_r0_g0(Tensor<float, 1, 1> T0, Tensor<float, 1, 1> T4) {
  nvfuser_index_t i0;
  i0 = ((nvfuser_index_t)threadIdx.x) + (32 * ((nvfuser_index_t)blockIdx.x));
  bool b1;
  b1 = i0 < T0.logical_size[0LL];
  Array<float, 1, 1> T1;
  T1[0] = 0;
  if (b1) {
    T1[0]
       = T0[((T0.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32 * T0.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  asm volatile(
    "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n"
    :
    :"r"(0U),
     "f"((*reinterpret_cast<Array<float, 1, 1>*>(&T1[0]))[0])
  );
  asm volatile("tcgen05.wait::st.sync.aligned;\n");
  Array<float, 1, 1> T3;
  asm(
    "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
    :"=f"((*reinterpret_cast<Array<float, 1, 1>*>(&T3[0]))[0])
    :"r"(0U)
  );
  asm volatile("tcgen05.wait::ld.sync.aligned;\n");
  if (b1) {
    T4[i0]
       = T3[0];
  }
}
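For orientation, the fusion behind a test like this might be defined roughly as follows. This is a sketch, not the actual TMemTest.GmemRegTMemRegGmemCopy code from the PR: the calls used (makeContigTensor, set, setMemoryType, split, parallelize) are existing nvFuser C++ API, but the exact scheduling and how the LdTMem/StTMem op types get attached to the copies are assumptions.

// Sketch: gmem -> register -> tmem -> register -> gmem copy fusion.
// Not the actual test code from this PR.
Fusion fusion;
FusionGuard fg(&fusion);

TensorView* tv0 = makeContigTensor(1); // input in global memory
fusion.addInput(tv0);
TensorView* tv1 = set(tv0);            // gmem -> register
TensorView* tv2 = set(tv1);            // register -> tmem
TensorView* tv3 = set(tv2);            // tmem -> register
TensorView* tv4 = set(tv3);            // register -> gmem
fusion.addOutput(tv4);

tv2->setMemoryType(MemoryType::Tensor); // place the middle copy in TMem
// (This sketch assumes lowering picks LoadStoreOpType::StTMem / LdTMem for
// the copies into and out of tv2 based on the Tensor memory type.)

// One element per thread; 32 threads per block, matching the
// "threadIdx.x + 32 * blockIdx.x" indexing in the generated kernel.
for (TensorView* tv : {tv1, tv2, tv3, tv4}) {
  tv->split(0, 32);
  tv->axis(0)->parallelize(ParallelType::BIDx);
  tv->axis(1)->parallelize(ParallelType::TIDx);
}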


github-actions bot commented Jan 29, 2025

PR Reviewer Guide 🔍

(Review updated until commit 1b1f4cc)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
⚡ Recommended focus areas for review

Missing Allocation

The current implementation does not allocate tensor memory, which may lead to issues with multiple CTAs accessing the same memory.

TensorMemoryInfo computeTMemInfo(Fusion* fusion) {
  bool found = false;
  for (auto tv : fusion->allTvs()) {
    if (tv->getMemoryType() == MemoryType::Tensor) {
      NVF_ERROR(!found, "Only one tensor on TMem is supported");
      found = true;
    }
  }
  return {};
}
Hardcoded Index

The index is hardcoded to zero, which may not be the intended behavior for all use cases.

// TODO: hard coded index zero for now.
auto index = IrBuilder::create<Val>(0, DataType::UInt32);
in = IrBuilder::create<kir::TensorIndex>(
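For background on what a non-zero index would eventually need to encode (not addressed in this PR): per the PTX documentation, a TMem address is a 32-bit value with the lane (row) in the upper 16 bits and the column in the lower 16 bits, so real indexing would build something like the hypothetical helper below rather than the constant 0.

// Hypothetical helper, not part of this PR: compose a 32-bit TMem address
// from a lane (row) index and a column index, assuming the
// (lane << 16) | column layout described in the PTX ISA.
__device__ inline uint32_t composeTMemAddress(uint32_t lane, uint32_t column) {
  return (lane << 16) | (column & 0xFFFFu);
}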
Limited Support

The current implementation only supports 32-bit types in tensor memory, which may limit its usability.

  // TODO: support other types of ld/st
  auto ptx = "tcgen05.ld.sync.aligned.32x32b.x1.b32";
  registerReplace(
      ldst,
      IrBuilder::create<kir::Asm>(
          ptx,
          std::vector<Val*>{ldst->out()},
          std::vector<Val*>{ldst->in()}));
  auto wait_ptx = "tcgen05.wait::ld.sync.aligned";
  registerInsertAfter(
      ldst,
      IrBuilder::create<kir::Asm>(
          wait_ptx,
          std::vector<Val*>{},
          std::vector<Val*>{},
          kir::Asm::Options{/*volatile=*/true}));
} else if (ldst->opType() == LoadStoreOpType::StTMem) {
  // TODO: support other types of ld/st
  auto ptx = "tcgen05.st.sync.aligned.32x32b.x1.b32";
  registerReplace(
      ldst,
      IrBuilder::create<kir::Asm>(
          ptx,
          std::vector<Val*>{},
          std::vector<Val*>{ldst->out(), ldst->in()},
          kir::Asm::Options{/*volatile=*/true}));
  auto wait_ptx = "tcgen05.wait::st.sync.aligned";
  registerInsertAfter(
      ldst,
      IrBuilder::create<kir::Asm>(
          wait_ptx,
          std::vector<Val*>{},
          std::vector<Val*>{},
          kir::Asm::Options{/*volatile=*/true}));
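One way the 32-bit, x1-only limitation might eventually be lifted (a sketch, not code from this PR; the helper name is invented): derive the .xN suffix of the PTX string from the number of 32-bit registers moved per access, which the PTX ISA allows to be any power of two from 1 to 128 for the 32x32b shape.

// Hypothetical helper: build the tcgen05.ld PTX string for N registers
// per access instead of hard-coding ".x1". Uses std::string/std::to_string.
std::string tmemLoadPtx(int64_t num_regs) {
  NVF_ERROR(
      num_regs >= 1 && num_regs <= 128 &&
          (num_regs & (num_regs - 1)) == 0,
      "num_regs must be a power of two between 1 and 128");
  return "tcgen05.ld.sync.aligned.32x32b.x" + std::to_string(num_regs) +
      ".b32";
}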

zasdfgbnm marked this pull request as ready for review on January 29, 2025 at 06:20.
@zasdfgbnm (Collaborator, Author)

!test

@naoyam (Collaborator) left a comment

Overall LGTM. Just left one small question.

zasdfgbnm merged commit 149c163 into main on Jan 30, 2025; 51 checks passed.
zasdfgbnm deleted the tmem-no-alloc branch on January 30, 2025 at 08:23.
@rdspring1 (Collaborator) left a comment

Do you plan to wrap the PTX assembly with CUDA functions?

// different CTA to different physical address. There is no virtual TMem
// address. All addresses are physical addresses.
//
// Because multiple CTAs can execute on the same SM simultaneously, there must
Collaborator (commenting on the code excerpt above)

Due to this handshaking mechanism, is it better to have only a single CTA occupy an SM?

@zasdfgbnm (Collaborator, Author)

Are you talking about kernel design for better perf? My guess is that if you allocate at the beginning of the kernel and relinquish right after allocating, the latency should be acceptable if you want to run multiple CTAs on an SM. But we need to test it before drawing any conclusion.

Collaborator

Yes, for maximum performance.
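For context on the scheme discussed above, "allocate at the beginning and relinquish right after" would look roughly like the prologue/epilogue below in generated CUDA. This is a sketch based on the PTX ISA's tcgen05.alloc, tcgen05.relinquish_alloc_permit, and tcgen05.dealloc instructions, not code from this PR; the column count and the single-warp assumption are purely illustrative.

// Sketch of a TMem allocation prologue/epilogue (single warp per CTA assumed).
__shared__ uint32_t tmem_base;

// tcgen05.alloc writes the allocated TMem base address into shared memory.
// The column count must be a power of two (32 here, illustrative only).
uint32_t ncols = 32;
asm volatile(
    "tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;\n"
    :
    : "r"((uint32_t)__cvta_generic_to_shared(&tmem_base)), "r"(ncols));

// Signal that this CTA will not allocate any more TMem.
asm volatile("tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;\n");

__syncthreads();
uint32_t tmem_addr = tmem_base;

// ... tcgen05.st / tcgen05.ld through tmem_addr, as in the generated kernel ...

// Free the columns before the CTA exits.
asm volatile(
    "tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;\n"
    :
    : "r"(tmem_addr), "r"(ncols));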

@zasdfgbnm (Collaborator, Author)

> Do you plan to wrap the PTX assembly with CUDA functions?

I like kir::Asm more than CUDA functions. Our generated .cu file is already 10k+ lines of code, and I don't want to add more unless necessary.
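For comparison, the wrapper approach being asked about would look roughly like this. The sketch is derived from the inline asm in the generated kernel above; the function names are invented, and this is not what the PR implements.

// Hypothetical device-side wrappers for the tcgen05 load/store used above.
__device__ inline void tmemStore32x32bX1(uint32_t tmem_addr, float value) {
  asm volatile(
      "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n"
      :
      : "r"(tmem_addr), "f"(value));
  asm volatile("tcgen05.wait::st.sync.aligned;\n");
}

__device__ inline float tmemLoad32x32bX1(uint32_t tmem_addr) {
  float value;
  asm(
      "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
      : "=f"(value)
      : "r"(tmem_addr));
  asm volatile("tcgen05.wait::ld.sync.aligned;\n");
  return value;
}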
