Tensor memory support 1 #3755

Closed
wants to merge 25 commits into from

Conversation

zasdfgbnm (Collaborator) commented:

Not ready for review


github-actions bot commented Jan 24, 2025

PR Reviewer Guide 🔍

(Review updated until commit f9adf69)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 5 🔵🔵🔵🔵🔵
🧪 PR contains tests
⚡ Recommended focus areas for review

Memory Leak

The computeTMemInfo function allocates memory for TensorMemoryInfo and TMemAlllocationInfo but does not deallocate it. This could lead to memory leaks.

TensorMemoryInfo computeTMemInfo(Fusion* fusion) {
  TensorMemoryInfo result;

  // Compute the allocation information for tensor memory. Currently, we use a
  // very simple heuristic that assigns a separate region for each TensorView.
  // See note [Tensor Memory Allocation] for the overall design.
  auto& regions = result.allocation.regions;
  for (auto tv : fusion->allTvs()) {
    if (tv->getMemoryType() != MemoryType::Tensor) {
      continue;
    }
    regions.emplace_back();
    auto& region = regions.back();

    region.address = TensorViewBuilder()
                         .shape(std::vector<Val*>{})
                         .dtype(DataType::UInt32)
                         .build();
    region.address->setMemoryType(MemoryType::Shared);

    // TODO: right now we hardcode the number of columns to be 32. This is
    // definitely not correct.
    region.num_columns = IrBuilder::create<Val>(32, DataType::UInt32);

    region.covered_tensors.emplace_back();
    auto& covered_tensor = region.covered_tensors.back();
    covered_tensor.tensor = tv;
    covered_tensor.lane_offset = tv->fusion()->zeroVal(DataType::UInt16);
    covered_tensor.column_offset = tv->fusion()->zeroVal(DataType::UInt16);
  }

  return result;
}
Potential Bug

The insertTMemRegionAllocsAndDeallocs function inserts deallocations after the outermost serial loop containing the last read of each TMem region. However, it does not check whether a deallocation for a region has already been inserted before registering another one, which could lead to duplicate deallocations.

    return inserter.exprs_;
  }
};

// Insert IR nodes that allocate and deallocate TMem regions.
// See note [Tensor Memory Allocation] for the overall design.
// We insert the tcgen05.allocs of each region and the relinquish of the right
// to allocate at the beginning of the top-level scope of the kernel. We insert
// the tcgen05.deallocs after the outermost serial loop containing the last read
// of each TMem region into whatever scope containing this outermost serial
// loop. The allocation of each TMem TensorView within each region is inserted
// by AllocationInserter::insert, therefore not handled here.
std::vector<Expr*> insertTMemRegionAllocsAndDeallocs(
    const std::vector<Expr*>& exprs) {
  // Expressions to be inserted at the beginning of the top-level scope.
  std::list<Expr*> prologue;
  {
    // For each TMem region, allocate its address in shared memory, and insert
    // the tcgen05.alloc for tensor memory allocation.
    for (const auto& region :
         GpuLower::current()->tmemInfo().allocation.regions) {
      // kir::Allocate for the address tensor on shared memory
      auto address_alloc_expr =
          IrBuilder::create<kir::Allocate>(region.address, MemoryType::Shared);
      prologue.push_back(address_alloc_expr);
      // the tcgen05.alloc instruction
      auto alloc_expr =
          IrBuilder::create<kir::AllocTMem>(region.address, region.num_columns);
      prologue.push_back(alloc_expr);
    }

    // Relinquish the right to allocate after all regions have been allocated
    auto tcgen05_relinquish_expr = IrBuilder::create<kir::Asm>(
        "tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned",
        std::vector<Val*>{},
        std::vector<Val*>{},
        kir::Asm::Options{/*volatile=*/true});
    prologue.push_back(tcgen05_relinquish_expr);

    // Block sync that makes allocation visible to all threads
    auto block_sync = IrBuilder::create<kir::BlockSync>();
    prologue.push_back(block_sync);
  }

  // Add deallocations to existing expressions
  std::vector<Expr*> exprs_with_deallocs;
  {
    class DeallocInserter : public kir::ExprMutator {
      // A map:
      //   region -> a function that registers the deallocation expression for
      //             this region
      //
      // This map is updated during traversal. For example, if we have a kernel
      // like below:
      //   ...
      //   T1_t = T0_r; // expr1
      //   ...
      //   T2_r = T1_t; // expr2
      // Assume that T1_t is in region R1. Then after we handle(expr1), we will
      // have an entry:
      //    R1 -> a function registering insertion of "dealloc R1" after expr1
      // After handle(expr2), this entry becomes:
      //    R1 -> a function registering insertion of "dealloc R1" after expr2
      //
      // After traversing the entire kernel, this map will contain the final
      // register functions we want to execute.
      std::unordered_map<
          const TMemAlllocationInfo::Region*,
          std::function<void()>>
          region_to_register_dealloc_map_;

      // A map:
      //   expr -> the regions that this expr is accessing
      // Note that if expr is a container such as ForLoop or IfThenElse, then
      // the mapped regions will be all the regions the contained exprs are
      // accessing.
      //
      // This map only contains information about accesses we have discovered,
      // and is updated during traversal. For example, if we have a kernel:
      //   ForLoop: // loop1
      //     T2_t = T0_r; // expr1
      //     ...
      //     T3_t = T1_r; // expr2
      // Assume T2_t is in region R2 and T3_t is in region R3. Then after
      // handle(expr1), we will have a map:
      //    expr1 -> {R2}
      //    loop1 -> {R2}
      // After handle(expr2), this map becomes:
      //    expr1 -> {R2}
      //    expr2 -> {R3}
      //    loop1 -> {R2, R3}
      std::unordered_map<
          Expr*,
          VectorOfUniqueEntries<const TMemAlllocationInfo::Region*>>
          access_map_;

      // Analyze expr to see if it has any accesses to tensor memory. If yes
      // update the access map for this expr and its container exprs.
      void updateAccessMap(Expr* expr) {
        std::unordered_set<Val*> io_vals;
        std::copy(
            expr->inputs().begin(),
            expr->inputs().end(),
            std::inserter(io_vals, io_vals.end()));
        std::copy(
            expr->outputs().begin(),
            expr->outputs().end(),
            std::inserter(io_vals, io_vals.end()));
        if (io_vals.empty()) {
          return;
        }
        for (const auto& region :
             GpuLower::current()->tmemInfo().allocation.regions) {
          for (auto tv_info : region.covered_tensors) {
            if (io_vals.count(tv_info.tensor)) {
              access_map_[expr].pushBack(&region);
              for (auto container : scope_exprs_) {
                access_map_[container].pushBack(&region);
              }
              break;
            }
          }
        }
      }

      // Update the region_to_register_dealloc_map_ to register insertion of
      // deallocation expression after expr for the regions accessed by expr.
      void maybeRegisterDeallocsAfterExpr(Expr* expr) {
        // If expr is a trivial for loop, then we don't need to move the
        // deallocation after it. This is because trivial loops are not
        // generated in the final code.
        if (auto fl = dynamic_cast<ForLoop*>(expr)) {
          if (fl->isTrivial()) {
            return;
          }
        }
        // If expr is not accessing any tensor memory, then nothing to do.
        if (!access_map_.count(expr)) {
          return;
        }
        for (auto region : access_map_.at(expr)) {
          auto current_scope = scope_.empty() ? nullptr : scope_.back();
          region_to_register_dealloc_map_[region] =
              [this, expr, region, current_scope]() {
                auto tcgen05_dealloc_expr = IrBuilder::create<kir::Asm>(
                    "tcgen05.dealloc.cta_group::1.sync.aligned.b32",
                    std::vector<Val*>{},
                    std::vector<Val*>{
                        IrBuilder::create<kir::TensorIndex>(
                            region->address, expr->fusion()->zeroVal()),
                        region->num_columns},
                    kir::Asm::Options{/*volatile=*/true});
                registerInsertAfter(expr, tcgen05_dealloc_expr, current_scope);
              };
        }
      }

      void dispatch(Expr* expr) final {
        updateAccessMap(expr);
        ExprMutator::dispatch(expr);
        maybeRegisterDeallocsAfterExpr(expr);
      }

     public:
      DeallocInserter(
          const std::vector<Expr*>& exprs,
          std::vector<Expr*>& exprs_with_deallocs) {
        handle(exprs);
        for (const auto& region :
             GpuLower::current()->tmemInfo().allocation.regions) {
          region_to_register_dealloc_map_.at(&region)();
        }
        exprs_with_deallocs = mutate();
      }
    } inserter(exprs, exprs_with_deallocs);
  }

  // Combine prologue and exprs_with_deallocs
  std::vector<Expr*> result;
  result.reserve(prologue.size() + exprs_with_deallocs.size());
  result.insert(result.end(), prologue.begin(), prologue.end());
  result.insert(
      result.end(), exprs_with_deallocs.begin(), exprs_with_deallocs.end());
  return result;
}

} // namespace
Inconsistent Error Handling

The Allocate class has inconsistent error handling. Some methods throw exceptions while others return error codes. It would be better to have a consistent error handling approach throughout the class.

  Val* size() const {
    return input(0);
  }

  //! Size of each dimension
  std::vector<Val*> shape() const {
    constexpr int64_t num_attributes_before_shape = 8;
    std::vector<Val*> result;
    result.reserve(attributes().size() - num_attributes_before_shape);
    for (auto i = attributes().begin() + num_attributes_before_shape;
         i != attributes().end();
         ++i) {
      result.emplace_back((*i)->as<Val>());
    }
    return result;
  }

  //! Does this allocation require its memory to be initialized to zero before
  //! this kernel is launched? If this is true, then an additional memset
  //! kernel might be launched before the current Fusion kernel is launched in
  //! order to guarantee that this buffer is filled with zeroes (see
  //! resetsToZero() below).
  bool zeroInit() const {
    return attribute<bool>(2);
  }

  //! Is this buffer guaranteed to be reset to all zero values at the end of
  //! this kernel? This is used to avoid an additional memset kernel launch for
  //! buffers that require zeroed memory (see zeroInit() above).
  //!
  //! A common use case for zeroInit() allocations is semaphore buffers that
  //! hold counters starting at zero. Typically, each participating thread would
  //! increment the counter and the last thread would leave the counter in a
  //! non-zeroed state. The next time that kernel is run, it can no longer
  //! re-use the non-zero semaphore buffer, so KernelExecutor will launch
  //! at::zeroes to allocate a new buffer, resulting in a memset kernel launch.
  //!
  //! Instead, if the last thread resets the counter to zero, then the buffer
  //! can be re-used, and at::zeroes need only be run at the first kernel
  //! launch. If resetsToZero() is true, then KernelExecutor will use
  //! contigZeroedTensor() and releaseZeroedMemory() from global_allocator.h to
  //! reuse zeroed memory avoiding the additional kernel launch.
  //!
  //! Whenever possible, we should try to guarantee that resetsToZero() is true
  //! if zeroInit() is true by modifying our code to clean up global counters,
  //! because the latency penalty of an additional kernel launch should be
  //! greater than that required to reset this memory at the end of the fusion.
  //! The exception is when a kernel is launched only a single time, in which
  //! case resetting the memory is unnecessary, but we expect that kernels will
  //! instead be launched many times.
  bool resetsToZero() const {
    return attribute<bool>(3);
  }
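
  //! Illustrative sketch (hypothetical, not code in this class): a common
  //! device-side pattern for a zeroInit() semaphore that also resetsToZero().
  //! The last arriving participant restores the counter to zero so the same
  //! buffer can be reused on the next launch without another memset
  //! (num_participants is a placeholder for the expected arrival count):
  //!
  //!   unsigned prev = atomicAdd(semaphore, 1u);  // buffer starts zeroed
  //!   if (prev == num_participants - 1) {
  //!     *semaphore = 0u;  // reset for reuse by the next launch
  //!   }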

  // This alias tracks the next Allocate node in a linked chain of aliases
  // If the alias is nullptr, then the Allocate node uses memory in the kernel
  const Allocate* alias() const {
    return dynamic_cast<const Allocate*>(attribute(4));
  }

  // Set the address of a shared memory allocation within the dynamic shared
  // memory array. The addr argument should be a scalar expression describing an
  // aligned address in bytes.
  void setAddress(Val* addr) {
    NVF_CHECK(
        memoryType() == MemoryType::Shared ||
            memoryType() == MemoryType::Tensor,
        "Allocation address may only be set for shared/tensor memory allocations. Memory type is ",
        memoryType());
    NVF_CHECK(
        address() == nullptr,
        "Attempted to set address twice for allocation ",
        toString());
    attributes_[5] = addr;
  }

  void setLaneOffset(Val* lane_offset) {
    NVF_CHECK(
        memoryType() == MemoryType::Tensor,
        "Lane offset may only be set for tensor memory allocations. Memory type is ",
        memoryType());
    NVF_CHECK(
        laneOffset() == nullptr,
        "Attempted to set lane offset twice for allocation ",
        toString());
    attributes_[6] = lane_offset;
  }

  void setColOffset(Val* col_offset) {
    NVF_CHECK(
        memoryType() == MemoryType::Tensor,
        "Column offset may only be set for tensor memory allocations. Memory type is ",
        memoryType());
    NVF_CHECK(
        colOffset() == nullptr,
        "Attempted to set column offset twice for allocation ",
        toString());
    attributes_[7] = col_offset;
  }

  // This is an integer scalar describing the byte address within the dynamic
  // shared memory array for a shared memory allocation. For memory types other
  // than Shared, or before allocation, this function might return nullptr.
  Val* address() const {
    NVF_CHECK(
        memoryType() == MemoryType::Shared ||
            memoryType() == MemoryType::Tensor,
        "Allocation address may only be set for shared memory allocations. Memory type is ",
        memoryType());
    return attributeVal(5);
  }

  Val* laneOffset() const {
    NVF_CHECK(
        memoryType() == MemoryType::Tensor,
        "Lane offset may only be set for tensor memory allocations. Memory type is ",
        memoryType());
    return attributeVal(6);
  }

  Val* colOffset() const {
    NVF_CHECK(
        memoryType() == MemoryType::Tensor,
        "Column offset may only be set for tensor memory allocations. Memory type is ",
        memoryType());
    return attributeVal(7);
  }
};

testValidate(&fusion, cg_outputs, {t0}, {t0}, __LINE__, __FILE__);
}

TEST_F(TMemTest, AddKernel) {
zasdfgbnm (Collaborator, Author) commented with the generated code:

__global__ void nvfuser_none_f0_c0_r0_g0(Tensor<float, 1, 1> T0, Tensor<float, 1, 1> T4, Tensor<float, 1, 1> T9) {
  alignas(16) extern __shared__ char array[];
  const unsigned smem_offset = 0;
  nvfuser_index_t i0;
  i0 = ((nvfuser_index_t)threadIdx.x) + (32LL * ((nvfuser_index_t)blockIdx.x));
  bool b1;
  b1 = i0 < T0.logical_size[0LL];
  uint32_t* T10 = reinterpret_cast<uint32_t*>(array + smem_offset + 16LL);
  asm volatile("tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;\n"::"r"((uint32_t)(toSmem(T10))), "n"(32U));
  uint32_t* T11 = reinterpret_cast<uint32_t*>(array + smem_offset + 0LL);
  asm volatile("tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;\n"::"r"((uint32_t)(toSmem(T11))), "n"(32U));
  asm volatile("tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;\n");
  __syncthreads();
  float T1[1LL];
  T1[0LL] = 0LL;
  if (b1) {
    T1[0LL]
       = T0[((T0.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32LL * T0.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  TMemTensor T2(T10[0LL], 0, 0);
  asm volatile(
    "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n"
    :
    :"r"((uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0})),
     "f"((*reinterpret_cast<Array<float, 1, 1>*>(&T1[0LL]))[0])
  );
  asm volatile("tcgen05.wait::st.sync.aligned;\n");
  float T3[1LL];
  asm(
    "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
    :"=f"((*reinterpret_cast<Array<float, 1, 1>*>(&T3[0LL]))[0])
    :"r"((uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0}))
  );
  asm volatile("tcgen05.wait::ld.sync.aligned;\n");
  asm volatile("tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;\n"::"r"(T10[0LL]), "n"(32U));
  float T5[1LL];
  T5[0LL] = 0LL;
  if (b1) {
    T5[0LL]
       = T4[((T4.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32LL * T4.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  TMemTensor T6(T11[0LL], 0, 0);
  asm volatile(
    "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n"
    :
    :"r"((uint32_t)(T6 + Array<uint16_t, 2, 1>{0, 0})),
     "f"((*reinterpret_cast<Array<float, 1, 1>*>(&T5[0LL]))[0])
  );
  asm volatile("tcgen05.wait::st.sync.aligned;\n");
  float T7[1LL];
  asm(
    "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
    :"=f"((*reinterpret_cast<Array<float, 1, 1>*>(&T7[0LL]))[0])
    :"r"((uint32_t)(T6 + Array<uint16_t, 2, 1>{0, 0}))
  );
  asm volatile("tcgen05.wait::ld.sync.aligned;\n");
  asm volatile("tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;\n"::"r"(T11[0LL]), "n"(32U));
  float T8[1LL];
  T8[0LL]
    = T3[0LL]
    + T7[0LL];
  if (b1) {
    T9[i0]
       = T8[0LL];
  }
}

// Tensor memory tests
using TMemTest = NVFuserTest;

TEST_F(TMemTest, GmemRegTMemRegGmemCopy) {
zasdfgbnm (Collaborator, Author) commented with the generated code:

__global__ void nvfuser_none_f0_c0_r0_g0(Tensor<float, 1, 1> T0, Tensor<float, 1, 1> T4) {
  alignas(16) extern __shared__ char array[];
  const unsigned smem_offset = 0;
  nvfuser_index_t i0;
  i0 = ((nvfuser_index_t)threadIdx.x) + (32LL * ((nvfuser_index_t)blockIdx.x));
  bool b1;
  b1 = i0 < T0.logical_size[0LL];
  uint32_t* T5 = reinterpret_cast<uint32_t*>(array + smem_offset + 0LL);
  asm volatile("tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;\n"::"r"((uint32_t)(toSmem(T5))), "n"(32U));
  asm volatile("tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;\n");
  __syncthreads();
  float T1[1LL];
  T1[0LL] = 0LL;
  if (b1) {
    T1[0LL]
       = T0[((T0.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32LL * T0.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  TMemTensor T2(T5[0LL], 0, 0);
  asm volatile(
    "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n"
    :
    :"r"((uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0})),
     "f"((*reinterpret_cast<Array<float, 1, 1>*>(&T1[0LL]))[0])
  );
  asm volatile("tcgen05.wait::st.sync.aligned;\n");
  float T3[1LL];
  asm(
    "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
    :"=f"((*reinterpret_cast<Array<float, 1, 1>*>(&T3[0LL]))[0])
    :"r"((uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0}))
  );
  asm volatile("tcgen05.wait::ld.sync.aligned;\n");
  asm volatile("tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;\n"::"r"(T5[0LL]), "n"(32U));
  if (b1) {
    T4[i0]
       = T3[0LL];
  }
}

zasdfgbnm added a commit that referenced this pull request Jan 29, 2025
Extracted from #3755 to make it
easier to review.
zasdfgbnm added a commit that referenced this pull request Jan 29, 2025
@zasdfgbnm zasdfgbnm changed the base branch from main to tmem-no-alloc January 29, 2025 20:05
@zasdfgbnm zasdfgbnm changed the base branch from tmem-no-alloc to alloc-tmem January 30, 2025 00:55
zasdfgbnm added a commit that referenced this pull request Jan 30, 2025
Extracted from #3755 to make code
review easy.

This PR adds a new unit test `TMemTest.GmemRegTMemRegGmemCopy` that
schedules a copy kernel gmem -> register -> tmem -> register -> gmem,
and updates our system with the minimum required changes to make this
test pass.
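
For orientation, the fusion for such a copy kernel would be built roughly as follows (a minimal sketch using nvFuser's C++ test API; the tensor names, split factor, and scheduling here are assumptions, not necessarily the PR's exact test):

```cpp
Fusion fusion;
FusionGuard fg(&fusion);

// gmem -> register -> tmem -> register -> gmem
TensorView* tv0 = makeSymbolicTensor(1);  // global memory input
fusion.addInput(tv0);
TensorView* tv1 = set(tv0);  // register
TensorView* tv2 = set(tv1);  // tensor memory
TensorView* tv3 = set(tv2);  // register
TensorView* tv4 = set(tv3);  // global memory output
fusion.addOutput(tv4);

tv2->setMemoryType(MemoryType::Tensor);

// One element per thread, 32 threads per block (consistent with the
// generated index i0 = threadIdx.x + 32 * blockIdx.x shown below).
for (auto tv : {tv1, tv2, tv3, tv4}) {
  tv->split(0, 32);
  tv->axis(0)->parallelize(ParallelType::BIDx);
  tv->axis(1)->parallelize(ParallelType::TIDx);
}
```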

The purpose of this PR is not to provide a good implementation of TMem
support, but just to provide the absolute minimal requirement for us to
start. Limitations are:
1. The index is hard-coded to zero, so this PR does not touch the
interesting topic of "how to schedule a TMem tensor?"
2. The TMem is used without allocation. Using memory that is not
allocated is clearly the wrong way to program, but as described in the
code comment, if a fusion only has one TMem TensorView, it is guaranteed
to work.

Generated code:
```CUDA
__global__ void nvfuser_none_f0_c0_r0_g0(Tensor<float, 1, 1> T0, Tensor<float, 1, 1> T4) {
  nvfuser_index_t i0;
  i0 = ((nvfuser_index_t)threadIdx.x) + (32 * ((nvfuser_index_t)blockIdx.x));
  bool b1;
  b1 = i0 < T0.logical_size[0LL];
  Array<float, 1, 1> T1;
  T1[0] = 0;
  if (b1) {
    T1[0]
       = T0[((T0.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32 * T0.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  asm volatile(
    "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n"
    :
    :"r"(0U),
     "f"((*reinterpret_cast<Array<float, 1, 1>*>(&T1[0]))[0])
  );
  asm volatile("tcgen05.wait::st.sync.aligned;\n");
  Array<float, 1, 1> T3;
  asm(
    "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
    :"=f"((*reinterpret_cast<Array<float, 1, 1>*>(&T3[0]))[0])
    :"r"(0U)
  );
  asm volatile("tcgen05.wait::ld.sync.aligned;\n");
  if (b1) {
    T4[i0]
       = T3[0];
  }
}
```
@zasdfgbnm zasdfgbnm changed the base branch from alloc-tmem to better-alloc-tmem January 31, 2025 21:47
@zasdfgbnm zasdfgbnm closed this Jan 31, 2025