Tensor memory support 1 #3755

Closed
wants to merge 25 commits into from

Conversation

zasdfgbnm (Collaborator) commented:

Not ready for review


github-actions bot commented Jan 24, 2025

PR Reviewer Guide 🔍

(Review updated until commit f9adf69)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 5 🔵🔵🔵🔵🔵
🧪 PR contains tests
⚡ Recommended focus areas for review

Memory Leak

The computeTMemInfo function allocates memory for TensorMemoryInfo and TMemAlllocationInfo but does not deallocate it. This could lead to memory leaks.

TensorMemoryInfo computeTMemInfo(Fusion* fusion) {
  TensorMemoryInfo result;

  // Compute the allocation information for tensor memory. Currently, we use a
  // very simple heuristic that assigns a separate region for each TensorView.
  // See note [Tensor Memory Allocation] for the overall design.
  auto& regions = result.allocation.regions;
  for (auto tv : fusion->allTvs()) {
    if (tv->getMemoryType() != MemoryType::Tensor) {
      continue;
    }
    regions.emplace_back();
    auto& region = regions.back();

    region.address = TensorViewBuilder()
                         .shape(std::vector<Val*>{})
                         .dtype(DataType::UInt32)
                         .build();
    region.address->setMemoryType(MemoryType::Shared);

    // TODO: right now we hardcode the number of columns to be 32. This is
    // definitely not correct.
    region.num_columns = IrBuilder::create<Val>(32, DataType::UInt32);

    region.covered_tensors.emplace_back();
    auto& covered_tensor = region.covered_tensors.back();
    covered_tensor.tensor = tv;
    covered_tensor.lane_offset = tv->fusion()->zeroVal(DataType::UInt16);
    covered_tensor.column_offset = tv->fusion()->zeroVal(DataType::UInt16);
  }

  return result;
}
Potential Bug

The insertTMemRegionAllocsAndDeallocs function inserts deallocations after the outermost serial loop containing the last read of each TMem region. However, it does not check whether a deallocation for a region has already been inserted before registering another one, which could lead to duplicate deallocations.

    return inserter.exprs_;
  }
};

// Insert IR nodes that allocate and deallocate TMem regions.
// See note [Tensor Memory Allocation] for the overall design.
// We insert the tcgen05.allocs of each region and the relinquish of the right
// to allocate at the beginning of the top-level scope of the kernel. We insert
// the tcgen05.deallocs after the outermost serial loop containing the last read
// of each TMem region into whatever scope containing this outermost serial
// loop. The allocation of each TMem TensorView within each region is inserted
// by AllocationInserter::insert, therefore not handled here.
std::vector<Expr*> insertTMemRegionAllocsAndDeallocs(
    const std::vector<Expr*>& exprs) {
  // Expressions to be inserted at the beginning of the top-level scope.
  std::list<Expr*> prologue;
  {
    // For each TMem region, allocate its address in shared memory, and insert
    // the tcgen05.alloc for tensor memory allocation.
    for (const auto& region :
         GpuLower::current()->tmemInfo().allocation.regions) {
      // kir::Allocate for the address tensor on shared memory
      auto address_alloc_expr =
          IrBuilder::create<kir::Allocate>(region.address, MemoryType::Shared);
      prologue.push_back(address_alloc_expr);
      // the tcgen05.alloc instruction
      auto alloc_expr =
          IrBuilder::create<kir::AllocTMem>(region.address, region.num_columns);
      prologue.push_back(alloc_expr);
    }

    // Relinquish the right to allocate after all regions have been allocated
    auto tcgen05_relinquish_expr = IrBuilder::create<kir::Asm>(
        "tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned",
        std::vector<Val*>{},
        std::vector<Val*>{},
        kir::Asm::Options{/*volatile=*/true});
    prologue.push_back(tcgen05_relinquish_expr);

    // Block sync that makes allocation visible to all threads
    auto block_sync = IrBuilder::create<kir::BlockSync>();
    prologue.push_back(block_sync);
  }

  // Add deallocations to existing expressions
  std::vector<Expr*> exprs_with_deallocs;
  {
    class DeallocInserter : public kir::ExprMutator {
      // A map:
      //   region -> a function that registers the deallocation expression for
      //             this region
      //
      // This map is updated during traversal. For example, if we have a kernel
      // like below:
      //   ...
      //   T1_t = T0_r; // expr1
      //   ...
      //   T2_r = T1_t; // expr2
      // Assume that T1_t is in region R1. Then after we handle(expr1), we will
      // have an entry:
      //    R1 -> a function registering insertion of "dealloc R1" after expr1
      // After handle(expr2), this entry becomes:
      //    R1 -> a function registering insertion of "dealloc R1" after expr2
      //
      // After traversing the entire kernel, this map will contain the final
      // register functions we want to execute.
      std::unordered_map<
          const TMemAlllocationInfo::Region*,
          std::function<void()>>
          region_to_register_dealloc_map_;

      // A map:
      //   expr -> the regions that this expr is accessing
      // Note that if expr is a container such as ForLoop or IfThenElse, then
      // the mapped regions will be all the regions the contained exprs are
      // accessing.
      //
      // This map only contains information about accesses we have discovered,
      // and is updated during traversal. For example, if we have a kernel:
      //   ForLoop: // loop1
      //     T2_t = T0_r; // expr1
      //     ...
      //     T3_t = T1_r; // expr2
      // Assume T2_t is in region R2 and T3_t is in region R3. Then after
      // handle(expr1), we will have a map:
      //    expr1 -> {R2}
      //    loop1 -> {R2}
      // After handle(expr2), this map becomes:
      //    expr1 -> {R2}
      //    expr2 -> {R3}
      //    loop1 -> {R2, R3}
      std::unordered_map<
          Expr*,
          VectorOfUniqueEntries<const TMemAlllocationInfo::Region*>>
          access_map_;

      // Analyze expr to see if it has any accesses to tensor memory. If yes
      // update the access map for this expr and its container exprs.
      void updateAccessMap(Expr* expr) {
        std::unordered_set<Val*> io_vals;
        std::copy(
            expr->inputs().begin(),
            expr->inputs().end(),
            std::inserter(io_vals, io_vals.end()));
        std::copy(
            expr->outputs().begin(),
            expr->outputs().end(),
            std::inserter(io_vals, io_vals.end()));
        if (io_vals.empty()) {
          return;
        }
        for (const auto& region :
             GpuLower::current()->tmemInfo().allocation.regions) {
          for (auto tv_info : region.covered_tensors) {
            if (io_vals.count(tv_info.tensor)) {
              access_map_[expr].pushBack(&region);
              for (auto container : scope_exprs_) {
                access_map_[container].pushBack(&region);
              }
              break;
            }
          }
        }
      }

      // Update the region_to_register_dealloc_map_ to register insertion of
      // deallocation expression after expr for the regions accessed by expr.
      void maybeRegisterDeallocsAfterExpr(Expr* expr) {
        // If expr is a trivial for loop, then we don't need to move the
        // deallocation after it. This is because trivial loops are not
        // generated in the final code.
        if (auto fl = dynamic_cast<ForLoop*>(expr)) {
          if (fl->isTrivial()) {
            return;
          }
        }
        // If expr is not accessing any tensor memory, then nothing to do.
        if (!access_map_.count(expr)) {
          return;
        }
        for (auto region : access_map_.at(expr)) {
          auto current_scope = scope_.empty() ? nullptr : scope_.back();
          region_to_register_dealloc_map_[region] =
              [this, expr, region, current_scope]() {
                auto tcgen05_dealloc_expr = IrBuilder::create<kir::Asm>(
                    "tcgen05.dealloc.cta_group::1.sync.aligned.b32",
                    std::vector<Val*>{},
                    std::vector<Val*>{
                        IrBuilder::create<kir::TensorIndex>(
                            region->address, expr->fusion()->zeroVal()),
                        region->num_columns},
                    kir::Asm::Options{/*volatile=*/true});
                registerInsertAfter(expr, tcgen05_dealloc_expr, current_scope);
              };
        }
      }

      void dispatch(Expr* expr) final {
        updateAccessMap(expr);
        ExprMutator::dispatch(expr);
        maybeRegisterDeallocsAfterExpr(expr);
      }

     public:
      DeallocInserter(
          const std::vector<Expr*>& exprs,
          std::vector<Expr*>& exprs_with_deallocs) {
        handle(exprs);
        for (const auto& region :
             GpuLower::current()->tmemInfo().allocation.regions) {
          region_to_register_dealloc_map_.at(&region)();
        }
        exprs_with_deallocs = mutate();
      }
    } inserter(exprs, exprs_with_deallocs);
  }

  // Combine prologue and exprs_with_deallocs
  std::vector<Expr*> result;
  result.reserve(prologue.size() + exprs_with_deallocs.size());
  result.insert(result.end(), prologue.begin(), prologue.end());
  result.insert(
      result.end(), exprs_with_deallocs.begin(), exprs_with_deallocs.end());
  return result;
}

} // namespace
Inconsistent Error Handling

The Allocate class has inconsistent error handling. Some methods throw exceptions while others return error codes. It would be better to have a consistent error handling approach throughout the class.

  Val* size() const {
    return input(0);
  }

  //! Size of each dimension
  std::vector<Val*> shape() const {
    constexpr int64_t num_attributes_before_shape = 8;
    std::vector<Val*> result;
    result.reserve(attributes().size() - num_attributes_before_shape);
    for (auto i = attributes().begin() + num_attributes_before_shape;
         i != attributes().end();
         ++i) {
      result.emplace_back((*i)->as<Val>());
    }
    return result;
  }

  //! Does this allocation require its memory to be initialized to zero before
  //! this kernel is launched? If this is true, then an additional memset
  //! kernel might be launched before the current Fusion kernel is launched in
  //! order to guarantee that this buffer is filled with zeroes (see
  //! resetsToZero() below).
  bool zeroInit() const {
    return attribute<bool>(2);
  }

  //! Is this buffer guaranteed to be reset to all zero values at the end of
  //! this kernel? This is used to avoid an additional memset kernel launch for
  //! buffers that require zeroed memory (see zeroInit() above).
  //!
  //! A common use case for zeroInit() allocations is semaphore buffers that
  //! hold counters starting at zero. Typically, each participating thread would
  //! increment the counter and the last thread would leave the counter in a
  //! non-zeroed state. The next time that kernel is run, it can no longer
  //! re-use the non-zero semaphore buffer, so KernelExecutor will launch
  //! at::zeroes to allocate a new buffer, resulting in a memset kernel launch.
  //!
  //! Instead, if the last thread resets the counter to zero, then the buffer
  //! can be re-used, and at::zeroes need only be run at the first kernel
  //! launch. If resetsToZero() is true, then KernelExecutor will use
  //! contigZeroedTensor() and releaseZeroedMemory() from global_allocator.h to
  //! reuse zeroed memory avoiding the additional kernel launch.
  //!
  //! Whenever possible, we should try to guarantee that resetsToZero() is true
  //! if zeroInit() is true by modifying our code to clean up global counters,
  //! because the latency penalty of an additional kernel launch should be
  //! greater than that required to reset this memory at the end of the fusion.
  //! The exception is when a kernel is launched only a single time, in which
  //! case resetting the memory is unnecessary, but we expect that kernels will
  //! instead be launched many times.
  bool resetsToZero() const {
    return attribute<bool>(3);
  }
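
  //! Illustrative sketch (hypothetical, not code in this class): a common
  //! device-side pattern for a zeroInit() semaphore that also resetsToZero().
  //! The last arriving participant restores the counter to zero so the same
  //! buffer can be reused on the next launch without another memset
  //! (num_participants is a placeholder for the expected arrival count):
  //!
  //!   unsigned prev = atomicAdd(semaphore, 1u);  // buffer starts zeroed
  //!   if (prev == num_participants - 1) {
  //!     *semaphore = 0u;  // reset for reuse by the next launch
  //!   }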

  // This alias tracks the next Allocate node in a linked chain of aliases
  // If the alias is nullptr, then the Allocate node uses memory in the kernel
  const Allocate* alias() const {
    return dynamic_cast<const Allocate*>(attribute(4));
  }

  // Set the address of a shared memory allocation within the dynamic shared
  // memory array. The addr argument should be a scalar expression describing an
  // aligned address in bytes.
  void setAddress(Val* addr) {
    NVF_CHECK(
        memoryType() == MemoryType::Shared ||
            memoryType() == MemoryType::Tensor,
        "Allocation address may only be set for shared/tensor memory allocations. Memory type is ",
        memoryType());
    NVF_CHECK(
        address() == nullptr,
        "Attempted to set address twice for allocation ",
        toString());
    attributes_[5] = addr;
  }

  void setLaneOffset(Val* lane_offset) {
    NVF_CHECK(
        memoryType() == MemoryType::Tensor,
        "Lane offset may only be set for tensor memory allocations. Memory type is ",
        memoryType());
    NVF_CHECK(
        laneOffset() == nullptr,
        "Attempted to set lane offset twice for allocation ",
        toString());
    attributes_[6] = lane_offset;
  }

  void setColOffset(Val* col_offset) {
    NVF_CHECK(
        memoryType() == MemoryType::Tensor,
        "Column offset may only be set for tensor memory allocations. Memory type is ",
        memoryType());
    NVF_CHECK(
        colOffset() == nullptr,
        "Attempted to set column offset twice for allocation ",
        toString());
    attributes_[7] = col_offset;
  }

  // This is an integer scalar describing the byte address within the dynamic
  // shared memory array for a shared memory allocation. For memory types other
  // than Shared, or before allocation, this function might return nullptr.
  Val* address() const {
    NVF_CHECK(
        memoryType() == MemoryType::Shared ||
            memoryType() == MemoryType::Tensor,
        "Allocation address may only be set for shared memory allocations. Memory type is ",
        memoryType());
    return attributeVal(5);
  }

  Val* laneOffset() const {
    NVF_CHECK(
        memoryType() == MemoryType::Tensor,
        "Lane offset may only be set for tensor memory allocations. Memory type is ",
        memoryType());
    return attributeVal(6);
  }

  Val* colOffset() const {
    NVF_CHECK(
        memoryType() == MemoryType::Tensor,
        "Column offset may only be set for tensor memory allocations. Memory type is ",
        memoryType());
    return attributeVal(7);
  }
};

testValidate(&fusion, cg_outputs, {t0}, {t0}, __LINE__, __FILE__);
}

TEST_F(TMemTest, AddKernel) {
zasdfgbnm (Collaborator, Author) commented with the generated code:

__global__ void nvfuser_none_f0_c0_r0_g0(Tensor<float, 1, 1> T0, Tensor<float, 1, 1> T4, Tensor<float, 1, 1> T9) {
  alignas(16) extern __shared__ char array[];
  const unsigned smem_offset = 0;
  nvfuser_index_t i0;
  i0 = ((nvfuser_index_t)threadIdx.x) + (32LL * ((nvfuser_index_t)blockIdx.x));
  bool b1;
  b1 = i0 < T0.logical_size[0LL];
  uint32_t* T10 = reinterpret_cast<uint32_t*>(array + smem_offset + 16LL);
  asm volatile("tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;\n"::"r"((uint32_t)(toSmem(T10))), "n"(32U));
  uint32_t* T11 = reinterpret_cast<uint32_t*>(array + smem_offset + 0LL);
  asm volatile("tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;\n"::"r"((uint32_t)(toSmem(T11))), "n"(32U));
  asm volatile("tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;\n");
  __syncthreads();
  float T1[1LL];
  T1[0LL] = 0LL;
  if (b1) {
    T1[0LL]
       = T0[((T0.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32LL * T0.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  TMemTensor T2(T10[0LL], 0, 0);
  asm volatile(
    "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n"
    :
    :"r"((uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0})),
     "f"((*reinterpret_cast<Array<float, 1, 1>*>(&T1[0LL]))[0])
  );
  asm volatile("tcgen05.wait::st.sync.aligned;\n");
  float T3[1LL];
  asm(
    "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
    :"=f"((*reinterpret_cast<Array<float, 1, 1>*>(&T3[0LL]))[0])
    :"r"((uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0}))
  );
  asm volatile("tcgen05.wait::ld.sync.aligned;\n");
  asm volatile("tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;\n"::"r"(T10[0LL]), "n"(32U));
  float T5[1LL];
  T5[0LL] = 0LL;
  if (b1) {
    T5[0LL]
       = T4[((T4.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32LL * T4.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  TMemTensor T6(T11[0LL], 0, 0);
  asm volatile(
    "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n"
    :
    :"r"((uint32_t)(T6 + Array<uint16_t, 2, 1>{0, 0})),
     "f"((*reinterpret_cast<Array<float, 1, 1>*>(&T5[0LL]))[0])
  );
  asm volatile("tcgen05.wait::st.sync.aligned;\n");
  float T7[1LL];
  asm(
    "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
    :"=f"((*reinterpret_cast<Array<float, 1, 1>*>(&T7[0LL]))[0])
    :"r"((uint32_t)(T6 + Array<uint16_t, 2, 1>{0, 0}))
  );
  asm volatile("tcgen05.wait::ld.sync.aligned;\n");
  asm volatile("tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;\n"::"r"(T11[0LL]), "n"(32U));
  float T8[1LL];
  T8[0LL]
    = T3[0LL]
    + T7[0LL];
  if (b1) {
    T9[i0]
       = T8[0LL];
  }
}

// Tensor memory tests
using TMemTest = NVFuserTest;

TEST_F(TMemTest, GmemRegTMemRegGmemCopy) {
zasdfgbnm (Collaborator, Author) commented with the generated code:

__global__ void nvfuser_none_f0_c0_r0_g0(Tensor<float, 1, 1> T0, Tensor<float, 1, 1> T4) {
  alignas(16) extern __shared__ char array[];
  const unsigned smem_offset = 0;
  nvfuser_index_t i0;
  i0 = ((nvfuser_index_t)threadIdx.x) + (32LL * ((nvfuser_index_t)blockIdx.x));
  bool b1;
  b1 = i0 < T0.logical_size[0LL];
  uint32_t* T5 = reinterpret_cast<uint32_t*>(array + smem_offset + 0LL);
  asm volatile("tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;\n"::"r"((uint32_t)(toSmem(T5))), "n"(32U));
  asm volatile("tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;\n");
  __syncthreads();
  float T1[1LL];
  T1[0LL] = 0LL;
  if (b1) {
    T1[0LL]
       = T0[((T0.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32LL * T0.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  TMemTensor T2(T5[0LL], 0, 0);
  asm volatile(
    "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n"
    :
    :"r"((uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0})),
     "f"((*reinterpret_cast<Array<float, 1, 1>*>(&T1[0LL]))[0])
  );
  asm volatile("tcgen05.wait::st.sync.aligned;\n");
  float T3[1LL];
  asm(
    "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
    :"=f"((*reinterpret_cast<Array<float, 1, 1>*>(&T3[0LL]))[0])
    :"r"((uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0}))
  );
  asm volatile("tcgen05.wait::ld.sync.aligned;\n");
  asm volatile("tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;\n"::"r"(T5[0LL]), "n"(32U));
  if (b1) {
    T4[i0]
       = T3[0LL];
  }
}

zasdfgbnm added a commit that referenced this pull request Jan 29, 2025
Extracted from #3755 to make it
easier to review.
zasdfgbnm added a commit that referenced this pull request Jan 29, 2025
@zasdfgbnm zasdfgbnm changed the base branch from main to tmem-no-alloc January 29, 2025 20:05
@zasdfgbnm zasdfgbnm changed the base branch from tmem-no-alloc to alloc-tmem January 30, 2025 00:55
zasdfgbnm added a commit that referenced this pull request Jan 30, 2025
Extracted from #3755 to make code
review easy.

This PR adds a new unit test `TMemTest.GmemRegTMemRegGmemCopy` that
schedules a copy kernel gmem -> register -> tmem -> register -> gmem,
and updates our system with the minimum required changes to make this
test pass.
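
For orientation, the fusion for such a copy kernel would be built roughly as follows (a minimal sketch using nvFuser's C++ test API; the tensor names, split factor, and scheduling here are assumptions, not necessarily the PR's exact test):

```cpp
Fusion fusion;
FusionGuard fg(&fusion);

// gmem -> register -> tmem -> register -> gmem
TensorView* tv0 = makeSymbolicTensor(1);  // global memory input
fusion.addInput(tv0);
TensorView* tv1 = set(tv0);  // register
TensorView* tv2 = set(tv1);  // tensor memory
TensorView* tv3 = set(tv2);  // register
TensorView* tv4 = set(tv3);  // global memory output
fusion.addOutput(tv4);

tv2->setMemoryType(MemoryType::Tensor);

// One element per thread, 32 threads per block (consistent with the
// generated index i0 = threadIdx.x + 32 * blockIdx.x shown below).
for (auto tv : {tv1, tv2, tv3, tv4}) {
  tv->split(0, 32);
  tv->axis(0)->parallelize(ParallelType::BIDx);
  tv->axis(1)->parallelize(ParallelType::TIDx);
}
```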

The purpose of this PR is not to provide a good implementation of TMem
support, but just to provide the absolute minimal requirement for us to
start. Limitations are:
1. The index is hard-coded to zero, so this PR does not touch the
interesting topic of "how to schedule a TMem tensor?"
2. The TMem is used without allocation. Using memory that is not
allocated is clearly the wrong way to program, but as described in the
code comment, if a fusion only has one TMem TensorView, it is guaranteed
to work.

Generated code:
```CUDA
__global__ void nvfuser_none_f0_c0_r0_g0(Tensor<float, 1, 1> T0, Tensor<float, 1, 1> T4) {
  nvfuser_index_t i0;
  i0 = ((nvfuser_index_t)threadIdx.x) + (32 * ((nvfuser_index_t)blockIdx.x));
  bool b1;
  b1 = i0 < T0.logical_size[0LL];
  Array<float, 1, 1> T1;
  T1[0] = 0;
  if (b1) {
    T1[0]
       = T0[((T0.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32 * T0.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  asm volatile(
    "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n"
    :
    :"r"(0U),
     "f"((*reinterpret_cast<Array<float, 1, 1>*>(&T1[0]))[0])
  );
  asm volatile("tcgen05.wait::st.sync.aligned;\n");
  Array<float, 1, 1> T3;
  asm(
    "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
    :"=f"((*reinterpret_cast<Array<float, 1, 1>*>(&T3[0]))[0])
    :"r"(0U)
  );
  asm volatile("tcgen05.wait::ld.sync.aligned;\n");
  if (b1) {
    T4[i0]
       = T3[0];
  }
}
```
@zasdfgbnm zasdfgbnm changed the base branch from alloc-tmem to better-alloc-tmem January 31, 2025 21:47
@zasdfgbnm zasdfgbnm closed this Jan 31, 2025