
Conversation


@phu0ngng phu0ngng commented Nov 14, 2025

Description

NVTEGroupedTensor class and helpers

Type of change

  • Documentation change (documentation-only change, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@zhongbozhu (Collaborator)

LGTM as discussed offline.

It would be better if we could have some example usage of the new API. Otherwise, the new checkGroupOutputTensor seems complicated, and I am not sure whether it is too strict.

Comment on lines 390 to 394
// [TODO] Discuss whether the first_dims and second_dims should be according to layout N
// Shape information: first_dims[i] and second_dims[i] define the shape of the i-th tensor
// For 2D tensors: shape[i] = (first_dims[i], second_dims[i])
SimpleTensor first_dims; // Device pointer to size_t array of length num_tensors
SimpleTensor second_dims; // Device pointer to size_t array of length num_tensors
Member

The way I think about it is:

  • We first need to standardize which direction defines what is "first" and what is "second" (I prefer "last", BTW) -> I vote for rowwise.
  • Then my thinking is: if the rowwise allocation has shape [m, k], the existence (or not) of those shape arrays would tell us which dimension is constant (e.g. second_dim being uninitialized would mean that all tensors have shape [m_i, k]), which could be used for additional optimizations (e.g. via specialized kernel choice).

Alternatively the "reference" shape could be a property of the GroupedTensor itself just to avoid setting the shape on otherwise uninitialized rowwise tensor.
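The convention sketched above (an empty dims array signals a constant dimension, with a fallback "reference" shape held by the group) could look roughly like this. `GroupedShapeInfo` and the helper names are hypothetical illustrations, not the actual TE types:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch (not the TE API): per-tensor shapes where an empty
// dims array means that dimension is shared by every tensor in the group,
// with the shared value taken from a "reference" shape on the group itself.
struct GroupedShapeInfo {
  std::vector<std::size_t> first_dims;   // per-tensor first (row-wise) dims; empty => constant
  std::vector<std::size_t> second_dims;  // per-tensor second dims; empty => constant
  std::size_t ref_first = 0;             // reference shape fallback
  std::size_t ref_second = 0;
};

// Shape of the i-th tensor under this convention: [m_i, k] when second_dims
// is empty, [m, k_i] when first_dims is empty, [m_i, k_i] otherwise.
inline std::size_t first_dim(const GroupedShapeInfo& s, std::size_t i) {
  return s.first_dims.empty() ? s.ref_first : s.first_dims[i];
}
inline std::size_t second_dim(const GroupedShapeInfo& s, std::size_t i) {
  return s.second_dims.empty() ? s.ref_second : s.second_dims[i];
}
```

A kernel dispatcher could then branch on `second_dims.empty()` to pick a specialized constant-k path.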

@timmoon10 timmoon10 commented Nov 19, 2025

  • I'm interpreting these dims as the logical tensor dims, which match the row-wise data dims. Logical dims are completely independent of the data format, regardless of whether the column-wise data is transposed or not.
  • I like the idea of the grouped tensor holding the "reference" shape and using it depending on whether first_dim/second_dim are empty. I don't think we can rely on the shape of the data tensors since they might need to be flattened to 1D, e.g. FP8 transpose when splitting along the first dim.
  • Being able to split along 2 dims makes this very general. However, for MoE we always split along the first logical dim (PyTorch has column-major weights and JAX has row-major, so we need to swap the usage of row-wise and column-wise data. However, we are still splitting along the first logical dim). We should decide whether future-proofing is worth the extra complexity.

Collaborator

Actually, on second thought I think my last bullet point is still true. For MoE, we are always splitting along the first logical dim. Internally we might be splitting a transposed matrix along the last dim, but that is a detail within the group tensor class, and shouldn't be exposed in the public API.
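To illustrate the point: splitting along the first logical dim of an [M, K] group reduces to computing per-tensor element offsets from the split sizes, regardless of whether the storage happens to be transposed internally. The helper below is a hypothetical illustration, not TE code:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: element offsets of each sub-tensor when an [M, K]
// tensor with row-major storage is split along the first logical dim into
// chunks of m_i rows. For a transposed copy the same logical split becomes
// a split along the storage's last dim, but that is an internal detail;
// the public API only needs the per-tensor m_i values.
std::vector<std::size_t> row_split_offsets(const std::vector<std::size_t>& m_i,
                                           std::size_t k) {
  std::vector<std::size_t> offsets{0};
  for (std::size_t m : m_i) offsets.push_back(offsets.back() + m * k);
  return offsets;  // offsets[i] is where the i-th sub-tensor starts
}
```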

phu0ngng and others added 2 commits November 17, 2025 17:57
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Comment on lines +293 to +299
SimpleTensor data;
SimpleTensor columnwise_data;
SimpleTensor scale_inv;
SimpleTensor columnwise_scale_inv;
SimpleTensor amax;
SimpleTensor columnwise_amax;
SimpleTensor scale; // for FP8-DS only
Collaborator

Having a giant pile of variables is fine since we want to make progress on this quickly, but in the future we should consider refactoring to handle polymorphism more gracefully:

// Visitor pattern
struct GroupedTensor {
 public:
  struct FP8Data {
    std::optional<SimpleTensor> data;
    std::optional<SimpleTensor> transpose;
    SimpleTensor scale_inv;
    std::optional<SimpleTensor> amax;
    std::optional<SimpleTensor> scale;
  };
  struct MXFP8Data {
    std::optional<std::tuple<SimpleTensor, SimpleTensor>> rowwise_data_and_scale;
    std::optional<std::tuple<SimpleTensor, SimpleTensor>> columnwise_data_and_scale;
  };
  std::variant<FP8Data, MXFP8Data, ...> data;
};

// Inheritance pattern
struct GroupedTensor { ... };
struct FP8GroupedTensor : public GroupedTensor {
    std::optional<SimpleTensor> data;
    std::optional<SimpleTensor> transpose;
    SimpleTensor scale_inv;
    std::optional<SimpleTensor> amax;
    std::optional<SimpleTensor> scale;
};
struct MXFP8GroupedTensor : public GroupedTensor {
    std::optional<std::tuple<SimpleTensor, SimpleTensor>> rowwise_data_and_scale;
    std::optional<std::tuple<SimpleTensor, SimpleTensor>> columnwise_data_and_scale;
};
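A minimal, compilable sketch of how the variant-based design above would be consumed with std::visit. SimpleTensor is stubbed out and the payload structs are trimmed for brevity; none of this is the actual TE implementation:

```cpp
#include <cstring>      // strcmp, for checking results
#include <optional>
#include <type_traits>
#include <variant>

struct SimpleTensor { /* dptr, shape, dtype, ... */ };

struct FP8Data {
  std::optional<SimpleTensor> data;
  SimpleTensor scale_inv;
};
struct MXFP8Data {
  std::optional<SimpleTensor> rowwise_data;
  std::optional<SimpleTensor> columnwise_data;
};

struct GroupedTensor {
  std::variant<FP8Data, MXFP8Data> data;
};

// A visitor keeps format-specific logic in one place instead of branching
// on a scaling-mode enum at every call site.
const char* format_name(const GroupedTensor& t) {
  return std::visit(
      [](const auto& d) -> const char* {
        using T = std::decay_t<decltype(d)>;
        if constexpr (std::is_same_v<T, FP8Data>) {
          return "FP8";
        } else {
          return "MXFP8";
        }
      },
      t.data);
}
```

The compiler rejects a visitor that forgets one alternative, which is the main safety advantage over the "giant pile of variables" layout.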

phu0ngng and others added 3 commits November 19, 2025 22:51
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
phu0ngng and others added 3 commits November 21, 2025 12:03
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@ptrendx ptrendx added the MoE label Nov 22, 2025
phu0ngng and others added 5 commits November 24, 2025 12:45
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng phu0ngng marked this pull request as ready for review November 24, 2025 21:33
@phu0ngng (Collaborator Author)

/te-ci L0

greptile-apps bot commented Nov 24, 2025

Greptile Overview

Greptile Summary

This PR introduces NVTEGroupedTensor, a new abstraction for managing collections of tensors with varying shapes but uniform dtype and scaling mode. The implementation follows the existing TensorAllocator pattern with a pool-based allocator using 1-based indexing and a free list for reuse.

Key additions:

  • GroupedTensor struct with flexible shape representation (logical_shape, first_dims, last_dims, tensor_offsets)
  • GroupedTensorAllocator for memory-efficient tensor management
  • Comprehensive validation helpers (CheckGroupedTensorShapeArrays, CheckInputGroupedTensor, CheckOutputGroupedTensor)
  • Complete C API with create/destroy/get/set operations
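The pool pattern described above (handles are 1-based indices into a vector, with freed slots recycled through a free list) can be sketched as follows. The names are illustrative, not the actual GroupedTensorAllocator:

```cpp
#include <cstdint>
#include <vector>

// Illustrative pool allocator: handles are 1-based indices (0 stays
// "invalid"), and freed slots are recycled via a free list.
struct Slot {
  int payload = 0;
  void clear() { payload = 0; }
};

class PoolAllocator {
 public:
  std::uintptr_t Allocate() {
    if (!free_list_.empty()) {
      std::uintptr_t index = free_list_.back();
      free_list_.pop_back();
      memory_[index - 1].clear();  // reinitialize to avoid stale data
      return index;
    }
    memory_.emplace_back();
    return memory_.size();  // index of the new slot, 1-based
  }
  void Free(std::uintptr_t index) {
    memory_[index - 1].clear();
    free_list_.push_back(index);
  }
  Slot& Get(std::uintptr_t index) { return memory_[index - 1]; }

 private:
  std::vector<Slot> memory_;
  std::vector<std::uintptr_t> free_list_;
};
```

Returning the index as an opaque handle (cast to a pointer type in the real API) keeps the vector free to grow without invalidating handles.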

Previous review comments addressed:
Most issues from prior reviews appear to have been addressed or are protected by validation guards. The code includes appropriate checks for empty tensors, validates num_tensors > 0 at creation, and wraps dtype-dependent checks in data existence conditions.

Confidence Score: 3/5

  • This PR has solid architecture and validation but contains thread safety concerns that need attention before merging
  • Score reflects well-designed API and validation logic, but deducted points for: (1) potential thread safety issue in convertNVTEGroupedTensor accessing vector without lock despite atomic size check, (2) inconsistent null-checking patterns (dptr != nullptr vs has_data()), and (3) lack of tests mentioned in checklist. The experimental nature is appropriately marked.
  • Pay close attention to transformer_engine/common/transformer_engine.cpp, particularly the GroupedTensorAllocator::convertNVTEGroupedTensor method for thread safety issues
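The flagged race, checking an atomic size and then indexing the vector without a lock, can be shown and closed in miniature as follows. This hypothetical Registry class is not the TE implementation; it only illustrates why both the bounds check and the access must sit under the same lock:

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// If another thread's push_back can reallocate the vector between a lock-free
// size check and the element access, the access may read freed memory.
// Holding the mutex across both steps closes that window.
class Registry {
 public:
  std::size_t Add(int value) {
    std::lock_guard<std::mutex> lock(mutex_);
    items_.push_back(value);
    return items_.size();  // 1-based handle; 0 stays invalid
  }
  bool Lookup(std::size_t handle, int* out) {
    std::lock_guard<std::mutex> lock(mutex_);  // guards size check AND access
    if (handle == 0 || handle > items_.size()) return false;
    *out = items_[handle - 1];
    return true;
  }

 private:
  std::mutex mutex_;
  std::vector<int> items_;
};
```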

Important Files Changed

File Analysis

  • transformer_engine/common/common.h (4/5): Adds GroupedTensor struct with shape metadata helpers (first_dims, last_dims, tensor_offsets) and validation methods. Includes has_data() method for SimpleTensor. Minor TODO comments indicate future refactoring.
  • transformer_engine/common/include/transformer_engine/transformer_engine.h (5/5): Adds C API for grouped tensors including creation, destruction, parameter getters/setters, and query functions. Well-documented with experimental markers. Clean interface design.
  • transformer_engine/common/transformer_engine.cpp (3/5): Implements GroupedTensorAllocator and validation helpers (CheckGroupedTensorShapeArrays, CheckInputGroupedTensor, CheckOutputGroupedTensor). Contains thread safety concerns in convertNVTEGroupedTensor and inconsistent null-checking patterns.

Sequence Diagram

sequenceDiagram
    participant User
    participant C_API as C API Layer
    participant Allocator as GroupedTensorAllocator
    participant Memory as Vector<GroupedTensor>
    participant Validator as Validation Functions

    User->>C_API: nvte_create_grouped_tensor(mode, num_tensors, logical_shape)
    C_API->>C_API: Validate num_tensors > 0
    C_API->>C_API: Validate logical_shape 2D and positive
    C_API->>Allocator: Allocate(mode, num_tensors, logical_shape)
    
    alt Free list not empty
        Allocator->>Allocator: Pop index from free_list
        Allocator->>Memory: memory[index-1].clear()
        Allocator->>Memory: Set scaling_mode, num_tensors, logical_shape
    else New allocation needed
        Allocator->>Memory: emplace_back(mode, num_tensors)
        Allocator->>Allocator: Update atomic size variable
        Allocator->>Memory: Set logical_shape
    end
    
    Allocator-->>C_API: Return NVTEGroupedTensor (index as void*)
    C_API-->>User: Return tensor handle

    User->>C_API: nvte_set_grouped_tensor_param(tensor, param, data)
    C_API->>Allocator: convertNVTEGroupedTensor(tensor)
    Note over Allocator: Race condition risk: reads atomic size<br/>without mutex, accesses memory vector
    Allocator-->>C_API: Return GroupedTensor*
    C_API->>Memory: Set parameter (data, first_dims, etc.)
    C_API-->>User: Parameter set

    User->>Validator: CheckInputGroupedTensor(tensor)
    Validator->>Validator: Check has_data() or has_columnwise_data()
    Validator->>Validator: CheckGroupedScaleInv()
    Validator->>Validator: CheckGroupedTensorShapeArrays()
    
    Note over Validator: Validates:<br/>- Shape arrays (first_dims, last_dims, tensor_offsets)<br/>- Logical shape is 2D<br/>- Data size matches logical_shape<br/>- Scale/scale_inv dtypes
    
    Validator-->>User: Validation result

    User->>C_API: nvte_destroy_grouped_tensor(tensor)
    C_API->>Allocator: Free(tensor)
    Allocator->>Memory: memory[index-1].clear()
    Allocator->>Allocator: Push index to free_list
    C_API-->>User: Tensor destroyed

@greptile-apps greptile-apps bot left a comment: 3 files reviewed, 2 comments

phu0ngng and others added 2 commits November 24, 2025 16:40
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps greptile-apps bot left a comment: 3 files reviewed, 1 comment

phu0ngng and others added 2 commits November 24, 2025 16:52
@greptile-apps greptile-apps bot left a comment: 3 files reviewed, 1 comment

Comment on lines +451 to +452
void nvte_set_grouped_tensor_param(NVTEGroupedTensor *tensor, NVTEGroupedTensorParam param_name,
const NVTEBasicTensor *param);
Collaborator

This works when we're setting basic tensors, but doesn't generalize to other types (bool/float/etc). Consider using a more general API like how we handle NVTEQuantizationConfig:

void nvte_set_quantization_config_attribute(NVTEQuantizationConfig config,
                                            NVTEQuantizationConfigAttribute attr,
                                            const void *buf, size_t size_in_bytes);

This is completely general, but also more cumbersome.
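For reference, the buffer-based pattern quoted above typically dispatches on the attribute enum and validates size_in_bytes at run time. The sketch below uses made-up names (Config, AttrName, set_attribute) rather than the TE API, to show how one untyped setter can still handle differently-typed fields:

```cpp
#include <cstddef>
#include <cstring>

// Illustrative (non-TE) attribute setter: one entry point covers any
// attribute type by copying raw bytes, with size acting as a run-time
// type check in place of compile-time typing.
enum AttrName { ATTR_FLAG, ATTR_SCALE };

struct Config {
  bool flag = false;
  float scale = 1.0f;
};

int set_attribute(Config* cfg, AttrName attr, const void* buf, std::size_t size) {
  switch (attr) {
    case ATTR_FLAG:
      if (size != sizeof(bool)) return -1;  // size mismatch: reject
      std::memcpy(&cfg->flag, buf, size);
      return 0;
    case ATTR_SCALE:
      if (size != sizeof(float)) return -1;
      std::memcpy(&cfg->scale, buf, size);
      return 0;
  }
  return -1;
}
```

The trade-off is exactly as stated: the interface never changes when fields are added, but callers lose compile-time type checking and must pass the right size.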

Collaborator Author

Hi, I don't think we can go with a similar API, i.e., using just a void* buf and size_in_bytes, since we need different dtypes for different fields.

Collaborator

Yet.

@timmoon10 timmoon10 self-requested a review November 25, 2025 02:47
phu0ngng and others added 2 commits November 25, 2025 09:55
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps greptile-apps bot left a comment: 3 files reviewed, no comments

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps greptile-apps bot left a comment: 3 files reviewed, no comments

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps greptile-apps bot left a comment: 3 files reviewed, 2 comments

phu0ngng and others added 2 commits November 26, 2025 10:28
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps greptile-apps bot left a comment: 3 files reviewed, no comments

@phu0ngng (Collaborator Author)

/te-ci L0

@timmoon10 timmoon10 left a comment

Overall LGTM, but this is hitting PyTorch test failures due to changes in Tensor::has_data/Tensor::has_columnwise_data. We should merge #2330 first.

phu0ngng and others added 3 commits November 26, 2025 13:50
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng (Collaborator Author)

/te-ci L0

@greptile-apps greptile-apps bot left a comment: 3 files reviewed, no comments

NVTEGroupedTensor ret = reinterpret_cast<NVTEGroupedTensor>(index);
free_list.pop_back();
// 1-based indexing - fully reinitialize the tensor to avoid stale data
memory[index - 1].clear();
Member

Why do we clear it again?

@timmoon10 timmoon10 left a comment

LGTM
