
Conversation


@phu0ngng phu0ngng commented Nov 14, 2025

Description

NVTEGroupedTensor class and helpers

Type of change

  • Documentation change (documentation-only change, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@zhongbozhu (Collaborator)

LGTM as discussed offline.

It would be better if we could have some example usage of the new API. Otherwise, the new checkGroupOutputTensor seems complicated, and I am not sure whether it is too strict.

Comment on lines 390 to 394
// [TODO] Discuss whether the first_dims and second_dims should be according to layout N
// Shape information: first_dims[i] and second_dims[i] define the shape of the i-th tensor
// For 2D tensors: shape[i] = (first_dims[i], second_dims[i])
SimpleTensor first_dims; // Device pointer to size_t array of length num_tensors
SimpleTensor second_dims; // Device pointer to size_t array of length num_tensors
Member

The way I think about it is:

  • We first need to standardize which direction defines what is "first" and what is "second" (I prefer "last", BTW) -> I vote for rowwise.
  • Then my thinking is: if the rowwise allocation has shape [m, k], the existence (or not) of those shape arrays would tell us which dimension is constant (e.g. second_dim being uninitialized would mean that all tensors have shape [m_i, k]), which could be used for additional optimizations (e.g. via specialized kernel choice).

Alternatively the "reference" shape could be a property of the GroupedTensor itself just to avoid setting the shape on otherwise uninitialized rowwise tensor.
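The convention sketched above (an empty dims array signals a constant dimension, with a fallback "reference" shape held by the group) could look roughly like this. `GroupedShapeInfo` and the helper names are hypothetical illustrations, not the actual TE types:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch (not the TE API): per-tensor shapes where an empty
// dims array means that dimension is shared by every tensor in the group,
// with the shared value taken from a "reference" shape on the group itself.
struct GroupedShapeInfo {
  std::vector<std::size_t> first_dims;   // per-tensor first (row-wise) dims; empty => constant
  std::vector<std::size_t> second_dims;  // per-tensor second dims; empty => constant
  std::size_t ref_first = 0;             // reference shape fallback
  std::size_t ref_second = 0;
};

// Shape of the i-th tensor under this convention: [m_i, k] when second_dims
// is empty, [m, k_i] when first_dims is empty, [m_i, k_i] otherwise.
inline std::size_t first_dim(const GroupedShapeInfo& s, std::size_t i) {
  return s.first_dims.empty() ? s.ref_first : s.first_dims[i];
}
inline std::size_t second_dim(const GroupedShapeInfo& s, std::size_t i) {
  return s.second_dims.empty() ? s.ref_second : s.second_dims[i];
}
```

A kernel dispatcher could then branch on `second_dims.empty()` to pick a specialized constant-k path.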

@timmoon10 timmoon10 commented Nov 19, 2025

  • I'm interpreting these dims as the logical tensor dims, which match the row-wise data dims. Logical dims are completely independent of the data format, regardless of whether the column-wise data is transposed or not.
  • I like the idea of the grouped tensor holding the "reference" shape and using it depending on whether first_dim/second_dim are empty. I don't think we can rely on the shape of the data tensors since they might need to be flattened to 1D, e.g. FP8 transpose when splitting along the first dim.
  • Being able to split along 2 dims makes this very general. However, for MoE we always split along the first logical dim (PyTorch has column-major weights and JAX has row-major, so we need to swap the usage of row-wise and column-wise data. However, we are still splitting along the first logical dim). We should decide whether future-proofing is worth the extra complexity.

Collaborator

Actually, on second thought I think my last bullet point is still true. For MoE, we are always splitting along the first logical dim. Internally we might be splitting a transposed matrix along the last dim, but that is a detail within the group tensor class, and shouldn't be exposed in the public API.
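To illustrate the point: splitting along the first logical dim of an [M, K] group reduces to computing per-tensor element offsets from the split sizes, regardless of whether the storage happens to be transposed internally. The helper below is a hypothetical illustration, not TE code:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: element offsets of each sub-tensor when an [M, K]
// tensor with row-major storage is split along the first logical dim into
// chunks of m_i rows. For a transposed copy the same logical split becomes
// a split along the storage's last dim, but that is an internal detail;
// the public API only needs the per-tensor m_i values.
std::vector<std::size_t> row_split_offsets(const std::vector<std::size_t>& m_i,
                                           std::size_t k) {
  std::vector<std::size_t> offsets{0};
  for (std::size_t m : m_i) offsets.push_back(offsets.back() + m * k);
  return offsets;  // offsets[i] is where the i-th sub-tensor starts
}
```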

phu0ngng and others added 2 commits November 17, 2025 17:57
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Comment on lines +293 to +299
SimpleTensor data;
SimpleTensor columnwise_data;
SimpleTensor scale_inv;
SimpleTensor columnwise_scale_inv;
SimpleTensor amax;
SimpleTensor columnwise_amax;
SimpleTensor scale; // for FP8-DS only
Collaborator

Having a giant pile of variables is fine since we want to make progress on this quickly, but in the future we should consider refactoring to handle polymorphism more gracefully:

// Visitor pattern
struct GroupedTensor {
 public:
  struct FP8Data {
    std::optional<SimpleTensor> data;
    std::optional<SimpleTensor> transpose;
    SimpleTensor scale_inv;
    std::optional<SimpleTensor> amax;
    std::optional<SimpleTensor> scale;
  };
  struct MXFP8Data {
    std::optional<std::tuple<SimpleTensor, SimpleTensor>> rowwise_data_and_scale;
    std::optional<std::tuple<SimpleTensor, SimpleTensor>> columnwise_data_and_scale;
  };
  std::variant<FP8Data, MXFP8Data, ...> data;
};

// Inheritance pattern
struct GroupedTensor { ... };
struct FP8GroupedTensor : public GroupedTensor {
    std::optional<SimpleTensor> data;
    std::optional<SimpleTensor> transpose;
    SimpleTensor scale_inv;
    std::optional<SimpleTensor> amax;
    std::optional<SimpleTensor> scale;
};
struct MXFP8GroupedTensor : public GroupedTensor {
    std::optional<std::tuple<SimpleTensor, SimpleTensor>> rowwise_data_and_scale;
    std::optional<std::tuple<SimpleTensor, SimpleTensor>> columnwise_data_and_scale;
};
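A minimal, compilable sketch of how the variant-based design above would be consumed with std::visit. SimpleTensor is stubbed out and the payload structs are trimmed for brevity; none of this is the actual TE implementation:

```cpp
#include <cstring>      // strcmp, for checking results
#include <optional>
#include <type_traits>
#include <variant>

struct SimpleTensor { /* dptr, shape, dtype, ... */ };

struct FP8Data {
  std::optional<SimpleTensor> data;
  SimpleTensor scale_inv;
};
struct MXFP8Data {
  std::optional<SimpleTensor> rowwise_data;
  std::optional<SimpleTensor> columnwise_data;
};

struct GroupedTensor {
  std::variant<FP8Data, MXFP8Data> data;
};

// A visitor keeps format-specific logic in one place instead of branching
// on a scaling-mode enum at every call site.
const char* format_name(const GroupedTensor& t) {
  return std::visit(
      [](const auto& d) -> const char* {
        using T = std::decay_t<decltype(d)>;
        if constexpr (std::is_same_v<T, FP8Data>) {
          return "FP8";
        } else {
          return "MXFP8";
        }
      },
      t.data);
}
```

The compiler rejects a visitor that forgets one alternative, which is the main safety advantage over the "giant pile of variables" layout.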

phu0ngng and others added 3 commits November 19, 2025 22:51
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
phu0ngng and others added 3 commits November 21, 2025 12:03
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@ptrendx ptrendx added the MoE label Nov 22, 2025
phu0ngng and others added 5 commits November 24, 2025 12:45
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng phu0ngng marked this pull request as ready for review November 24, 2025 21:33
@phu0ngng (Collaborator Author)

/te-ci L0

greptile-apps bot commented Nov 24, 2025

Greptile Overview

Greptile Summary

This PR introduces NVTEGroupedTensor, a new abstraction for managing collections of tensors with varying shapes but uniform dtype and scaling mode. The implementation follows the existing TensorAllocator pattern with a pool-based allocator using 1-based indexing and a free list for reuse.

Key additions:

  • GroupedTensor struct with flexible shape representation (logical_shape, first_dims, last_dims, tensor_offsets)
  • GroupedTensorAllocator for memory-efficient tensor management
  • Comprehensive validation helpers (CheckGroupedTensorShapeArrays, CheckInputGroupedTensor, CheckOutputGroupedTensor)
  • Complete C API with create/destroy/get/set operations
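The pool pattern described above (handles are 1-based indices into a vector, with freed slots recycled through a free list) can be sketched as follows. The names are illustrative, not the actual GroupedTensorAllocator:

```cpp
#include <cstdint>
#include <vector>

// Illustrative pool allocator: handles are 1-based indices (0 stays
// "invalid"), and freed slots are recycled via a free list.
struct Slot {
  int payload = 0;
  void clear() { payload = 0; }
};

class PoolAllocator {
 public:
  std::uintptr_t Allocate() {
    if (!free_list_.empty()) {
      std::uintptr_t index = free_list_.back();
      free_list_.pop_back();
      memory_[index - 1].clear();  // reinitialize to avoid stale data
      return index;
    }
    memory_.emplace_back();
    return memory_.size();  // index of the new slot, 1-based
  }
  void Free(std::uintptr_t index) {
    memory_[index - 1].clear();
    free_list_.push_back(index);
  }
  Slot& Get(std::uintptr_t index) { return memory_[index - 1]; }

 private:
  std::vector<Slot> memory_;
  std::vector<std::uintptr_t> free_list_;
};
```

Returning the index as an opaque handle (cast to a pointer type in the real API) keeps the vector free to grow without invalidating handles.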

Previous review comments addressed:
Most issues from prior reviews appear to have been addressed or are protected by validation guards. The code includes appropriate checks for empty tensors, validates num_tensors > 0 at creation, and wraps dtype-dependent checks in data existence conditions.

Confidence Score: 3/5

  • This PR has solid architecture and validation but contains thread safety concerns that need attention before merging
  • Score reflects well-designed API and validation logic, but deducted points for: (1) potential thread safety issue in convertNVTEGroupedTensor accessing vector without lock despite atomic size check, (2) inconsistent null-checking patterns (dptr != nullptr vs has_data()), and (3) lack of tests mentioned in checklist. The experimental nature is appropriately marked.
  • Pay close attention to transformer_engine/common/transformer_engine.cpp, particularly the GroupedTensorAllocator::convertNVTEGroupedTensor method for thread safety issues
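The flagged race, checking an atomic size and then indexing the vector without a lock, can be shown and closed in miniature as follows. This hypothetical Registry class is not the TE implementation; it only illustrates why both the bounds check and the access must sit under the same lock:

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// If another thread's push_back can reallocate the vector between a lock-free
// size check and the element access, the access may read freed memory.
// Holding the mutex across both steps closes that window.
class Registry {
 public:
  std::size_t Add(int value) {
    std::lock_guard<std::mutex> lock(mutex_);
    items_.push_back(value);
    return items_.size();  // 1-based handle; 0 stays invalid
  }
  bool Lookup(std::size_t handle, int* out) {
    std::lock_guard<std::mutex> lock(mutex_);  // guards size check AND access
    if (handle == 0 || handle > items_.size()) return false;
    *out = items_[handle - 1];
    return true;
  }

 private:
  std::mutex mutex_;
  std::vector<int> items_;
};
```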

Important Files Changed

File Analysis

  • transformer_engine/common/common.h (4/5): Adds GroupedTensor struct with shape metadata helpers (first_dims, last_dims, tensor_offsets) and validation methods. Includes has_data() method for SimpleTensor. Minor TODO comments indicate future refactoring.
  • transformer_engine/common/include/transformer_engine/transformer_engine.h (5/5): Adds C API for grouped tensors including creation, destruction, parameter getters/setters, and query functions. Well-documented with experimental markers. Clean interface design.
  • transformer_engine/common/transformer_engine.cpp (3/5): Implements GroupedTensorAllocator and validation helpers (CheckGroupedTensorShapeArrays, CheckInputGroupedTensor, CheckOutputGroupedTensor). Contains thread safety concerns in convertNVTEGroupedTensor and inconsistent null-checking patterns.

Sequence Diagram

sequenceDiagram
    participant User
    participant C_API as C API Layer
    participant Allocator as GroupedTensorAllocator
    participant Memory as Vector<GroupedTensor>
    participant Validator as Validation Functions

    User->>C_API: nvte_create_grouped_tensor(mode, num_tensors, logical_shape)
    C_API->>C_API: Validate num_tensors > 0
    C_API->>C_API: Validate logical_shape 2D and positive
    C_API->>Allocator: Allocate(mode, num_tensors, logical_shape)
    
    alt Free list not empty
        Allocator->>Allocator: Pop index from free_list
        Allocator->>Memory: memory[index-1].clear()
        Allocator->>Memory: Set scaling_mode, num_tensors, logical_shape
    else New allocation needed
        Allocator->>Memory: emplace_back(mode, num_tensors)
        Allocator->>Allocator: Update atomic size variable
        Allocator->>Memory: Set logical_shape
    end
    
    Allocator-->>C_API: Return NVTEGroupedTensor (index as void*)
    C_API-->>User: Return tensor handle

    User->>C_API: nvte_set_grouped_tensor_param(tensor, param, data)
    C_API->>Allocator: convertNVTEGroupedTensor(tensor)
    Note over Allocator: Race condition risk: reads atomic size<br/>without mutex, accesses memory vector
    Allocator-->>C_API: Return GroupedTensor*
    C_API->>Memory: Set parameter (data, first_dims, etc.)
    C_API-->>User: Parameter set

    User->>Validator: CheckInputGroupedTensor(tensor)
    Validator->>Validator: Check has_data() or has_columnwise_data()
    Validator->>Validator: CheckGroupedScaleInv()
    Validator->>Validator: CheckGroupedTensorShapeArrays()
    
    Note over Validator: Validates:<br/>- Shape arrays (first_dims, last_dims, tensor_offsets)<br/>- Logical shape is 2D<br/>- Data size matches logical_shape<br/>- Scale/scale_inv dtypes
    
    Validator-->>User: Validation result

    User->>C_API: nvte_destroy_grouped_tensor(tensor)
    C_API->>Allocator: Free(tensor)
    Allocator->>Memory: memory[index-1].clear()
    Allocator->>Allocator: Push index to free_list
    C_API-->>User: Tensor destroyed

@greptile-apps greptile-apps bot left a comment: 3 files reviewed, 2 comments

phu0ngng and others added 2 commits November 24, 2025 16:40
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps greptile-apps bot left a comment: 3 files reviewed, 1 comment

phu0ngng and others added 2 commits November 24, 2025 16:52
@greptile-apps greptile-apps bot left a comment: 3 files reviewed, 1 comment

Comment on lines +451 to +452
void nvte_set_grouped_tensor_param(NVTEGroupedTensor *tensor, NVTEGroupedTensorParam param_name,
const NVTEBasicTensor *param);
Collaborator

This works when we're setting basic tensors, but doesn't generalize to other types (bool/float/etc). Consider using a more general API like how we handle NVTEQuantizationConfig:

void nvte_set_quantization_config_attribute(NVTEQuantizationConfig config,
                                            NVTEQuantizationConfigAttribute attr,
                                            const void *buf, size_t size_in_bytes);

This is completely general, but also more cumbersome.
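For reference, the buffer-based pattern quoted above typically dispatches on the attribute enum and validates size_in_bytes at run time. The sketch below uses made-up names (Config, AttrName, set_attribute) rather than the TE API, to show how one untyped setter can still handle differently-typed fields:

```cpp
#include <cstddef>
#include <cstring>

// Illustrative (non-TE) attribute setter: one entry point covers any
// attribute type by copying raw bytes, with size acting as a run-time
// type check in place of compile-time typing.
enum AttrName { ATTR_FLAG, ATTR_SCALE };

struct Config {
  bool flag = false;
  float scale = 1.0f;
};

int set_attribute(Config* cfg, AttrName attr, const void* buf, std::size_t size) {
  switch (attr) {
    case ATTR_FLAG:
      if (size != sizeof(bool)) return -1;  // size mismatch: reject
      std::memcpy(&cfg->flag, buf, size);
      return 0;
    case ATTR_SCALE:
      if (size != sizeof(float)) return -1;
      std::memcpy(&cfg->scale, buf, size);
      return 0;
  }
  return -1;
}
```

The trade-off is exactly as stated: the interface never changes when fields are added, but callers lose compile-time type checking and must pass the right size.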

Collaborator Author

Hi, I don't think we can go with a similar API, i.e., using just a void* buf and size_in_bytes, since we need different dtypes for different fields.

Collaborator

Yet.

@timmoon10 timmoon10 self-requested a review November 25, 2025 02:47
phu0ngng and others added 2 commits November 25, 2025 09:55
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps greptile-apps bot left a comment: 3 files reviewed, no comments

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps greptile-apps bot left a comment: 3 files reviewed, no comments

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps greptile-apps bot left a comment: 3 files reviewed, 2 comments

phu0ngng and others added 2 commits November 26, 2025 10:28
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps greptile-apps bot left a comment: 3 files reviewed, no comments

@phu0ngng (Collaborator Author)

/te-ci L0

@timmoon10 timmoon10 left a comment

Overall LGTM, but this is hitting PyTorch test failures due to changes in Tensor::has_data/Tensor::has_columnwise_data. We should merge #2330 first.

phu0ngng and others added 3 commits November 26, 2025 13:50
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng (Collaborator Author)

/te-ci L0

@greptile-apps greptile-apps bot left a comment: 3 files reviewed, no comments

NVTEGroupedTensor ret = reinterpret_cast<NVTEGroupedTensor>(index);
free_list.pop_back();
// 1-based indexing - fully reinitialize the tensor to avoid stale data
memory[index - 1].clear();
Member

Why do we clear it again?

@timmoon10 timmoon10 left a comment

LGTM
