xiaoxi-wangfj
Contributor

Description

In the FP8 dataflow (dispatch → expert fc1) under 1F1B overlap, passing a QuantizedTensor (FP8 payload plus scale_inv metadata) from dispatch to expert fc1 requires switching to the work stream and safely releasing memory still owned by the previous stream. This reduces peak HBM usage and avoids unnecessary memory retention.
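
For context, a minimal sketch of the cross-stream handoff this enables is shown below. The helper name hand_off_to_work_stream and the expert_fc1 call are illustrative placeholders rather than code from this PR, and the real 1F1B-overlap schedule may compose these steps differently.

```python
import torch

def hand_off_to_work_stream(qt, dispatch_stream, work_stream):
    """Illustrative handoff of a QuantizedTensor `qt` produced on `dispatch_stream`."""
    # Make the work stream wait for the dispatch stream's pending kernels.
    work_stream.wait_stream(dispatch_stream)
    with torch.cuda.stream(work_stream):
        # Mark qt's buffers as in use by work_stream so the caching allocator
        # will not reuse or free them before the asynchronous work finishes
        # (this relies on the aten.record_stream handler described below).
        qt.record_stream(work_stream)
        out = expert_fc1(qt)  # placeholder for the expert fc1 compute
    # The dispatch side no longer needs the payload: shrink its storage to zero
    # so the memory goes back to the caching allocator immediately.
    qt.untyped_storage().resize_(0)
    return out
```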

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • torch_dispatch handler for aten.record_stream on QuantizedTensor:
    We record all relevant CUDA buffers inside the quantized tensor (_rowwise_data/_columnwise_data and their _rowwise_scale_inv/_columnwise_scale_inv) onto the provided stream via record_stream(stream). This does not change tensor values; it only updates storage lifetime metadata so the allocator won't reuse or free the memory before the stream finishes its asynchronous work (see the sketch after this list).

  • Expose QuantizedTensor.untyped_storage():
    Returns the payload’s underlying UntypedStorage. Callers can then run resize_(0) to immediately shrink the storage capacity to zero and return it to the caching allocator (on CUDA).
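
A condensed sketch of the two hooks is below. This is not the actual TransformerEngine implementation: QuantizedTensorSketch, its constructor, and the bare NotImplementedError fallback are simplified stand-ins, and only the buffer attributes named above are assumed.

```python
import torch

class QuantizedTensorSketch(torch.Tensor):
    """Simplified stand-in for QuantizedTensor, showing only stream-lifetime handling."""

    @staticmethod
    def __new__(cls, rowwise_data, rowwise_scale_inv,
                columnwise_data=None, columnwise_scale_inv=None):
        # Wrapper subclass: the FP8 payload and scale_inv tensors live as attributes.
        self = torch.Tensor._make_wrapper_subclass(
            cls, rowwise_data.shape, dtype=torch.float32, device=rowwise_data.device)
        self._rowwise_data = rowwise_data
        self._rowwise_scale_inv = rowwise_scale_inv
        self._columnwise_data = columnwise_data
        self._columnwise_scale_inv = columnwise_scale_inv
        return self

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        if func == torch.ops.aten.record_stream.default:
            qt, stream = args
            # Record every CUDA buffer held by the quantized tensor on the target
            # stream so the caching allocator keeps the blocks alive until that
            # stream's pending asynchronous work has completed.
            for buf in (qt._rowwise_data, qt._columnwise_data,
                        qt._rowwise_scale_inv, qt._columnwise_scale_inv):
                if buf is not None:
                    buf.record_stream(stream)
            return None
        raise NotImplementedError(f"{func} is outside the scope of this sketch")

    def untyped_storage(self):
        # Hand out the payload's UntypedStorage so callers can release it, e.g.
        # qt.untyped_storage().resize_(0) to return the memory to the allocator.
        return self._rowwise_data.untyped_storage()
```

Routing aten.record_stream through __torch_dispatch__ lets a plain qt.record_stream(stream) call work on the subclass without special-casing, and untyped_storage() gives callers a direct handle for the resize_(0) release described above.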

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: xiaoxi-wangfj <690912414@qq.com>