@qyh111 qyh111 commented Oct 28, 2025

Purpose

Add chunk-size support: one block offset may now correspond to multiple tensors, which must be aggregated before dumping.

What this PR does / why we need it?

Modifications

The original interface:

# One offset of a block corresponds to one tensor on HBM
def load(self, block_ids: List[str], offset: List[int], dst_tensor: List[torch.Tensor]) -> Task:
def fetch_data(self, block_ids: List[str], offset: List[int], dst_addr: List[int], size: List[int]) -> Task:

This now needs to change to:

# One offset of a block corresponds to n tensors / n addresses on HBM;
# these n tensors cannot be dumped individually and must be aggregated
def load(self, block_ids: List[str], offset: List[int], dst_tensor: List[List[torch.Tensor]]) -> Task:
def fetch_data(self, block_ids: List[str], offset: List[int], dst_addr: List[List[int]], size: List[int]) -> Task:
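To illustrate the shape change, here is a minimal pure-Python sketch of the nested dst_tensor argument. Everything here is a stand-in: fake_storage, load_chunked, and the bytearray "tensors" are hypothetical models of the real HBM-backed torch.Tensor handles and Task-based API, used only to show how each (block_id, offset) pair now fans out to n destination buffers.

```python
from typing import Dict, List, Tuple

# Hypothetical backing store: one aggregated blob per (block_id, offset).
fake_storage: Dict[Tuple[str, int], bytes] = {("b0", 0): bytes(range(8))}

def load_chunked(block_ids: List[str],
                 offsets: List[int],
                 dst_tensors: List[List[bytearray]]) -> None:
    # Each (block_id, offset) now maps to n destination buffers instead of 1.
    assert len(block_ids) == len(offsets) == len(dst_tensors)
    for bid, off, chunks in zip(block_ids, offsets, dst_tensors):
        # Fetch the whole aggregated region once, then scatter it over
        # the n chunk buffers in order.
        blob = fake_storage[(bid, off)]
        pos = 0
        for chunk in chunks:
            n = len(chunk)
            chunk[:] = blob[pos:pos + n]
            pos += n
        assert pos == len(blob)  # the chunks must cover the region exactly

# One block/offset, two destination "tensors" of 4 bytes each.
dst = [[bytearray(4), bytearray(4)]]
load_chunked(["b0"], [0], dst)
```

The same nesting applies to dst_addr in fetch_data: the inner list carries the n raw addresses for one offset.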

Alternatively, leave the original interface unchanged and add a new dump_data_batch interface:

struct Shard {
        Type type;
        Location location;
        std::string block;
        size_t offset;
        std::vector<uintptr_t> address; // one offset corresponds to multiple addresses
        size_t length;                  // total size of the n tensors
        size_t owner;
        std::shared_ptr<void> buffer;
        std::function<void(void)> done;
};

During a dump, inside the posix_queue, the multiple addresses are aggregated into a pre-allocated buffer:

    shard.buffer = device->GetBuffer(shard.length);
    if (!shard.buffer) {
        UC_ERROR("Out of memory({}).", shard.length);
        return Status::OutOfMemory();
    }
    auto hub = shard.buffer.get();
    const auto n = shard.address.size();
    const auto slot = shard.length / n;  // equal-size slot per device address
    std::vector<std::byte*> dAddr(n);    // std::vector avoids leaking raw new[] arrays
    std::vector<std::byte*> hAddr(n);
    for (size_t i = 0; i < n; i++) {
        hAddr[i] = static_cast<std::byte*>(hub) + i * slot;
        dAddr[i] = reinterpret_cast<std::byte*>(shard.address[i]);
    }
    auto status = device->D2HBatchSync(hAddr.data(), const_cast<const std::byte**>(dAddr.data()), n, slot);
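The address arithmetic above can be mirrored in a small pure-Python model; this is only a sketch (d2h_batch_aggregate is an illustrative name, and the D2HBatchSync device copy is simulated with slice assignment into a host bytearray):

```python
from typing import List

def d2h_batch_aggregate(device_chunks: List[bytes], length: int) -> bytearray:
    # Mirrors the Shard dump path: one host buffer of `length` bytes,
    # divided into equal-size slots, one slot per device address.
    n = len(device_chunks)
    assert length % n == 0, "length must be the total of n equal-size chunks"
    per = length // n
    hub = bytearray(length)              # shard.buffer = device->GetBuffer(length)
    for i, src in enumerate(device_chunks):
        assert len(src) == per           # every chunk fills exactly one slot
        hub[i * per:(i + 1) * per] = src # stand-in for the D2H copy of slot i
    return hub

# Two 2-byte device chunks aggregate into one 4-byte host buffer.
hub = d2h_batch_aggregate([b"\x01\x01", b"\x02\x02"], 4)
```

Note that this scheme assumes all n tensors behind one offset are the same size (shard.length / shard.address.size() divides evenly); if chunk sizes can differ, a per-chunk size vector would be needed instead of a single slot size.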

Does this PR introduce any user-facing change?

Test

How was this patch tested?

@qyh111 qyh111 force-pushed the dev_chunk_size branch 2 times, most recently from ff28f53 to e562c1c on October 29, 2025 at 11:40