
[NPUW] Implement inplace kv cache copy when it's shared #33201

Open

DingZhangIntel wants to merge 19 commits into openvinotoolkit:master from DingZhangIntel:Ding/KVCacheCopy

Conversation

@DingZhangIntel (Contributor)

Details:

Implement inplace kv cache copy when it's shared

Tickets:

EISW-194492

@github-actions bot added the "category: NPU" (OpenVINO NPU plugin) and "category: NPUW" (NPUW plugin) labels on Dec 11, 2025
@DingZhangIntel marked this pull request as ready for review on December 18, 2025 08:25
@DingZhangIntel requested review from a team as code owners on December 18, 2025 08:25
@dmatveev added this to the 2026.0 milestone on Dec 22, 2025
@dmatveev (Contributor)

@esmirno @AlexanderKalistratov please review

@dmatveev (Contributor) left a comment

From what I can tell, there may be tests required for this function?
Also, could this benefit even more if it gets an AVX2 dispatch similar to our unpack group?

@AlexanderKalistratov (Contributor) left a comment

  1. Please add tests.
  2. It feels overcomplicated. I don't think we really need 3 different implementations.
    First of all, we can assume that we are always contiguous on the last dimension, so we can always do std::memmove on the last dimension.
    Then implement a generic in-place copy for an arbitrary number of dimensions.
    Then, if the innermost dimension of the src and dst tensors is equal, reshape the tensors to a lesser number of dimensions:
[1, D01, D02, D03], [1, D11, D12, D13]
If D03 == D13:
    reshape [1, D01, D02, D03] => [1, D01, D02*D03]
    reshape [1, D11, D12, D13] => [1, D11, D12*D13]

So we would have only one generic implementation. The generic implementation could be based on copy_inplace_columns_by_row_chunks, but would take the strides/sizes of the other dimensions into consideration to calculate the correct offset for each line.
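A rough sketch of the single generic implementation described above (an illustration only, not code from this PR; it assumes byte strides from get_strides(), equal src/dst shapes, a packed last dimension, and that visiting rows front-to-back never overwrites source data that is still needed):

    // Needs <cstring>, <cstdint>, <vector> and the OpenVINO ITensor headers.
    void copy_inplace_generic_sketch(const ov::SoPtr<ov::ITensor>& src, ov::SoPtr<ov::ITensor>& dst) {
        OPENVINO_ASSERT(src->data() == dst->data());  // the buffer is shared
        const auto elem_size = src->get_element_type().size();
        const auto& shape = src->get_shape();         // assumed equal to dst's shape
        const auto src_strides = src->get_strides();  // byte strides
        const auto dst_strides = dst->get_strides();
        const size_t rank = shape.size();
        const size_t row_bytes = shape[rank - 1] * elem_size;  // packed last dimension
        auto* base = static_cast<uint8_t*>(src->data());

        std::vector<size_t> idx(rank - 1, 0);         // multi-index over all dims but the last
        bool done = (rank == 1);
        do {
            size_t src_off = 0, dst_off = 0;
            for (size_t d = 0; d + 1 < rank; ++d) {
                src_off += idx[d] * src_strides[d];
                dst_off += idx[d] * dst_strides[d];
            }
            std::memmove(base + dst_off, base + src_off, row_bytes);  // one "line" per memmove
            for (size_t d = rank - 1; d-- > 0;) {     // advance the multi-index
                if (++idx[d] < shape[d]) break;
                idx[d] = 0;
                if (d == 0) done = true;
            }
        } while (!done);
    }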

}
}

void ov::npuw::util::copy_inplace_columns_by_row_chunks(ov::SoPtr<ov::ITensor> src, ov::SoPtr<ov::ITensor>& dst) {

Contributor:

void ov::npuw::util::copy_inplace_columns_by_row_chunks(const ov::SoPtr<ov::ITensor>& src, ov::SoPtr<ov::ITensor>& dst)

Contributor Author:

Fixed at latest commit.

}
}

void ov::npuw::util::copy_inplace_by_planes(ov::SoPtr<ov::ITensor> src_tensor, ov::SoPtr<ov::ITensor> dst_tensor) {

Contributor:

void ov::npuw::util::copy_inplace_by_planes(const ov::SoPtr<ov::ITensor>& src_tensor, ov::SoPtr<ov::ITensor>& dst_tensor)

Contributor Author:

Done.

// Requirements:
// - kv_dim_src == kv_dim_dst, otherwise throws
// - src_tensor->data() == dst_tensor->data()
void ov::npuw::util::copy_tensor_inplace_by_dim(ov::SoPtr<ov::ITensor> src_tensor,

Contributor:

void ov::npuw::util::copy_tensor_inplace_by_dim(const ov::SoPtr<ov::ITensor>& src_tensor, ov::SoPtr<ov::ITensor>& dst_tensor, ...

Contributor Author:

Done.

}
}

void ov::npuw::util::copy_inplace(ov::SoPtr<ov::ITensor> src_tensor, ov::SoPtr<ov::ITensor> dst_tensor) {

Contributor:

void ov::npuw::util::copy_inplace(const ov::SoPtr<ov::ITensor>& src_tensor, ov::SoPtr<ov::ITensor>& dst_tensor)

Contributor Author:

Done.

// FIXME: Implement XARCH::unpack
LOG_INFO("######################## unpack_f8f16");
unpack_f8f16(from, scale, to, unpack_options);
//ov::npuw::util::XARCH::unpack_f8f16_scale(from, scale, to, unpack_options);

Contributor:

This is part of another PR, I guess; please remove.

Contributor Author:

Thanks for the reminder; removed that part.

auto t_start = std::chrono::high_resolution_clock::now();
copy_kvcache();
// End counting time.
auto t_end = std::chrono::high_resolution_clock::now();

Contributor:

Please use the existing utilities like the profiler and ms_to_run:


    // Quick-and-dirty profiling
    using MS = ov::npuw::perf::metric<ov::npuw::perf::MSec>;
    using B = ov::npuw::perf::counter<ov::npuw::perf::Bytes>;

    MS m_ms_unpack;
    ov::npuw::perf::Profile<MS> m_profile;
    mutable ov::npuw::perf::Profile<B> m_footprint; 

   m_ms_unpack += ov::npuw::perf::ms_to_run([&](){
    ov::parallel_for(closure_copy_required.size(), [&](std::size_t j) {
        auto cidx = closure_copy_required[j];
        auto& closure = desc_closure[cidx];
        const auto closure_param_id = comp_model_desc.param_base + cidx;
        auto& iport = func_desc.compiled_model->inputs()[closure_param_id];
        auto clparam = request->get_tensor(iport);
        ov::get_tensor_impl(closure)->copy_to(clparam._ptr);
    });
   }); // ms_to_run
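
Applied to the copy_kvcache() call from the diff above, the suggestion would look roughly like this (a sketch; m_ms_copy_kvcache is a hypothetical metric member, not something from the PR):

    m_ms_copy_kvcache += ov::npuw::perf::ms_to_run([&]() {
        copy_kvcache();
    });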

Contributor Author:

Fixed at latest commit.

// This is necessary because subsequent copy operations would overwrite the shared buffer
auto prefill_past_kv = m_prefill_request->get_tensor(m_prefill_in_ports.at(input_name));
ov::SoPtr<ov::ITensor> tmp_dense_kv_tensor;
auto kvcache_past_kv_chunks = uu::make_tensor_slice(kvcache_in_tensor,

@esmirno (Contributor), Jan 7, 2026:

By the way, sometimes this make_tensor_slice gets called without the uu namespace, but I don't see any using-namespace declarations, so I would suggest aligning all usages. Also, I see you've commented out the implementation of make_tensor_slice - is this temporary?

Contributor:

Also, try switching to the utils::view helper, as it looks like it fully covers the functionality of make_tensor_slice.

Contributor Author:

I standardized the usage pattern and updated the code to consistently use uu::make_tensor_slice.
Also, there’s a small difference between uu::make_tensor_slice and utils::view:
The last parameter of uu::make_tensor_slice represents the end position, while the last parameter of utils::view represents the slice length. This difference wouldn’t prevent us from switching to utils::view, but to stay consistent with the other functions in llm_infer_request.cpp, I’m keeping uu::make_tensor_slice for now.
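
For illustration only (both helper signatures and the argument order are assumed here, not quoted from the code), the same slice would be expressed roughly as:

    // Both calls are meant to select indices [start, start + len) along dim.
    auto s1 = uu::make_tensor_slice(tensor, dim, start, start + len);  // last argument: end position
    auto s2 = ov::npuw::util::view(tensor, dim, start, len);           // last argument: slice length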

uint8_t* src_ptr = base + src_off;
uint8_t* dst_ptr = base + dst_off;
if (src_ptr != dst_ptr) {
std::memmove(dst_ptr, src_ptr, row_bytes);

Contributor:

So it is not an AVX version but rather uses memmove? OK, if that works we need exact perf data, and I think tests as well for a bunch of actual cases found in LLM workloads.

Contributor Author:

In the earlier ticket about optimizing copy, I tried replacing std::memcpy with an AVX2 implementation, but it resulted in almost no performance improvement. The std::memmove used here relies on essentially the same highly optimized underlying implementation as std::memcpy, so I didn’t pursue an additional AVX2 optimization in this case. I’m still running further measurements, and I’ll share more details once those tests are complete.

@dmatveev modified the milestones: 2026.0, 2026.1 on Jan 26, 2026
@dmatveev (Contributor)

CI keeps failing (probably not related to this PR) and the effect is yet to be measured, so removing the 26.0-related tags to avoid gating the release.

kvcache_past_kv_chunks,
pre_kv_dim,
gen_kv_dim);
} else {

Contributor:

For future readers, we need to add a comment here that in-place copy is not possible when we have v-transpose OFF/ON x-models.

Contributor Author:

Added comments in the latest commit.

// Fallback: last dimension not packed in either src or dst.
// We cannot memmove row_bytes as a contiguous block. Do element-wise memmove.
// ---------------------------------------------------------------------
if (src_strides0[rank0 - 1] != elem_size || dst_strides0[rank0 - 1] != elem_size) {

Contributor:

So, do we really have such cases?
In which situation is it possible?

Also, I'm not sure about the int4 data type - would this check still be valid?

If this is a hypothetical situation, I'd prefer to have an assert instead.
If this is a real situation, please move the implementation to a separate function.

@DingZhangIntel (Contributor Author), Jan 29, 2026:

I think the last dimension is packed in the current situation, so the element-wise fallback was unnecessary/hypothetical. I replaced that branch with OPENVINO_ASSERT, and added an assertion that sub-byte element types (e.g. int4/uint4) are not supported. If the KV cache layout or element type changes in the future, I can introduce a dedicated implementation as a separate function.
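
Roughly what that replacement could look like (a sketch using the identifiers already visible in this thread, not the exact committed code):

    // Sub-byte element types (e.g. i4/u4) are not supported by this in-place copy.
    OPENVINO_ASSERT(src_tensor->get_element_type().bitwidth() % 8 == 0,
                    "copy_inplace: sub-byte element types are not supported");
    // The last dimension is expected to be packed in both src and dst.
    OPENVINO_ASSERT(src_strides0[rank0 - 1] == elem_size && dst_strides0[rank0 - 1] == elem_size,
                    "copy_inplace: expected a packed last dimension");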

return;
}

OPENVINO_ASSERT(src_strides0[rank0 - 1] == elem_size);

Contributor:

I think it is safe for now to assume the kv cache is not int4. But that could change one day.

Contributor Author:

Fixed in the latest commit.

(src_strides0[inverted_idx] == dst_strides0[inverted_idx]);
if (ok) {
cut = inverted_idx;
if (inverted_idx == 0) {

Contributor:

Just make rank0, cut, and inverted_idx int64_t.

Then you can write a clearer loop condition and remove this strange construct.
Btw, I'm not sure you really need it even now. Isn't inverted_idx < rank0 enough?

Contributor Author:

I kept rank0/cut/inverted_idx as size_t to match ov::Shape/ov::Strides and rewrote the reverse loop using for (size_t i = rank0; i-- > 0;), which avoids the strange construct.
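
The rewritten reverse loop could look roughly like this (the per-dimension condition is partly assumed; src_shape0/dst_shape0 are hypothetical names mirroring the existing strides arrays):

    size_t cut = rank0;  // start of the trailing block that can be folded
    for (size_t inverted_idx = rank0; inverted_idx-- > 0;) {
        const bool ok = (src_shape0[inverted_idx] == dst_shape0[inverted_idx]) &&
                        (src_strides0[inverted_idx] == dst_strides0[inverted_idx]);
        if (!ok) {
            break;  // stop folding at the first mismatching dimension
        }
        cut = inverted_idx;
    }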

}

// Fold [cut..rank0-1] into a single last dimension.
ov::Shape shape;

Contributor:

Let's move all the logic related to the folding to a separate function, starting from the loop which calculates cut.
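
A hypothetical shape of the extracted helper (the name and signature are assumptions; the real code may differ):

    // Collapse dimensions [cut .. rank-1] of a shape into a single trailing dimension.
    static ov::Shape fold_trailing_dims(const ov::Shape& shape, size_t cut) {
        size_t folded = 1;
        for (size_t d = cut; d < shape.size(); ++d) {
            folded *= shape[d];
        }
        ov::Shape out;
        out.insert(out.end(), shape.begin(), shape.begin() + cut);
        out.push_back(folded);
        return out;
    }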

Contributor Author:

Done in the latest commit.
