
[NPUW] Implement inplace kv cache copy when it's shared #33201

Open

DingZhangIntel wants to merge 19 commits into openvinotoolkit:master from DingZhangIntel:Ding/KVCacheCopy

Conversation

@DingZhangIntel (Contributor)

Details:

Implement inplace kv cache copy when it's shared

Tickets:

EISW-194492

@github-actions bot added the "category: NPU" (OpenVINO NPU plugin) and "category: NPUW" (NPUW plugin) labels on Dec 11, 2025
@DingZhangIntel marked this pull request as ready for review on December 18, 2025 08:25
@DingZhangIntel requested review from a team as code owners on December 18, 2025 08:25
@dmatveev added this to the 2026.0 milestone on Dec 22, 2025
@dmatveev (Contributor)

@esmirno @AlexanderKalistratov please review

@dmatveev (Contributor) left a comment

From what I can tell, there may be tests required for this function?
Also, could this benefit even more if it gets an AVX2 dispatch similar to our unpack group?

@AlexanderKalistratov (Contributor) left a comment

  1. Please add tests.
  2. It feels overcomplicated. I don't think we really need 3 different implementations.
    First of all, we can assume that we are always contiguous on the last dimension, so we can always do std::memmove on the last dimension.
    Then implement a generic in-place copy for an arbitrary number of dimensions.
    Then, if the innermost dimension of the src and dst tensors is equal, reshape the tensors to a lesser number of dimensions:
[1, D01, D02, D03], [1, D11, D12, D13]
If D03 == D13:
    reshape [1, D01, D02, D03] => [1, D01, D02*D03]
    reshape [1, D11, D12, D13] => [1, D11, D12*D13]

So we would have only one generic implementation. The generic implementation could be based on copy_inplace_columns_by_row_chunks, but would take the strides/sizes of the other dimensions into consideration to calculate the correct offset for each line.
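A rough sketch of the single generic implementation described above (an illustration only, not code from this PR; it assumes byte strides from get_strides(), equal src/dst shapes, a packed last dimension, and that visiting rows front-to-back never overwrites source data that is still needed):

    // Needs <cstring>, <cstdint>, <vector> and the OpenVINO ITensor headers.
    void copy_inplace_generic_sketch(const ov::SoPtr<ov::ITensor>& src, ov::SoPtr<ov::ITensor>& dst) {
        OPENVINO_ASSERT(src->data() == dst->data());  // the buffer is shared
        const auto elem_size = src->get_element_type().size();
        const auto& shape = src->get_shape();         // assumed equal to dst's shape
        const auto src_strides = src->get_strides();  // byte strides
        const auto dst_strides = dst->get_strides();
        const size_t rank = shape.size();
        const size_t row_bytes = shape[rank - 1] * elem_size;  // packed last dimension
        auto* base = static_cast<uint8_t*>(src->data());

        std::vector<size_t> idx(rank - 1, 0);         // multi-index over all dims but the last
        bool done = (rank == 1);
        do {
            size_t src_off = 0, dst_off = 0;
            for (size_t d = 0; d + 1 < rank; ++d) {
                src_off += idx[d] * src_strides[d];
                dst_off += idx[d] * dst_strides[d];
            }
            std::memmove(base + dst_off, base + src_off, row_bytes);  // one "line" per memmove
            for (size_t d = rank - 1; d-- > 0;) {     // advance the multi-index
                if (++idx[d] < shape[d]) break;
                idx[d] = 0;
                if (d == 0) done = true;
            }
        } while (!done);
    }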

}
}

void ov::npuw::util::copy_inplace_columns_by_row_chunks(ov::SoPtr<ov::ITensor> src, ov::SoPtr<ov::ITensor>& dst) {

Contributor:

void ov::npuw::util::copy_inplace_columns_by_row_chunks(const ov::SoPtr<ov::ITensor>& src, ov::SoPtr<ov::ITensor>& dst)

Contributor Author:

Fixed at latest commit.

}
}

void ov::npuw::util::copy_inplace_by_planes(ov::SoPtr<ov::ITensor> src_tensor, ov::SoPtr<ov::ITensor> dst_tensor) {

Contributor:

void ov::npuw::util::copy_inplace_by_planes(const ov::SoPtr<ov::ITensor>& src_tensor, ov::SoPtr<ov::ITensor>& dst_tensor)

Contributor Author:

Done.

// Requirements:
// - kv_dim_src == kv_dim_dst, otherwise throws
// - src_tensor->data() == dst_tensor->data()
void ov::npuw::util::copy_tensor_inplace_by_dim(ov::SoPtr<ov::ITensor> src_tensor,

Contributor:

void ov::npuw::util::copy_tensor_inplace_by_dim(const ov::SoPtr<ov::ITensor>& src_tensor, ov::SoPtr<ov::ITensor>& dst_tensor, ...

Contributor Author:

Done.

}
}

void ov::npuw::util::copy_inplace(ov::SoPtr<ov::ITensor> src_tensor, ov::SoPtr<ov::ITensor> dst_tensor) {

Contributor:

void ov::npuw::util::copy_inplace(const ov::SoPtr<ov::ITensor>& src_tensor, ov::SoPtr<ov::ITensor>& dst_tensor)

Contributor Author:

Done.

// FIXME: Implement XARCH::unpack
LOG_INFO("######################## unpack_f8f16");
unpack_f8f16(from, scale, to, unpack_options);
//ov::npuw::util::XARCH::unpack_f8f16_scale(from, scale, to, unpack_options);

Contributor:

This is part of another PR, I guess; please remove.

Contributor Author:

Thanks for the reminder; removed that part.

auto t_start = std::chrono::high_resolution_clock::now();
copy_kvcache();
// End counting time.
auto t_end = std::chrono::high_resolution_clock::now();

Contributor:

Please use the existing utilities like the profiler and ms_to_run:


    // Quick-and-dirty profiling
    using MS = ov::npuw::perf::metric<ov::npuw::perf::MSec>;
    using B = ov::npuw::perf::counter<ov::npuw::perf::Bytes>;

    MS m_ms_unpack;
    ov::npuw::perf::Profile<MS> m_profile;
    mutable ov::npuw::perf::Profile<B> m_footprint; 

   m_ms_unpack += ov::npuw::perf::ms_to_run([&](){
    ov::parallel_for(closure_copy_required.size(), [&](std::size_t j) {
        auto cidx = closure_copy_required[j];
        auto& closure = desc_closure[cidx];
        const auto closure_param_id = comp_model_desc.param_base + cidx;
        auto& iport = func_desc.compiled_model->inputs()[closure_param_id];
        auto clparam = request->get_tensor(iport);
        ov::get_tensor_impl(closure)->copy_to(clparam._ptr);
    });
   }); // ms_to_run
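
Applied to the copy_kvcache() call from the diff above, the suggestion would look roughly like this (a sketch; m_ms_copy_kvcache is a hypothetical metric member, not something from the PR):

    m_ms_copy_kvcache += ov::npuw::perf::ms_to_run([&]() {
        copy_kvcache();
    });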

Contributor Author:

Fixed at latest commit.

// This is necessary because subsequent copy operations would overwrite the shared buffer
auto prefill_past_kv = m_prefill_request->get_tensor(m_prefill_in_ports.at(input_name));
ov::SoPtr<ov::ITensor> tmp_dense_kv_tensor;
auto kvcache_past_kv_chunks = uu::make_tensor_slice(kvcache_in_tensor,

@esmirno (Contributor), Jan 7, 2026:

By the way, sometimes this make_tensor_slice gets called without the uu namespace, but I don't see any using-namespace declarations, so I would suggest aligning all usages. Also, I see you've commented out the implementation of make_tensor_slice - is this temporary?

Contributor:

Also, try switching to the utils::view helper, as it looks like it fully covers the functionality of make_tensor_slice.

Contributor Author:

I standardized the usage pattern and updated the code to consistently use uu::make_tensor_slice.
Also, there’s a small difference between uu::make_tensor_slice and utils::view:
The last parameter of uu::make_tensor_slice represents the end position, while the last parameter of utils::view represents the slice length. This difference wouldn’t prevent us from switching to utils::view, but to stay consistent with the other functions in llm_infer_request.cpp, I’m keeping uu::make_tensor_slice for now.
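
For illustration only (both helper signatures and the argument order are assumed here, not quoted from the code), the same slice would be expressed roughly as:

    // Both calls are meant to select indices [start, start + len) along dim.
    auto s1 = uu::make_tensor_slice(tensor, dim, start, start + len);  // last argument: end position
    auto s2 = ov::npuw::util::view(tensor, dim, start, len);           // last argument: slice length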

uint8_t* src_ptr = base + src_off;
uint8_t* dst_ptr = base + dst_off;
if (src_ptr != dst_ptr) {
std::memmove(dst_ptr, src_ptr, row_bytes);

Contributor:

So it is not an AVX version but rather uses memmove? OK, if that works we need exact perf data, and I think tests as well for a bunch of actual cases found in LLM workloads.

Contributor Author:

In the earlier ticket about optimizing copy, I tried replacing std::memcpy with an AVX2 implementation, but it resulted in almost no performance improvement. The std::memmove used here relies on essentially the same highly optimized underlying implementation as std::memcpy, so I didn’t pursue an additional AVX2 optimization in this case. I’m still running further measurements, and I’ll share more details once those tests are complete.

@dmatveev modified the milestones: 2026.0, 2026.1 on Jan 26, 2026
@dmatveev (Contributor)

CI keeps failing (probably not related to this PR) and the effect is yet to be measured, so removing the 26.0-related tags to avoid gating the release.

kvcache_past_kv_chunks,
pre_kv_dim,
gen_kv_dim);
} else {

Contributor:

For future readers, we need to add a comment here that in-place copy is not possible when we have v-transpose OFF/ON x-models.

Contributor Author:

Added comments in the latest commit.

// Fallback: last dimension not packed in either src or dst.
// We cannot memmove row_bytes as a contiguous block. Do element-wise memmove.
// ---------------------------------------------------------------------
if (src_strides0[rank0 - 1] != elem_size || dst_strides0[rank0 - 1] != elem_size) {

Contributor:

So, do we really have such cases?
In which situation is it possible?

Also, I'm not sure about the int4 data type - would this check still be valid?

If this is a hypothetical situation, I'd prefer to have an assert instead.
If this is a real situation, please move the implementation to a separate function.

@DingZhangIntel (Contributor Author), Jan 29, 2026:

I think the last dimension is packed in the current situation, so the element-wise fallback was unnecessary/hypothetical. I replaced that branch with OPENVINO_ASSERT, and added an assertion that sub-byte element types (e.g. int4/uint4) are not supported. If the KV cache layout or element type changes in the future, I can introduce a dedicated implementation as a separate function.
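
Roughly what that replacement could look like (a sketch using the identifiers already visible in this thread, not the exact committed code):

    // Sub-byte element types (e.g. i4/u4) are not supported by this in-place copy.
    OPENVINO_ASSERT(src_tensor->get_element_type().bitwidth() % 8 == 0,
                    "copy_inplace: sub-byte element types are not supported");
    // The last dimension is expected to be packed in both src and dst.
    OPENVINO_ASSERT(src_strides0[rank0 - 1] == elem_size && dst_strides0[rank0 - 1] == elem_size,
                    "copy_inplace: expected a packed last dimension");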

return;
}

OPENVINO_ASSERT(src_strides0[rank0 - 1] == elem_size);

Contributor:

I think it is safe for now to assume the kv cache is not int4. But that could change one day.

Contributor Author:

Fixed in the latest commit.

(src_strides0[inverted_idx] == dst_strides0[inverted_idx]);
if (ok) {
cut = inverted_idx;
if (inverted_idx == 0) {

Contributor:

Just make rank0, cut, and inverted_idx int64_t.

Then you can write a clearer loop condition and remove this strange construct.
Btw, I'm not sure you really need it even now. Isn't inverted_idx < rank0 enough?

Contributor Author:

I kept rank0/cut/inverted_idx as size_t to match ov::Shape/ov::Strides and rewrote the reverse loop using for (size_t i = rank0; i-- > 0;), which avoids the strange construct.
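
The rewritten reverse loop could look roughly like this (the per-dimension condition is partly assumed; src_shape0/dst_shape0 are hypothetical names mirroring the existing strides arrays):

    size_t cut = rank0;  // start of the trailing block that can be folded
    for (size_t inverted_idx = rank0; inverted_idx-- > 0;) {
        const bool ok = (src_shape0[inverted_idx] == dst_shape0[inverted_idx]) &&
                        (src_strides0[inverted_idx] == dst_strides0[inverted_idx]);
        if (!ok) {
            break;  // stop folding at the first mismatching dimension
        }
        cut = inverted_idx;
    }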

}

// Fold [cut..rank0-1] into a single last dimension.
ov::Shape shape;

Contributor:

Let's move all the logic related to the folding to a separate function, starting from the loop which calculates cut.
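
A hypothetical shape of the extracted helper (the name and signature are assumptions; the real code may differ):

    // Collapse dimensions [cut .. rank-1] of a shape into a single trailing dimension.
    static ov::Shape fold_trailing_dims(const ov::Shape& shape, size_t cut) {
        size_t folded = 1;
        for (size_t d = cut; d < shape.size(); ++d) {
            folded *= shape[d];
        }
        ov::Shape out;
        out.insert(out.end(), shape.begin(), shape.begin() + cut);
        out.push_back(folded);
        return out;
    }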

Contributor Author:

Done in the latest commit.
