[NPUW] Implement inplace kv cache copy when it's shared #33201
DingZhangIntel wants to merge 19 commits into openvinotoolkit:master
Conversation
@esmirno @AlexanderKalistratov please review
dmatveev left a comment
From what I can tell there may be tests required for this function?
Also can this benefit even more if it gets an AVX2 dispatch similar to our unpack group?
AlexanderKalistratov left a comment
- Please add tests
- It feels overcomplicated. I don't think we really need 3 different implementations.
First of all, we need to assume that we are always contiguous on the last dimension, so we can always do std::memmove on the last dimension.
Then implement a generic inplace copy for an arbitrary number of dimensions.
Then, if the innermost dimension of the src and dst tensors is equal, reshape the tensors to a lesser number of dimensions:
[1, D01, D02, D03], [1, D11, D12, D13]
If D03 == D13:
reshape [1, D01, D02, D03] => [1, D01, D02*D03]
reshape [1, D11, D12, D13] => [1, D11, D12*D13]
So we would have only one generic implementation.
The generic implementation could be based on copy_inplace_columns_by_row_chunks, but it should take the other dimensions' strides/sizes into consideration to calculate the correct offset for each line.
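For illustration, a minimal sketch of such a single generic implementation (free-standing code with placeholder names, not the PR's actual function; it assumes byte strides, a packed last dimension, and that each row's destination offset is not greater than its source offset, so a forward pass over the shared buffer is safe):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Walk every "row" (all index combinations except the last dimension), compute the
// source/destination byte offsets from the respective strides, and memmove one packed
// row at a time. memmove is used because src and dst share the same buffer.
void copy_inplace_generic(uint8_t* base,
                          const std::vector<std::size_t>& shape,        // common copy shape
                          const std::vector<std::size_t>& src_strides,  // byte strides
                          const std::vector<std::size_t>& dst_strides,  // byte strides
                          std::size_t elem_size) {
    const std::size_t rank = shape.size();
    const std::size_t row_bytes = shape[rank - 1] * elem_size;

    // Number of rows = product of all dimensions except the last one.
    std::size_t rows = 1;
    for (std::size_t d = 0; d + 1 < rank; ++d) {
        rows *= shape[d];
    }

    std::vector<std::size_t> idx(rank, 0);  // multi-index over the outer dimensions
    for (std::size_t r = 0; r < rows; ++r) {
        std::size_t src_off = 0, dst_off = 0;
        for (std::size_t d = 0; d + 1 < rank; ++d) {
            src_off += idx[d] * src_strides[d];
            dst_off += idx[d] * dst_strides[d];
        }
        if (src_off != dst_off) {
            std::memmove(base + dst_off, base + src_off, row_bytes);
        }
        // Advance the multi-index (odometer-style) over the outer dimensions.
        for (std::size_t d = rank - 1; d-- > 0;) {
            if (++idx[d] < shape[d]) {
                break;
            }
            idx[d] = 0;
        }
    }
}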
}
}

void ov::npuw::util::copy_inplace_columns_by_row_chunks(ov::SoPtr<ov::ITensor> src, ov::SoPtr<ov::ITensor>& dst) {
Suggested change: void ov::npuw::util::copy_inplace_columns_by_row_chunks(const ov::SoPtr<ov::ITensor>& src, ov::SoPtr<ov::ITensor>& dst)
Fixed in the latest commit.
}
}

void ov::npuw::util::copy_inplace_by_planes(ov::SoPtr<ov::ITensor> src_tensor, ov::SoPtr<ov::ITensor> dst_tensor) {
Suggested change: void ov::npuw::util::copy_inplace_by_planes(const ov::SoPtr<ov::ITensor>& src_tensor, ov::SoPtr<ov::ITensor>& dst_tensor)

// Requirements:
// - kv_dim_src == kv_dim_dst, otherwise throws
// - src_tensor->data() == dst_tensor->data()
void ov::npuw::util::copy_tensor_inplace_by_dim(ov::SoPtr<ov::ITensor> src_tensor,
Suggested change: void ov::npuw::util::copy_tensor_inplace_by_dim(const ov::SoPtr<ov::ITensor>& src_tensor, ov::SoPtr<ov::ITensor>& dst_tensor, ...

}
}

void ov::npuw::util::copy_inplace(ov::SoPtr<ov::ITensor> src_tensor, ov::SoPtr<ov::ITensor> dst_tensor) {
Suggested change: void ov::npuw::util::copy_inplace(const ov::SoPtr<ov::ITensor>& src_tensor, ov::SoPtr<ov::ITensor>& dst_tensor)

// FIXME: Implement XARCH::unpack
LOG_INFO("######################## unpack_f8f16");
unpack_f8f16(from, scale, to, unpack_options);
//ov::npuw::util::XARCH::unpack_f8f16_scale(from, scale, to, unpack_options);
This is part of another PR, I guess; please remove.
Thanks for the reminder; removed that part.
auto t_start = std::chrono::high_resolution_clock::now();
copy_kvcache();
// End counting time.
auto t_end = std::chrono::high_resolution_clock::now();
Please use utils like the profiler and ms_to_run:
// Quick-and-dirty profiling
using MS = ov::npuw::perf::metric<ov::npuw::perf::MSec>;
using B = ov::npuw::perf::counter<ov::npuw::perf::Bytes>;
MS m_ms_unpack;
ov::npuw::perf::Profile<MS> m_profile;
mutable ov::npuw::perf::Profile<B> m_footprint;
m_ms_unpack += ov::npuw::perf::ms_to_run([&](){
ov::parallel_for(closure_copy_required.size(), [&](std::size_t j) {
auto cidx = closure_copy_required[j];
auto& closure = desc_closure[cidx];
const auto closure_param_id = comp_model_desc.param_base + cidx;
auto& iport = func_desc.compiled_model->inputs()[closure_param_id];
auto clparam = request->get_tensor(iport);
ov::get_tensor_impl(closure)->copy_to(clparam._ptr);
});
}); // ms_to_run
// This is necessary because subsequent copy operations would overwrite the shared buffer
auto prefill_past_kv = m_prefill_request->get_tensor(m_prefill_in_ports.at(input_name));
ov::SoPtr<ov::ITensor> tmp_dense_kv_tensor;
auto kvcache_past_kv_chunks = uu::make_tensor_slice(kvcache_in_tensor,
Btw, sometimes make_tensor_slice gets called without the uu namespace, but I don't see any using-namespace declarations, so I would suggest aligning all usages. Also, as I see, you've commented out the implementation of make_tensor_slice; is this temporary?
Also, try switching to the utils::view helper, as it looks like it fully covers the functionality of make_tensor_slice.
I standardized the usage pattern and updated the code to consistently use uu::make_tensor_slice.
Also, there’s a small difference between uu::make_tensor_slice and utils::view:
The last parameter of uu::make_tensor_slice represents the end position, while the last parameter of utils::view represents the slice length. This difference wouldn’t prevent us from switching to utils::view, but to stay consistent with the other functions in llm_infer_request.cpp, I’m keeping uu::make_tensor_slice for now.
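For illustration, a tiny self-contained sketch of the two slicing conventions using hypothetical stand-in structs (the real uu::make_tensor_slice and utils::view signatures may differ):

#include <cstddef>

// End-position convention (as described above for uu::make_tensor_slice): [start, end)
struct SliceByEnd { std::size_t start, end; };
// Length convention (as described above for utils::view): [start, start + length)
struct SliceByLen { std::size_t start, length; };

// Converting between the two is a one-liner, so switching helpers later stays cheap.
inline SliceByLen to_len_based(SliceByEnd s) { return {s.start, s.end - s.start}; }
inline SliceByEnd to_end_based(SliceByLen s) { return {s.start, s.start + s.length}; }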
uint8_t* src_ptr = base + src_off;
uint8_t* dst_ptr = base + dst_off;
if (src_ptr != dst_ptr) {
    std::memmove(dst_ptr, src_ptr, row_bytes);
So it is not an AVX version but rather uses memmove? OK, if that works we need exact perf data, and I think tests as well for a bunch of actual cases found in LLM workloads.
In the earlier ticket about optimizing copy, I tried replacing std::memcpy with an AVX2 implementation, but it resulted in almost no performance improvement. The std::memmove used here relies on essentially the same highly optimized underlying implementation as std::memcpy, so I didn't pursue an additional AVX2 optimization in this case. I'm still running further measurements, and I'll share more details once those tests are complete.
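A minimal sketch of the kind of standalone measurement that could back this up, using plain std::chrono instead of the npuw perf utilities; the buffer sizes and iteration counts are arbitrary placeholders:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

int main() {
    // Shared buffer hosting src and dst regions, roughly mimicking a KV-cache row-chunk compaction.
    constexpr std::size_t row_bytes = 64 * 128;  // placeholder row size
    constexpr std::size_t rows = 1024;           // placeholder row count
    constexpr int passes = 100;
    std::vector<uint8_t> buf(rows * row_bytes + row_bytes, 0);

    const auto t0 = std::chrono::high_resolution_clock::now();
    for (int p = 0; p < passes; ++p) {
        // Shift every row one row_bytes towards the front; the regions here are adjacent,
        // but memmove would also handle the genuinely overlapping case.
        for (std::size_t r = 0; r < rows; ++r) {
            std::memmove(buf.data() + r * row_bytes, buf.data() + (r + 1) * row_bytes, row_bytes);
        }
    }
    const auto t1 = std::chrono::high_resolution_clock::now();
    std::cout << "memmove compaction: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count() / passes
              << " ms per pass\n";
    return 0;
}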
CI keeps failing (probably not related to this PR) and the effect is yet to be measured, so I'm removing the 26.0-related tags to avoid gating the release.
kvcache_past_kv_chunks,
pre_kv_dim,
gen_kv_dim);
} else {
For future readers, we need to add a comment here that in-place copy is not possible when we have v-transpose OFF/ON x-models.
// Fallback: last dimension not packed in either src or dst.
// We cannot memmove row_bytes as a contiguous block. Do element-wise memmove.
// ---------------------------------------------------------------------
if (src_strides0[rank0 - 1] != elem_size || dst_strides0[rank0 - 1] != elem_size) {
So, do we really have such cases?
In which situation is it possible?
Also, I'm not sure about the int4 data type; would this check still be valid?
If this is a hypothetical situation, I'd prefer to have an assert instead.
If this is a real situation, please move the implementation to a separate function.
I think the last dimension is packed in the current situation, so the element-wise fallback was unnecessary/hypothetical. I replaced that branch with OPENVINO_ASSERT, and added an assertion that sub-byte element types (e.g. int4/uint4) are not supported. If the KV cache layout or element type changes in the future, I can introduce a dedicated implementation as a separate function.
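A minimal sketch of what such assertions could look like, assuming byte strides and a packed last dimension; the helper name and parameters are placeholders, not the PR's actual code:

#include <cstddef>
#include <openvino/core/except.hpp>
#include <openvino/core/type/element_type.hpp>

// Guard against layouts this in-place copy does not support.
void check_copy_supported(const ov::element::Type& elem_type,
                          std::size_t src_last_stride_bytes,
                          std::size_t dst_last_stride_bytes) {
    OPENVINO_ASSERT(elem_type.bitwidth() >= 8,
                    "Sub-byte element types (e.g. i4/u4) are not supported by the in-place KV cache copy");
    const std::size_t elem_size = elem_type.size();
    OPENVINO_ASSERT(src_last_stride_bytes == elem_size && dst_last_stride_bytes == elem_size,
                    "Expected the last dimension of both src and dst to be densely packed");
}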
    return;
}

OPENVINO_ASSERT(src_strides0[rank0 - 1] == elem_size);
I think it is safe for now to assume the KV cache is not int4. But it could change one day.
Fixed in the latest commit.
(src_strides0[inverted_idx] == dst_strides0[inverted_idx]);
if (ok) {
    cut = inverted_idx;
    if (inverted_idx == 0) {
Just make rank0, cut and inverted_idx int64_t. Then you can write a clearer loop condition and remove this strange construct.
Btw, I'm not sure if you really need it even now. Is inverted_idx < rank0 not enough?
I kept rank0/cut/inverted_idx as size_t to match ov::Shape/ov::Strides and rewrote the reverse loop using for (size_t i = rank0; i-- > 0;), which avoids the strange construct.
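For reference, a small self-contained illustration of that unsigned reverse-loop idiom (a generic example, not the PR's actual loop):

#include <cstddef>
#include <vector>

// `i-- > 0` tests before decrementing, so the body sees rank-1, rank-2, ..., 0 and the
// unsigned index never wraps around. Returns the innermost dimension where a and b differ,
// or the rank if they are identical.
std::size_t innermost_mismatch(const std::vector<std::size_t>& a, const std::vector<std::size_t>& b) {
    const std::size_t rank = a.size();
    for (std::size_t i = rank; i-- > 0;) {
        if (a[i] != b[i]) {
            return i;
        }
    }
    return rank;
}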
}

// Fold [cut..rank0-1] into a single last dimension.
ov::Shape shape;
Let's move all the logic related to the folding to a separate function, starting from the loop which calculates cut.
Done in the latest commit.
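A minimal sketch of what such a folding helper could look like, assuming byte strides, a shared logical copy shape, and the standard dimension-coalescing condition (which may differ from the PR's exact criterion); the name and signature are hypothetical:

#include <cstddef>
#include <utility>
#include <vector>

// Find the outermost index `cut` such that dimensions [cut..rank-1] are contiguous in both
// src and dst (stride[i] == stride[i+1] * shape[i+1]), then fold them into one trailing
// dimension so the copy loop only has to deal with the remaining outer dimensions.
std::pair<std::vector<std::size_t>, std::size_t> fold_trailing_dims(
        const std::vector<std::size_t>& shape,
        const std::vector<std::size_t>& src_strides,   // byte strides
        const std::vector<std::size_t>& dst_strides) { // byte strides
    const std::size_t rank = shape.size();
    std::size_t cut = rank - 1;  // the last dimension always folds into itself
    for (std::size_t i = rank - 1; i-- > 0;) {
        const bool src_contig = (src_strides[i] == src_strides[i + 1] * shape[i + 1]);
        const bool dst_contig = (dst_strides[i] == dst_strides[i + 1] * shape[i + 1]);
        if (!src_contig || !dst_contig) {
            break;
        }
        cut = i;
    }
    // Folded shape: outer dimensions kept as-is, trailing dimensions multiplied together.
    std::vector<std::size_t> folded(shape.begin(), shape.begin() + cut);
    std::size_t tail = 1;
    for (std::size_t i = cut; i < rank; ++i) {
        tail *= shape[i];
    }
    folded.push_back(tail);
    return {folded, cut};
}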
Details:
Implement inplace kv cache copy when it's shared
Tickets:
EISW-194492