Enable U8 KV caching in SDPA operator for ARM #33567
ashwins990 wants to merge 4 commits into openvinotoolkit:master
Conversation
@alvoron, could you please review?
Pull request overview
This PR enables U8 (uint8) key-value cache precision for the SDPA (Scaled Dot-Product Attention) operator on ARM architectures and provides optimized implementations using NEON and SVE instructions. The change improves performance over the reference implementation by 27% while maintaining memory efficiency through quantization, though it incurs a 2.7% overhead compared to F16 for smaller, compute-bound models.
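As background for the comments below, here is a minimal scalar sketch of the grouped u8 quantization scheme being enabled (the group layout, function name, and scale/zero-point storage are illustrative assumptions, not the PR's actual kernels):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Illustrative sketch: quantize one row of the KV cache to u8 in groups of
// `group_size` elements (assumes n is a multiple of group_size). Each group
// stores its own scale and zero-point so that dequantized = (q - zp) * scale.
void quantize_row_u8(const float* src, uint8_t* dst,
                     float* scales, float* zps,
                     size_t n, size_t group_size) {
    for (size_t g = 0; g < n / group_size; ++g) {
        const float* grp = src + g * group_size;
        float mn = grp[0];
        float mx = grp[0];
        for (size_t i = 1; i < group_size; ++i) {   // the find_minmax step
            mn = std::min(mn, grp[i]);
            mx = std::max(mx, grp[i]);
        }
        const float scale = (mx - mn) / 255.0f;
        const float zp = (scale != 0.0f) ? -mn / scale : 0.0f;
        scales[g] = scale;
        zps[g] = zp;
        for (size_t i = 0; i < group_size; ++i) {
            const float q = (scale != 0.0f) ? grp[i] / scale + zp : 0.0f;
            dst[g * group_size + i] =
                static_cast<uint8_t>(std::clamp(q, 0.0f, 255.0f));
        }
    }
}
```

As the file descriptions below indicate, the PR's NEON/SVE kernels vectorize the min/max search, the quantization itself, and the dequantizing dot-product and accumulation steps.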
Changes:
- Added U8 KV cache quantization/dequantization support with ARM NEON and SVE optimizations
- Implemented specialized dot product and accumulation functions for U8 precision with grouped quantization
- Extended CMake build configuration to include NEON_FP16 compilation target
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/mha_single_token.cpp | Adds U8 KV cache support with optimized SIMD implementations for dot products and value accumulation |
| src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant_kernel.hpp | Implements ARM NEON/SVE optimized min/max finding for quantization operations |
| src/plugins/intel_cpu/CMakeLists.txt | Adds NEON_FP16 architecture target for cross-compilation |
Referenced lines:

```cpp
svfloat16_t a0 = svld1_f16(pg_b16, _a + i);
svfloat16_t a1 = svld1_f16(pg_b16, _a + i + offset + svcnth());
```

The variable a1 loads from _a + i + offset + svcnth() but should load from _a + offset + i + svcnth() to maintain consistent indexing with the corresponding b1 load on line 905 and the usage pattern throughout this loop.

Suggested change:

```cpp
svfloat16_t a0 = svld1_f16(pg_b16, _a + offset + i);
svfloat16_t a1 = svld1_f16(pg_b16, _a + offset + i + svcnth());
```
Referenced lines:

```cpp
size_t offset = group_id * group_size;
float16_t group_scale = *(scale + group_id * 2);
float16_t group_zp = *(zp + group_id * 2);
while (group_id < n / group_size) {
    float16_t group_sum = 0.0f;
    i = 0;
```

Variables offset, group_scale, and group_zp are initialized before the while loop but never updated inside it. These should be moved inside the loop body after line 1090 to ensure they are recalculated for each group iteration.

Suggested change:

```cpp
while (group_id < n / group_size) {
    float16_t group_sum = 0.0f;
    i = 0;
    const size_t offset = group_id * group_size;
    const float16_t group_scale = *(scale + group_id * 2);
    const float16_t group_zp = *(zp + group_id * 2);
```
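For context on this suggestion, the corrected control flow looks roughly like the following scalar sketch (the function name and float accumulator are illustrative; the stride of 2 on scale/zp mirrors the interleaved layout in the snippet above):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative sketch of a per-group dequantizing dot product: offset, scale,
// and zero-point must be re-read at the top of every group iteration.
float dot_u8_grouped(const float* a, const uint8_t* b,
                     const float* scale, const float* zp,
                     size_t n, size_t group_size) {
    float result = 0.0f;
    for (size_t group_id = 0; group_id < n / group_size; ++group_id) {
        const size_t offset = group_id * group_size;    // recomputed per group
        const float group_scale = scale[group_id * 2];  // interleaved layout
        const float group_zp = zp[group_id * 2];
        float group_sum = 0.0f;
        for (size_t i = 0; i < group_size; ++i) {
            group_sum += a[offset + i] * (static_cast<float>(b[offset + i]) - group_zp);
        }
        result += group_sum * group_scale;
    }
    return result;
}
```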
Referenced lines:

```cpp
if constexpr (std::is_same_v<T, ov::float16>) {
    auto v_max = vdupq_n_f16(max);
    auto v_min = vdupq_n_f16(min);
    for (; i + 8 < n; i += 8) {
```

The loop condition should be i + 8 <= n instead of i + 8 < n to process all complete 8-element vectors and be consistent with the float32 version on line 150, which uses i + 4 <= n.

Suggested change:

```cpp
for (; i + 8 <= n; i += 8) {
```
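To illustrate why the inclusive bound matters, here is a scalar stand-in for the vectorized min/max loop (the lane width of 8 mirrors the fp16 NEON path; intrinsics are omitted and a scalar tail is assumed):

```cpp
#include <algorithm>
#include <cstddef>

// With `i + 8 < n`, the last complete 8-element block is pushed into the
// scalar tail whenever n is an exact multiple of 8; `i + 8 <= n` keeps every
// complete block in the (vectorized) main loop. Assumes n >= 1.
void find_minmax_sketch(const float* src, size_t n, float& mn, float& mx) {
    mn = src[0];
    mx = src[0];
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {          // stands in for the NEON main loop
        for (size_t j = 0; j < 8; ++j) {
            mn = std::min(mn, src[i + j]);
            mx = std::max(mx, src[i + j]);
        }
    }
    for (; i < n; ++i) {                  // scalar tail for the remainder
        mn = std::min(mn, src[i]);
        mx = std::max(mx, src[i]);
    }
}
```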
Referenced hunk:

```diff
@@ -118,6 +118,57 @@ void find_minmax(const T* src, size_t n, float& min, float& max) {
hmin(v0_min);
max = _mm256_cvtss_f32(v0_max);
min = _mm256_cvtss_f32(v0_min);
#elif defined(OPENVINO_ARCH_ARM64)
```
question to @maxnick: do we need to add a comment that ARM behavior differs from x86? The ARM path uses an fp16 accumulator while the x86 path uses fp32.
A comment would definitely be helpful.
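A small standalone illustration of the accumulator difference being discussed (didactic only, assuming an AArch64 toolchain where `__fp16` is available): summing many fp16 values in an fp16 accumulator loses low-order contributions much earlier than accumulating in fp32.

```cpp
#include <cstdio>

int main() {
    // __fp16 is the AArch64 GCC/Clang storage type for IEEE fp16 (compiler extension).
    __fp16 acc16 = 0.0f;
    float acc32 = 0.0f;
    for (int i = 0; i < 4096; ++i) {
        acc16 = static_cast<__fp16>(acc16 + static_cast<__fp16>(0.9f));  // fp16-style accumulation (ARM path)
        acc32 += 0.9f;                                                   // fp32 accumulation (x86 path)
    }
    // Once the fp16 running sum passes ~2048, adding 0.9 no longer changes it,
    // while the fp32 sum stays close to the exact 3686.4.
    std::printf("fp16 acc = %f, fp32 acc = %f\n", static_cast<float>(acc16), acc32);
    return 0;
}
```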
build_jenkins
@ashwins990 could you please rebase the branch to pick up some fixes required to pass CI?
@abhijain1204fujitsu could you please cover these changes with functional tests?
Force-pushed 32fa83a to 930d7b0
Hi @alvoron,
build_jenkins |
Hi @alvoron, I believe the reason for the failure is: Is there any way to handle such scenarios, where the reference has different behaviour?
We can try to tune threshold in |
Force-pushed 930d7b0 to b10d1c6
[About]
This PR enables u8 kv cache precision for the SDPA operator and optimizes it with NEON and SVE.
It improves performance over the OSS master version [where only the reference implementation is available] by 27%.
However, we are 2.7% slower than the non-quantized f16 cache precision, due to the additional overhead of quantization and dequantization, for smaller models like TinyLlama-1.1B in the single-inference case.
The performance benefit from u8 quantization shows up only when inference is memory bound. We see speedups of around 3-5% when inferencing an int8-quantized LLama-70B model in the single-inference case.
Therefore, even though we achieve a 27% speedup over the reference implementation, we assume the general case to be compute bound and currently keep the default as F16.
As models get larger and in multi-batch scenarios, setting the kv cache precision to "u8" gives a significant boost at the inference level (see the sketch below).
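For reference, selecting the u8 KV cache at compile time would look roughly like this (a sketch assuming the standard ov::hint::kv_cache_precision property exposed by the CPU plugin; the model path is a placeholder):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path

    // Request u8 KV cache precision; the default remains f16 per this PR.
    auto compiled = core.compile_model(
        model, "CPU", ov::hint::kv_cache_precision(ov::element::u8));

    auto request = compiled.create_infer_request();
    // ... set inputs and run generation as usual ...
    return 0;
}
```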
Single-inference performance on the LLAMA2-7B model on a 32c Graviton machine.
The values are in TPS [tokens per second].
This work was contributed by @ashwins990 & @abhijain1204fujitsu