
Conversation

@liubo-intel (Contributor)

Details:

  • Adds support for a sink input parameter and a sliding window attention mechanism in the PagedAttention implementation, with related test coverage.

    Key changes:
      • Sink input support: extended PagedAttentionExtension to accept an optional sink input as the 21st parameter; updated the executor to handle sink tokens in the attention computation.
      • Sliding window attention: implemented the sliding window mechanism for both the prefill and decode phases.
      • Test coverage: added a comprehensive test suite comparing PA vs SDPA across multiple configurations (f32/bf16, with/without sink, normal/sliding window).
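As a rough illustration of the sink mechanism described above (a hedged sketch, not the plugin's actual kernel code; `softmax_with_sink` and its layout are hypothetical): the per-head sink logit joins the softmax normalizer but carries no value vector, so it only absorbs probability mass.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch (not the executor code): softmax over one row of
// attention scores with an extra per-head "sink" logit. The sink enters the
// denominator but has no value vector, so it only soaks up probability mass.
std::vector<float> softmax_with_sink(const std::vector<float>& scores, float sink) {
    float max_v = sink;
    for (float s : scores)
        max_v = std::max(max_v, s);
    float denom = std::exp(sink - max_v);  // sink term in the normalizer
    std::vector<float> w(scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i) {
        w[i] = std::exp(scores[i] - max_v);
        denom += w[i];
    }
    for (float& x : w)
        x /= denom;  // weights now sum to < 1; the remainder went to the sink
    return w;
}
```

With two equal scores and an equal sink logit, each real position receives one third of the mass and the sink silently takes the rest.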

Tickets:

@liubo-intel liubo-intel requested review from a team as code owners November 26, 2025 10:49
@github-actions github-actions bot added the category: CPU OpenVINO CPU plugin label Nov 26, 2025
@liubo-intel liubo-intel force-pushed the liubo/support_PA_with_sink_input branch from e25dc47 to f42fb29 Compare November 26, 2025 11:16
@liubo-intel liubo-intel requested a review from Copilot November 27, 2025 01:25
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds comprehensive support for sink input parameters and sliding window attention mechanisms in the PagedAttention implementation. The changes enable the system to handle attention patterns with constrained context windows and special sink tokens that remain accessible throughout the attention computation.

Key changes:

  • Extended PagedAttentionExtension to accept an optional sink input as the 21st parameter
  • Implemented sliding window attention for both prefill and decode phases with proper masking logic
  • Added test coverage comparing PagedAttention vs SDPA across multiple configurations (f32/bf16, with/without sink, normal/sliding window)
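The sliding-window masking mentioned above can be sketched roughly as follows (a hypothetical illustration, not the plugin's implementation; `sliding_window_mask` is an invented helper): query position i may attend key position j only when j <= i (causal) and i - j < window.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a causal sliding-window mask (not the plugin's
// code): query i may attend key j only if j <= i and i - j < window.
// window == 0 is treated here as "no window limit".
std::vector<std::vector<bool>> sliding_window_mask(std::size_t len, std::size_t window) {
    std::vector<std::vector<bool>> mask(len, std::vector<bool>(len, false));
    for (std::size_t i = 0; i < len; ++i)
        for (std::size_t j = 0; j <= i; ++j)
            mask[i][j] = (window == 0) || (i - j < window);
    return mask;
}
```

For length 4 and window 2, query 3 sees keys 2 and 3 but not key 1, and no query ever sees a future key.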

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File / Description

src/plugins/intel_cpu/tests/functional/custom/subgraph_tests/src/x64/paged_attn.cpp: Added comprehensive test infrastructure for sink input and sliding window attention, including SDPA reference model updates and mask generation logic
src/plugins/intel_cpu/src/nodes/paged_attn.cpp: Removed validation blocking sink input support and corrected the precision type for the sink tensor descriptor
src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp: Integrated sink input handling and sliding window logic into attention kernel execution paths


@yuxu42 yuxu42 requested a review from zhangYiIntel November 27, 2025 03:01
kCachePrecision.to_string() + " value cache prec " + vCachePrecision.to_string();
return false;
}
}
Contributor

If the ARM platform doesn't support the sink input, maybe we still need a check here.

const ScoreAggregationInfo* score_info_ptr) {
const ScoreAggregationInfo* score_info_ptr,
size_t batch_in_seq,
const PlainTensor& sinks) {
Contributor

No need to add the batch_in_seq parameter; in the init phase, the code ensures that sink has shape {1, H, 1, 1}.

Contributor Author

Hi @zhangYiIntel: considering the sinks batch dimension, I agree that shape {1, H, 1, 1} (and the corresponding address sinks.ptr<float>(0, h, 0, 0)) should work fine functionally, but from the spec it seems only the last dimension is fixed to 1. How about we keep the current dynamic dimension support in case separate per-batch sinks are needed in the future?

Contributor

@maxnick Hi Maksim, what's your opinion? Why do we check that sink has shape {1, H, 1, 1}?

Contributor

The sink input in GPT-OSS is a constant of shape {1, H, 1, 1}. And this is not a coincidence: the batch (0th) and sequence (2nd) dimensions are in fact dynamic, and we can't have dynamic dimensions in a constant tensor, so we broadcast the 0th and 2nd axes of this sink constant. Therefore, as long as the sink input is expected to be constant, it's logical to assert a {1, H, 1, 1} sinks tensor shape. The real question is: are we going to support a variable sink input? If not, then we leave this shape assert as it is, but it also makes sense to add a sanity check verifying that this input is indeed constant.
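To make the broadcast concrete (a hypothetical sketch, not the actual node code; `sink_for` and the flat layout are illustrative): with a constant sink of shape {1, H, 1, 1}, axes 0 and 2 broadcast, so any (batch, head, position) lookup collapses to indexing by head alone, which is the same effect as the suggested sinks.ptr<float>(0, h, 0, 0).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a {1, H, 1, 1} sink tensor stored flat. Because the
// batch (axis 0) and sequence (axis 2) dimensions are broadcast, the lookup
// for any (b, h, s) position reduces to indexing by the head h alone.
float sink_for(const std::vector<float>& sinks, std::size_t H, std::size_t h) {
    assert(sinks.size() == H);  // shape {1, H, 1, 1} flattened to H floats
    assert(h < H);
    return sinks[h];            // axes 0 and 2 broadcast away
}
```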

const PlainTensor& alibi_slopes,
float* score_output) {
float* score_output,
size_t batch_in_seq,
Contributor

Same as above.

? _d_scale * reinterpret_cast<float*>(_quantized_q.ptr<int8_t>(ithr, m - q_start, 0))[0]
: _d_scale;

// sink processing is independent of sliding_window size
Contributor

Since sink has shape 1, H, 1, 1, we could simplify the code to

sink = sinks.ptr<float>(0, h, 0, 0);

nullptr,
alibi_slope);
float* sink = nullptr;
if (sinks) {
Contributor

Since sink has shape 1, H, 1, 1, we could simplify the code to

sink = sinks.ptr<float>(0, h, 0, 0);

nullptr,
alibi_slope);
float* sink = nullptr;
if (sinks) {
Contributor

Since sink has shape 1, H, 1, 1, we could simplify the code to

sink = sinks.ptr<float>(0, h, 0, 0);

alibi_slopes,
score_output);
score_output,
batch_in_seq,
Contributor

Remove batch_in_seq

q_start_idx_score,
score_info_ptr);
score_info_ptr,
0,
Contributor

Remove batch_in_seq

q_start_idx_score,
score_info_ptr);
score_info_ptr,
batch_in_seq,
Contributor

Same as above.

::testing::ValuesIn(additional_configs)),
PagedAttnTestBase::getTestCaseName);

// Sliding window test with same shapes as normal test
Contributor

The only difference between smoke_PagedAttnVSSDPATest and smoke_PagedAttnVSSDPATest_WithSlidingWindow is the sliding window, so why not combine them like:

INSTANTIATE_TEST_SUITE_P(smoke_PagedAttnVSSDPATest,
                         PagedAttnVSSDPATest,
                         ::testing::Combine(::testing::Values(ElementType::f32, ElementType::bf16),
                                            ::testing::ValuesIn(inputShapeAndReorders),
                                            ::testing::Values(true, false),
                                            ::testing::Values(true, false),
                                            ::testing::Values(0, 8),  // sliding_window = 0, 8
                                            ::testing::ValuesIn(additional_configs)),
                         PagedAttnTestBase::getTestCaseName);

Contributor Author

Currently we use a separate smoke_PagedAttnVSSDPATest_WithSlidingWindow test case, because if we combined it into PagedAttnVSSDPATest we would need to generate the mask separately for fp32, bf16 (and fp16); what's more, the number of cases in this test suite would increase from 25 to 40.
Since the PA sliding window logic is a somewhat separate feature from the input data type and the other test parameters, how about we use a single separate test case to cover this logic and keep the test suite light?

@zhangYiIntel (Contributor) left a comment

Generally looks good, please fix the comments

@liubo-intel liubo-intel force-pushed the liubo/support_PA_with_sink_input branch from 7407cb0 to 0a391b2 Compare November 30, 2025 03:01

Labels

category: CPU OpenVINO CPU plugin


3 participants