[CPU] Add PagedAttention Support for Sink Input and decode phases Sliding Window #33035
base: master
Conversation
e25dc47 to f42fb29
Pull request overview
This PR adds comprehensive support for sink input parameters and sliding window attention mechanisms in the PagedAttention implementation. The changes enable the system to handle attention patterns with constrained context windows and special sink tokens that remain accessible throughout the attention computation.
Key changes:
- Extended PagedAttentionExtension to accept sink input as the 21st parameter, making it a required input
- Implemented sliding window attention for both prefill and decode phases with proper masking logic
- Added test coverage comparing PagedAttention vs SDPA across multiple configurations (f32/bf16, with/without sink, normal/sliding window)
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/plugins/intel_cpu/tests/functional/custom/subgraph_tests/src/x64/paged_attn.cpp | Added comprehensive test infrastructure for sink input and sliding window attention, including SDPA reference model updates and mask generation logic |
| src/plugins/intel_cpu/src/nodes/paged_attn.cpp | Removed validation blocking sink input support and corrected precision type for sink tensor descriptor |
| src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp | Integrated sink input handling and sliding window logic into attention kernel execution paths |
kCachePrecision.to_string() + " value cache prec " + vCachePrecision.to_string();
        return false;
    }
}
If the ARM platform doesn't support sink input, maybe we still need a check here.
const ScoreAggregationInfo* score_info_ptr) {
const ScoreAggregationInfo* score_info_ptr,
size_t batch_in_seq,
const PlainTensor& sinks) {
No need to add the batch_in_seq parameter; in the init phase, the code ensures that sink has shape {1, H, 1, 1}.
sinks.assert_dims({1, H, 1, 1});
Hi @zhangYiIntel: considering the sinks batch dimension, I agree that shape {1, H, 1, 1} (and the following address, sinks.ptr<float>(0, h, 0, 0)) should work fine functionally. But from the spec, it seems only the last dimension is fixed to 1. How about we keep the current dynamic dimension support, in case per-batch separate sinks are needed in the future?
@maxnick Hi Maksim, what's your opinion? Why do we check that sink has shape {1, H, 1, 1}?
The sink input in GPT-OSS is a constant of shape {1, H, 1, 1}. This is not a coincidence: B and H are in fact dynamic dimensions, and we can't have them in a constant tensor, so we broadcast the 0th and 2nd axes of this sink constant. Therefore, as long as the sink input is expected to be constant, it's logical to assert the {1, H, 1, 1} sinks tensor shape. The real question is: are we going to support a variable sink input? If not, then we leave this shape assert as it is, but it also makes sense to add a sanity check verifying that this input is indeed constant.
const PlainTensor& alibi_slopes,
float* score_output) {
float* score_output,
size_t batch_in_seq,
Same as above.
? _d_scale * reinterpret_cast<float*>(_quantized_q.ptr<int8_t>(ithr, m - q_start, 0))[0]
: _d_scale;

// sink processing is independent of sliding_window size
Since sink has shape {1, H, 1, 1}, we could simplify the code to:
sink = sinks.ptr<float>(0, h, 0, 0);
nullptr,
alibi_slope);
float* sink = nullptr;
if (sinks) {
Since sink has shape {1, H, 1, 1}, we could simplify the code to:
sink = sinks.ptr<float>(0, h, 0, 0);
nullptr,
alibi_slope);
float* sink = nullptr;
if (sinks) {
Since sink has shape {1, H, 1, 1}, we could simplify the code to:
sink = sinks.ptr<float>(0, h, 0, 0);
alibi_slopes,
score_output);
score_output,
batch_in_seq,
Remove batch_in_seq
q_start_idx_score,
score_info_ptr);
score_info_ptr,
0,
Remove batch_in_seq
q_start_idx_score,
score_info_ptr);
score_info_ptr,
batch_in_seq,
Same as above.
::testing::ValuesIn(additional_configs)),
PagedAttnTestBase::getTestCaseName);

// Sliding window test with same shapes as normal test
The only difference between smoke_PagedAttnVSSDPATest and smoke_PagedAttnVSSDPATest_WithSlidingWindow is the sliding_window, so why not combine them, like:
INSTANTIATE_TEST_SUITE_P(smoke_PagedAttnVSSDPATest,
PagedAttnVSSDPATest,
::testing::Combine(::testing::Values(ElementType::f32, ElementType::bf16),
::testing::ValuesIn(inputShapeAndReorders),
::testing::Values(true, false),
::testing::Values(true, false),
::testing::Values(0, 8), // sliding_window = 0, 8
::testing::ValuesIn(additional_configs)),
PagedAttnTestBase::getTestCaseName);
Currently we use a separate smoke_PagedAttnVSSDPATest_WithSlidingWindow test case because, if we combined it into PagedAttnVSSDPATest, we would need to generate the mask separately for fp32, bf16, (fp16). What's more, the number of tests in this suite would increase from 25 to 40.
Since the PA sliding-window logic is somewhat separate from the input data type and the other test parameters, how about we use a single separate test case to cover this logic and keep the test suite light?
Generally looks good, please fix the comments
7407cb0 to 0a391b2
Details:
Key Changes
Sink Input Support: Extended PagedAttentionExtension to accept optional sink input as the 21st parameter, updated executor to handle sink tokens in attention computation
Sliding Window Attention: Implemented sliding window mechanism for both prefill and decode phases
Test Coverage: Added comprehensive test suite comparing PA vs SDPA across multiple configurations (f32/bf16, with/without sink, normal/sliding window)
Tickets: