
[https://nvbugs/5919026][fix] Pass sparse_attn_config from effective_draft_config for one-model draft KV cache#12032

Open
chenfeiz0326 wants to merge 2 commits into NVIDIA:main from chenfeiz0326:chenfeiz/fix-bug5919026

Conversation

Collaborator

@chenfeiz0326 chenfeiz0326 commented Mar 9, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Fixed sparse attention configuration handling in multi-token prediction and one-model draft scenarios.
  • Tests

    • Re-enabled performance tests for DeepSeek V32 FP4 configurations that were previously skipped.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@chenfeiz0326 chenfeiz0326 requested a review from QiJune March 9, 2026 10:41
@chenfeiz0326 chenfeiz0326 requested review from a team as code owners March 9, 2026 10:41
@chenfeiz0326
Collaborator Author

/bot run --disable-fail-fast --stage-list "DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-1,DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-2,DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-3,DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-4,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-1,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-2,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-3,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-4,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-5,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-6,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-7"

@coderabbitai
Contributor

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

Walkthrough

Removes a defensive guard in sparse attention indexer preparation that previously skipped setup when kv_cache_manager lacked index_head_dim. Draft KV-cache creation now derives sparse attention config to enable proper handling in multi-token prediction scenarios. Two previously skipped performance tests are re-enabled.

Changes

  • Sparse Attention Config Handling — tensorrt_llm/_torch/attention_backend/sparse/dsa.py, tensorrt_llm/_torch/pyexecutor/_util.py: Removed the early-return guard in Indexer.prepare that skipped indexer preparation when kv_cache_manager was absent or lacked index_head_dim. Draft KV-cache creation now derives sparse_attn_config from the draft model config instead of passing None, enabling sparse attention support in MTP/one-model scenarios.
  • Test Skip Removal — tests/integration/test_lists/waives.txt: Removed two SKIP annotations for DeepSeek V32 FP4 Grace Blackwell performance tests, re-enabling their execution.
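The removed guard in Indexer.prepare can be pictured with a minimal sketch. All class and attribute names here besides index_head_dim are illustrative stand-ins, not the actual TensorRT-LLM API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeKvCacheManager:
    # Hypothetical stand-in for the real KV-cache manager; the actual
    # DSACacheManager carries much more state than this.
    index_head_dim: Optional[int] = None

class Indexer:
    def prepare(self, kv_cache_manager: Optional[FakeKvCacheManager]) -> bool:
        # The old defensive guard skipped setup entirely:
        #
        #   if kv_cache_manager is None or \
        #           getattr(kv_cache_manager, "index_head_dim", None) is None:
        #       return False
        #
        # After the fix, the draft KV cache is built from a config that
        # carries the sparse attention parameters, so index_head_dim is
        # expected to be present and preparation always proceeds.
        assert kv_cache_manager is not None
        assert kv_cache_manager.index_head_dim is not None
        return True
```

With the guard gone, a manager that genuinely lacks index_head_dim now fails loudly during preparation instead of silently skipping indexer setup.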

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check — ⚠️ Warning: the PR description provided by the author is entirely blank; only the template is present, with no actual content filled in. Resolution: add a description explaining what the PR changes, why the changes are needed, and what test coverage exists. The PR objectives indicate this fixes sparse attention config handling for one-model draft KV caches in MTP scenarios; include this context in the description.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: the PR title clearly and specifically describes the main change (passing sparse_attn_config from effective_draft_config for one-model draft KV cache) and references the NVBugs ID.
  • Docstring Coverage — ✅ Passed: docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.


@tensorrt-cicd
Collaborator

PR_Github #38252 [ run ] triggered by Bot. Commit: 0dc5178 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38252 [ run ] completed with state SUCCESS. Commit: 0dc5178
/LLM/main/L0_MergeRequest_PR pipeline #29636 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Collaborator

@QiJune QiJune left a comment


LGTM

@chenfeiz0326
Collaborator Author

/bot run --disable-fail-fast

@chenfeiz0326 chenfeiz0326 enabled auto-merge (squash) March 10, 2026 01:46
@tensorrt-cicd
Collaborator

PR_Github #38345 [ run ] triggered by Bot. Commit: 0dc5178 Link to invocation

ziyixiong-nv and others added 2 commits March 10, 2026 01:00
…draft_config for one-model draft KV cache

In _create_one_model_draft_kv_cache_manager, the sparse_attn_config was
hardcoded to None. However, for MTP with models using sparse attention
(e.g., DeepSeek V3 with DSA), the draft layers share the same architecture
as the target model and need the sparse_attention_config.

The fix gets sparse_attn_config from effective_draft_config, which falls
back to the target model's config for MTP mode. This ensures DSACacheManager
is properly initialized with the required index_head_dim and other parameters.

Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
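
The fix described in the commit message above can be sketched as follows. The config classes and function names here are hypothetical stand-ins for the real ones in _util.py, shown only to illustrate the fallback behavior:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SparseAttentionConfig:
    index_head_dim: int  # required to initialize the DSA cache manager

@dataclass
class ModelConfig:
    sparse_attention_config: Optional[SparseAttentionConfig] = None

def effective_draft_config(draft_config: Optional[ModelConfig],
                           target_config: ModelConfig) -> ModelConfig:
    # In MTP mode the draft layers share the target model's architecture,
    # so the effective draft config falls back to the target config.
    return draft_config if draft_config is not None else target_config

def draft_sparse_attn_config(
        draft_config: Optional[ModelConfig],
        target_config: ModelConfig) -> Optional[SparseAttentionConfig]:
    # Before the fix this value was hardcoded to None when creating the
    # one-model draft KV-cache manager; now it is derived from the
    # effective draft config so sparse attention parameters survive.
    return effective_draft_config(draft_config, target_config).sparse_attention_config
```

In the MTP case there is no separate draft config, so the target model's sparse attention parameters (including index_head_dim) flow through to the draft KV-cache manager instead of None.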
@chenfeiz0326 chenfeiz0326 force-pushed the chenfeiz/fix-bug5919026 branch from 0dc5178 to a2ec3b9 Compare March 10, 2026 08:02
@chenfeiz0326
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38345 [ run ] completed with state SUCCESS. Commit: 0dc5178
/LLM/main/L0_MergeRequest_PR pipeline #29718 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38410 [ run ] triggered by Bot. Commit: a2ec3b9 Link to invocation



4 participants