[None][fix] Split mContextChunkSize into per-target/draft fields #12058
Hrithvik-Alex wants to merge 1 commit into NVIDIA:main
Conversation
Split the shared mContextChunkSize into mContextChunkSizeTarget and mContextChunkSizeDraft to prevent the draft KV cache manager from clobbering the target's chunk size during setPrepopulatedPromptLen, which could floor the target chunk size to 0 and cause q.numel()==0 assertion failures in MLA forward_context.

Signed-off-by: Hrithvik <hrithvik.alex@baseten.co>
Made-with: Cursor
📝 Walkthrough
The changes refactor internal context chunk size tracking in the LLM request class by splitting the single shared chunk-size field into separate target and draft fields.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
1597-1610: Consider centralizing the active target/draft slot selection.

This PR already had to touch several `mUseDraftModel ? draft : target` branches. A small private helper returning the active chunk-size/current-position/prepopulated-length reference would make future state changes much harder to miss.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/tensorrt_llm/batch_manager/llmRequest.h` around lines 1597 - 1610, Several places use the ternary mUseDraftModel ? ... : ... to pick draft vs target state (e.g., mContextChunkSizeDraft / mContextChunkSizeTarget, current-position and prepopulated-length accesses) which is error-prone; add a small private helper (e.g., activeContextChunkSize(), activeCurrentPosition(), activePrepopulatedLength()) that returns a reference to the active member based on mUseDraftModel, then replace all direct ternary selections in setContextChunkSize and other methods with calls to these helpers and use getContextRemainingLength() as before to clamp the size.
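The helper pattern the reviewer suggests could look like the sketch below. This is not the actual `llmRequest.h` code: the class name `LlmRequestSketch` and the `active*()` helper names come from the review comment, and the surrounding class is reduced to just the members involved.

```cpp
#include <algorithm>
#include <cstdint>

// Minimal sketch of the reviewer's suggestion (illustrative, not TRT-LLM source).
// One helper centralizes the `mUseDraftModel ? draft : target` selection, so
// callers can no longer update the wrong model's slot by accident.
class LlmRequestSketch
{
public:
    LlmRequestSketch(bool useDraftModel, int32_t promptLen)
        : mUseDraftModel(useDraftModel)
        , mPromptLen(promptLen)
    {
    }

    // Returns a reference to the active model's chunk-size slot.
    int32_t& activeContextChunkSize()
    {
        return mUseDraftModel ? mContextChunkSizeDraft : mContextChunkSizeTarget;
    }

    // Returns a reference to the active model's current context position.
    int32_t& activeContextCurrentPosition()
    {
        return mUseDraftModel ? mContextCurrentPositionDraft : mContextCurrentPositionTarget;
    }

    int32_t getContextRemainingLength()
    {
        return mPromptLen - activeContextCurrentPosition();
    }

    void setContextChunkSize(int32_t size)
    {
        // Clamp to the remaining context length, as before; only the active slot changes.
        activeContextChunkSize() = std::min(size, getContextRemainingLength());
    }

private:
    bool mUseDraftModel;
    int32_t mPromptLen;
    int32_t mContextChunkSizeTarget{0};
    int32_t mContextChunkSizeDraft{0};
    int32_t mContextCurrentPositionTarget{0};
    int32_t mContextCurrentPositionDraft{0};
};
```

With this shape, `setContextChunkSize` and its callers never spell out the ternary, so adding a new piece of per-model state means adding one helper rather than auditing every call site.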
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/include/tensorrt_llm/batch_manager/llmRequest.h`:
- Around line 1633-1636: moveToNextContextChunk() currently advances and zeroes
both model slots (mContextCurrentPositionDraft/Target and
mContextChunkSizeDraft/Target) unconditionally; change it to only advance and
clear the slot corresponding to the model being progressed so you don't drop the
inactive model's pending chunk. Locate moveToNextContextChunk() and replace the
unconditional updates with a branch that checks which slot is being advanced
(use the same condition logic as
getContextChunkSize()/getContextCurrentPosition(), e.g. mUseDraftModel or
equivalent flag): when advancing the draft slot update
mContextCurrentPositionDraft and set mContextChunkSizeDraft = 0, and when
advancing the target slot update mContextCurrentPositionTarget and set
mContextChunkSizeTarget = 0; if you intentionally need to advance both, ensure
the caller explicitly calls the function for each slot or add a two-slot advance
path to avoid implicit clearing.
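The branching the comment above asks for could be sketched as follows. The struct name `ChunkProgress` is illustrative; the field names mirror the review comment, and the class is reduced to only the fields involved.

```cpp
#include <cstdint>

// Sketch of the suggested fix (illustrative, not TRT-LLM source): only the
// active model's slot is advanced and cleared, so the inactive model's
// pending chunk is not dropped.
struct ChunkProgress
{
    bool mUseDraftModel{false};
    int32_t mContextChunkSizeTarget{0};
    int32_t mContextChunkSizeDraft{0};
    int32_t mContextCurrentPositionTarget{0};
    int32_t mContextCurrentPositionDraft{0};

    void moveToNextContextChunk()
    {
        if (mUseDraftModel)
        {
            mContextCurrentPositionDraft += mContextChunkSizeDraft;
            mContextChunkSizeDraft = 0; // clear only the draft slot
        }
        else
        {
            mContextCurrentPositionTarget += mContextChunkSizeTarget;
            mContextChunkSizeTarget = 0; // the draft slot is left untouched
        }
    }
};
```

A caller that genuinely needs to advance both models would call the function once per slot (toggling `mUseDraftModel` in between) rather than relying on an implicit two-slot clear.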
---
Nitpick comments:
In `@cpp/include/tensorrt_llm/batch_manager/llmRequest.h`:
- Around line 1597-1610: Several places use the ternary mUseDraftModel ? ... :
... to pick draft vs target state (e.g., mContextChunkSizeDraft /
mContextChunkSizeTarget, current-position and prepopulated-length accesses)
which is error-prone; add a small private helper (e.g.,
activeContextChunkSize(), activeCurrentPosition(), activePrepopulatedLength())
that returns a reference to the active member based on mUseDraftModel, then
replace all direct ternary selections in setContextChunkSize and other methods
with calls to these helpers and use getContextRemainingLength() as before to
clamp the size.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: da9a29b9-d668-4270-ada9-782d4830d7b0
📒 Files selected for processing (1)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h
Description
While running separate KV Cache mode (with Eagle for example), we noticed that after a while of sustained traffic, our TRT instances would crash. After debugging, we noticed that this usually happened when the target and draft model had different hit rates for the same prompt, and this caused them to sequentially update the context chunk size during prefill, which caused the chunk size to hit zero and hit an assertion in the forward pass.
This is the code where it gets updated:
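The failure mode can be sketched as below. This is not the verbatim `llmRequest.h` code; the struct and its update logic are simplified stand-ins that reproduce the clobbering described above, assuming each model's KV cache manager re-clamps the shared chunk size after advancing past its own prepopulated (cache-hit) prefix.

```cpp
#include <algorithm>
#include <cstdint>

// Simplified reproduction of the bug (illustrative names, not TRT-LLM source):
// with a single shared chunk-size field, whichever model's cache manager runs
// last overwrites the other's value.
struct SharedChunkState
{
    int32_t mPromptLen{128};
    int32_t mContextChunkSize{0}; // shared between target and draft (the bug)
    int32_t mContextCurrentPositionTarget{0};
    int32_t mContextCurrentPositionDraft{0};

    void setPrepopulatedPromptLen(bool useDraftModel, int32_t prepopulatedLen)
    {
        // Each KV cache manager advances its own position past the cached
        // prefix, then re-clamps the *shared* chunk size to what remains.
        int32_t& pos = useDraftModel ? mContextCurrentPositionDraft
                                     : mContextCurrentPositionTarget;
        pos = prepopulatedLen;
        mContextChunkSize = std::min(mContextChunkSize, mPromptLen - pos);
    }
};

// Target has a partial cache hit (64 of 128 tokens), draft a full hit (128).
// The draft's call floors the shared chunk size to 0, so the target launches
// a zero-token chunk and trips the q.numel() == 0 assertion in MLA
// forward_context. Splitting the field per model removes the interaction.
```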
It seems that some of the variables (like prePromptLen and contextCurrentPosition) were already split based on draft and target, so it feels natural to split chunk size this way also.
Test Coverage
After applying this fix, we ran TRT-LLM under the same sustained traffic and setup as before, and the crash never recurred. It has been stable for a couple of days.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.