
[None][fix] Split mContextChunkSize into per-target/draft fields #12058

Open
Hrithvik-Alex wants to merge 1 commit into NVIDIA:main from
Hrithvik-Alex:fix-separate-kv-cache-chunk-size

Conversation


@Hrithvik-Alex Hrithvik-Alex commented Mar 10, 2026

Split the shared mContextChunkSize into mContextChunkSizeTarget and mContextChunkSizeDraft to prevent the draft KV cache manager from clobbering the target's chunk size during setPrepopulatedPromptLen, which could floor the target chunk size to 0 and cause q.numel()==0 assertion failures in MLA forward_context.

Made-with: Cursor

Summary by CodeRabbit

  • Chores
    • Improved internal context chunk size management to support differentiated handling across multiple configurations.

Description

While running separate KV cache mode (with Eagle, for example), we noticed that after a while of sustained traffic our TRT-LLM instances would crash. Debugging showed that this usually happened when the target and draft models had different KV cache hit rates for the same prompt: each model then sequentially updated the shared context chunk size during prefill, eventually driving it to zero and tripping an assertion in the forward pass.
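To make the failure mode concrete, here is a hedged numeric sketch of the flooring arithmetic. The block size and prepopulated lengths below are illustrative, not taken from a real trace, and the helper names are ours:

```cpp
// Illustrative sketch of the flooring arithmetic in setPrepopulatedPromptLen.
// All numbers are made up to show the mechanism, not taken from a real crash.
int shrinkToBlockBoundary(int prepopulatedPromptLen, int chunkSize, int kvTokensPerBlock)
{
    // End the current chunk at a KV block boundary (same arithmetic as the
    // snippet quoted below).
    int const flooredEndPosition
        = (prepopulatedPromptLen + chunkSize) / kvTokensPerBlock * kvTokensPerBlock;
    return flooredEndPosition - prepopulatedPromptLen;
}

int demoSharedClobber()
{
    int const kvTokensPerBlock = 64;
    int sharedChunkSize = 64; // single field shared by draft and target (pre-fix)

    // Draft model: 32 prepopulated tokens -> chunk shrinks to 64 - 32 = 32.
    sharedChunkSize = shrinkToBlockBoundary(32, sharedChunkSize, kvTokensPerBlock);

    // Target model has a different hit rate (64 prepopulated tokens) but now
    // starts from the draft's shrunken value: (64 + 32) / 64 * 64 - 64 == 0,
    // so the target's chunk size is floored to zero.
    sharedChunkSize = shrinkToBlockBoundary(64, sharedChunkSize, kvTokensPerBlock);
    return sharedChunkSize; // a zero-sized chunk leads to q.numel() == 0 downstream
}
```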

This is the code where it gets updated:

auto& prePromptLen = mUseDraftModel ? mPrepopulatedPromptLenDraft : mPrepopulatedPromptLenTarget;
auto& contextCurrentPosition = mUseDraftModel ? mContextCurrentPositionDraft : mContextCurrentPositionTarget;
prePromptLen = prepopulatedPromptLen;

if (prepopulatedPromptLen > 0)
{
    // Currently, the runtime process is to apply for cache first and then determine prepopulation.
    // Use the prepopulated length to advance the context position and decrease chunk size if necessary.
    auto chunkSize = getContextChunkSize();
    if (prepopulatedPromptLen + chunkSize < promptLen)
    {
        // make sure to end at block boundary after current chunk
        auto const flooredEndPosition
            = (prepopulatedPromptLen + chunkSize) / kvTokensPerBlock * kvTokensPerBlock;
        chunkSize = flooredEndPosition - prepopulatedPromptLen;
        TLLM_CHECK(chunkSize <= getContextChunkSize());
    }
    contextCurrentPosition = prepopulatedPromptLen;
    setContextChunkSize(chunkSize);

    if (!isLastContextChunk())
    {
        TLLM_CHECK_WITH_INFO((getContextCurrentPosition() + getContextChunkSize()) % kvTokensPerBlock == 0,
            "To prevent cache fragmentation, the context position after current chunk should be divisible "
            "by the number of tokens per block, except for the last chunk.");
    }
}

Some of the related state (such as prePromptLen and contextCurrentPosition) was already split into draft and target variants, so it is natural to split the chunk size the same way.
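The resulting accessor pattern might look roughly like the following. This is a minimal sketch based on the PR description; the real llmRequest.h uses SizeType32 and carries many more members:

```cpp
// Minimal sketch of the fix's shape: one field per model, selected by
// mUseDraftModel, so the draft KV cache manager can no longer clobber
// the target's chunk size.
class LlmRequestSketch
{
public:
    explicit LlmRequestSketch(bool useDraftModel)
        : mUseDraftModel(useDraftModel)
    {
    }

    int getContextChunkSize() const
    {
        return mUseDraftModel ? mContextChunkSizeDraft : mContextChunkSizeTarget;
    }

    void setContextChunkSize(int size)
    {
        // Each manager now writes only its own slot.
        (mUseDraftModel ? mContextChunkSizeDraft : mContextChunkSizeTarget) = size;
    }

private:
    bool mUseDraftModel;
    int mContextChunkSizeTarget = 0;
    int mContextChunkSizeDraft = 0;
};
```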

Test Coverage

We ran TRT-LLM with sustained traffic in the same setup after applying this fix, and the crash never recurred. It has been stable for a couple of days.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Hrithvik <hrithvik.alex@baseten.co>

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

The changes refactor internal context chunk size tracking in the LLM request class by splitting a single mContextChunkSize member into two separate members: mContextChunkSizeTarget and mContextChunkSizeDraft. All constructors, accessors, and state management methods are updated to work with both variants, with behavior controlled by the mUseDraftModel flag. The public API surface remains functionally equivalent.

Changes

Cohort / File(s) Summary
Context Chunk Size Refactoring
cpp/include/tensorrt_llm/batch_manager/llmRequest.h
Split single mContextChunkSize member into mContextChunkSizeTarget and mContextChunkSizeDraft. Updated all constructor initialization lists, getter/setter methods to conditionally access the appropriate variant based on mUseDraftModel, and state management methods (moveToNextContextChunk(), pause()) to manipulate both members. Public API behavior preserved.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 37.50%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed — The PR title is specific and clearly describes the main change: splitting mContextChunkSize into per-target/draft fields, which directly relates to the core modification in the changeset.
  • Description check ✅ Passed — The PR description provides a clear explanation of the issue, root cause, and solution, with code context and testing validation.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)

1597-1610: Consider centralizing the active target/draft slot selection.

This PR already had to touch several mUseDraftModel ? draft : target branches. A small private helper returning the active chunk-size/current-position/prepopulated-length reference would make future state changes much harder to miss.
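A hedged sketch of what such a helper could look like (the class and helper names here are hypothetical, not taken from the PR):

```cpp
// Hypothetical shape of the reviewer's suggestion: a single private helper
// encodes the draft/target selection, and every mutator goes through it,
// so a future split field cannot be missed.
class ActiveSlotSketch
{
public:
    explicit ActiveSlotSketch(bool useDraftModel)
        : mUseDraftModel(useDraftModel)
    {
    }

    void setContextChunkSize(int size)
    {
        activeContextChunkSize() = size;
    }

    int getContextChunkSize() const
    {
        return mUseDraftModel ? mContextChunkSizeDraft : mContextChunkSizeTarget;
    }

private:
    // Returns a reference to the active slot; analogous helpers could cover
    // the current-position and prepopulated-length pairs.
    int& activeContextChunkSize()
    {
        return mUseDraftModel ? mContextChunkSizeDraft : mContextChunkSizeTarget;
    }

    bool mUseDraftModel;
    int mContextChunkSizeTarget = 0;
    int mContextChunkSizeDraft = 0;
};
```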

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/tensorrt_llm/batch_manager/llmRequest.h` around lines 1597 -
1610, Several places use the ternary mUseDraftModel ? ... : ... to pick draft vs
target state (e.g., mContextChunkSizeDraft / mContextChunkSizeTarget,
current-position and prepopulated-length accesses) which is error-prone; add a
small private helper (e.g., activeContextChunkSize(), activeCurrentPosition(),
activePrepopulatedLength()) that returns a reference to the active member based
on mUseDraftModel, then replace all direct ternary selections in
setContextChunkSize and other methods with calls to these helpers and use
getContextRemainingLength() as before to clamp the size.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/include/tensorrt_llm/batch_manager/llmRequest.h`:
- Around line 1633-1636: moveToNextContextChunk() currently advances and zeroes
both model slots (mContextCurrentPositionDraft/Target and
mContextChunkSizeDraft/Target) unconditionally; change it to only advance and
clear the slot corresponding to the model being progressed so you don't drop the
inactive model's pending chunk. Locate moveToNextContextChunk() and replace the
unconditional updates with a branch that checks which slot is being advanced
(use the same condition logic as
getContextChunkSize()/getContextCurrentPosition(), e.g. mUseDraftModel or
equivalent flag): when advancing the draft slot update
mContextCurrentPositionDraft and set mContextChunkSizeDraft = 0, and when
advancing the target slot update mContextCurrentPositionTarget and set
mContextChunkSizeTarget = 0; if you intentionally need to advance both, ensure
the caller explicitly calls the function for each slot or add a two-slot advance
path to avoid implicit clearing.
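A minimal sketch of the suggested per-slot advance (member names follow the PR description; the real moveToNextContextChunk has additional checks and state):

```cpp
// Hedged sketch of the review suggestion: advance and clear only the slot
// belonging to the model being progressed, leaving the inactive model's
// pending chunk untouched.
struct ChunkState
{
    bool mUseDraftModel = false;
    int mContextCurrentPositionTarget = 0;
    int mContextCurrentPositionDraft = 0;
    int mContextChunkSizeTarget = 0;
    int mContextChunkSizeDraft = 0;

    void moveToNextContextChunk()
    {
        if (mUseDraftModel)
        {
            mContextCurrentPositionDraft += mContextChunkSizeDraft;
            mContextChunkSizeDraft = 0;
        }
        else
        {
            mContextCurrentPositionTarget += mContextChunkSizeTarget;
            mContextChunkSizeTarget = 0;
        }
    }
};
```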


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: da9a29b9-d668-4270-ada9-782d4830d7b0

📥 Commits

Reviewing files that changed from the base of the PR and between e3d4ba7 and 5e7e8a8.

📒 Files selected for processing (1)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Mar 10, 2026

Labels

Community want to contribute PRs initiated from Community

Projects

None yet

Development


2 participants