
[None][fix] Split mContextChunkSize into per-target/draft fields #12058

Open
Hrithvik-Alex wants to merge 1 commit into NVIDIA:main from
Hrithvik-Alex:fix-separate-kv-cache-chunk-size

Conversation


@Hrithvik-Alex Hrithvik-Alex commented Mar 10, 2026

Split the shared mContextChunkSize into mContextChunkSizeTarget and mContextChunkSizeDraft to prevent the draft KV cache manager from clobbering the target's chunk size during setPrepopulatedPromptLen, which could floor the target chunk size to 0 and cause q.numel()==0 assertion failures in MLA forward_context.

Made-with: Cursor

Summary by CodeRabbit

  • Chores
    • Improved internal context chunk size management to support differentiated handling across multiple configurations.

Description

While running separate KV cache mode (with Eagle, for example), we noticed that after a while of sustained traffic our TRT-LLM instances would crash. Debugging showed that this usually happened when the target and draft models had different KV cache hit rates for the same prompt: each model then sequentially updated the shared context chunk size during prefill, eventually driving it to zero and tripping an assertion in the forward pass.
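To make the failure mode concrete, here is a hedged numeric sketch of the flooring arithmetic. The block size and prepopulated lengths below are illustrative, not taken from a real trace, and the helper names are ours:

```cpp
// Illustrative sketch of the flooring arithmetic in setPrepopulatedPromptLen.
// All numbers are made up to show the mechanism, not taken from a real crash.
int shrinkToBlockBoundary(int prepopulatedPromptLen, int chunkSize, int kvTokensPerBlock)
{
    // End the current chunk at a KV block boundary (same arithmetic as the
    // snippet quoted below).
    int const flooredEndPosition
        = (prepopulatedPromptLen + chunkSize) / kvTokensPerBlock * kvTokensPerBlock;
    return flooredEndPosition - prepopulatedPromptLen;
}

int demoSharedClobber()
{
    int const kvTokensPerBlock = 64;
    int sharedChunkSize = 64; // single field shared by draft and target (pre-fix)

    // Draft model: 32 prepopulated tokens -> chunk shrinks to 64 - 32 = 32.
    sharedChunkSize = shrinkToBlockBoundary(32, sharedChunkSize, kvTokensPerBlock);

    // Target model has a different hit rate (64 prepopulated tokens) but now
    // starts from the draft's shrunken value: (64 + 32) / 64 * 64 - 64 == 0,
    // so the target's chunk size is floored to zero.
    sharedChunkSize = shrinkToBlockBoundary(64, sharedChunkSize, kvTokensPerBlock);
    return sharedChunkSize; // a zero-sized chunk leads to q.numel() == 0 downstream
}
```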

This is the code where it gets updated:

auto& prePromptLen = mUseDraftModel ? mPrepopulatedPromptLenDraft : mPrepopulatedPromptLenTarget;
auto& contextCurrentPosition = mUseDraftModel ? mContextCurrentPositionDraft : mContextCurrentPositionTarget;
prePromptLen = prepopulatedPromptLen;

if (prepopulatedPromptLen > 0)
{
    // Currently, the runtime process is to apply for cache first and then determine prepopulation.
    // Use the prepopulated length to advance the context position and decrease chunk size if necessary.
    auto chunkSize = getContextChunkSize();
    if (prepopulatedPromptLen + chunkSize < promptLen)
    {
        // make sure to end at block boundary after current chunk
        auto const flooredEndPosition
            = (prepopulatedPromptLen + chunkSize) / kvTokensPerBlock * kvTokensPerBlock;
        chunkSize = flooredEndPosition - prepopulatedPromptLen;
        TLLM_CHECK(chunkSize <= getContextChunkSize());
    }
    contextCurrentPosition = prepopulatedPromptLen;
    setContextChunkSize(chunkSize);

    if (!isLastContextChunk())
    {
        TLLM_CHECK_WITH_INFO((getContextCurrentPosition() + getContextChunkSize()) % kvTokensPerBlock == 0,
            "To prevent cache fragmentation, the context position after current chunk should be divisible "
            "by the number of tokens per block, except for the last chunk.");
    }
}

Some of the related state (such as prePromptLen and contextCurrentPosition) was already split into draft and target variants, so it is natural to split the chunk size the same way.
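The resulting accessor pattern might look roughly like the following. This is a minimal sketch based on the PR description; the real llmRequest.h uses SizeType32 and carries many more members:

```cpp
// Minimal sketch of the fix's shape: one field per model, selected by
// mUseDraftModel, so the draft KV cache manager can no longer clobber
// the target's chunk size.
class LlmRequestSketch
{
public:
    explicit LlmRequestSketch(bool useDraftModel)
        : mUseDraftModel(useDraftModel)
    {
    }

    int getContextChunkSize() const
    {
        return mUseDraftModel ? mContextChunkSizeDraft : mContextChunkSizeTarget;
    }

    void setContextChunkSize(int size)
    {
        // Each manager now writes only its own slot.
        (mUseDraftModel ? mContextChunkSizeDraft : mContextChunkSizeTarget) = size;
    }

private:
    bool mUseDraftModel;
    int mContextChunkSizeTarget = 0;
    int mContextChunkSizeDraft = 0;
};
```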

Test Coverage

We ran TRT-LLM with sustained traffic in the same setup after applying this fix, and the crash never recurred. It has been stable for a couple of days.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Hrithvik <hrithvik.alex@baseten.co>

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

The changes refactor internal context chunk size tracking in the LLM request class by splitting a single mContextChunkSize member into two separate members: mContextChunkSizeTarget and mContextChunkSizeDraft. All constructors, accessors, and state management methods are updated to work with both variants, with behavior controlled by the mUseDraftModel flag. The public API surface remains functionally equivalent.

Changes

Cohort / File(s) Summary
Context Chunk Size Refactoring
cpp/include/tensorrt_llm/batch_manager/llmRequest.h
Split single mContextChunkSize member into mContextChunkSizeTarget and mContextChunkSizeDraft. Updated all constructor initialization lists, getter/setter methods to conditionally access the appropriate variant based on mUseDraftModel, and state management methods (moveToNextContextChunk(), pause()) to manipulate both members. Public API behavior preserved.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 37.50%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed — The PR title is specific and clearly describes the main change: splitting mContextChunkSize into per-target/draft fields, which directly relates to the core modification in the changeset.
  • Description check ✅ Passed — The PR description provides a clear explanation of the issue, root cause, and solution, with code context and testing validation.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)

1597-1610: Consider centralizing the active target/draft slot selection.

This PR already had to touch several mUseDraftModel ? draft : target branches. A small private helper returning the active chunk-size/current-position/prepopulated-length reference would make future state changes much harder to miss.
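A hedged sketch of what such a helper could look like (the class and helper names here are hypothetical, not taken from the PR):

```cpp
// Hypothetical shape of the reviewer's suggestion: a single private helper
// encodes the draft/target selection, and every mutator goes through it,
// so a future split field cannot be missed.
class ActiveSlotSketch
{
public:
    explicit ActiveSlotSketch(bool useDraftModel)
        : mUseDraftModel(useDraftModel)
    {
    }

    void setContextChunkSize(int size)
    {
        activeContextChunkSize() = size;
    }

    int getContextChunkSize() const
    {
        return mUseDraftModel ? mContextChunkSizeDraft : mContextChunkSizeTarget;
    }

private:
    // Returns a reference to the active slot; analogous helpers could cover
    // the current-position and prepopulated-length pairs.
    int& activeContextChunkSize()
    {
        return mUseDraftModel ? mContextChunkSizeDraft : mContextChunkSizeTarget;
    }

    bool mUseDraftModel;
    int mContextChunkSizeTarget = 0;
    int mContextChunkSizeDraft = 0;
};
```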

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/tensorrt_llm/batch_manager/llmRequest.h` around lines 1597 -
1610, Several places use the ternary mUseDraftModel ? ... : ... to pick draft vs
target state (e.g., mContextChunkSizeDraft / mContextChunkSizeTarget,
current-position and prepopulated-length accesses) which is error-prone; add a
small private helper (e.g., activeContextChunkSize(), activeCurrentPosition(),
activePrepopulatedLength()) that returns a reference to the active member based
on mUseDraftModel, then replace all direct ternary selections in
setContextChunkSize and other methods with calls to these helpers and use
getContextRemainingLength() as before to clamp the size.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/include/tensorrt_llm/batch_manager/llmRequest.h`:
- Around line 1633-1636: moveToNextContextChunk() currently advances and zeroes
both model slots (mContextCurrentPositionDraft/Target and
mContextChunkSizeDraft/Target) unconditionally; change it to only advance and
clear the slot corresponding to the model being progressed so you don't drop the
inactive model's pending chunk. Locate moveToNextContextChunk() and replace the
unconditional updates with a branch that checks which slot is being advanced
(use the same condition logic as
getContextChunkSize()/getContextCurrentPosition(), e.g. mUseDraftModel or
equivalent flag): when advancing the draft slot update
mContextCurrentPositionDraft and set mContextChunkSizeDraft = 0, and when
advancing the target slot update mContextCurrentPositionTarget and set
mContextChunkSizeTarget = 0; if you intentionally need to advance both, ensure
the caller explicitly calls the function for each slot or add a two-slot advance
path to avoid implicit clearing.
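A minimal sketch of the suggested per-slot advance (member names follow the PR description; the real moveToNextContextChunk has additional checks and state):

```cpp
// Hedged sketch of the review suggestion: advance and clear only the slot
// belonging to the model being progressed, leaving the inactive model's
// pending chunk untouched.
struct ChunkState
{
    bool mUseDraftModel = false;
    int mContextCurrentPositionTarget = 0;
    int mContextCurrentPositionDraft = 0;
    int mContextChunkSizeTarget = 0;
    int mContextChunkSizeDraft = 0;

    void moveToNextContextChunk()
    {
        if (mUseDraftModel)
        {
            mContextCurrentPositionDraft += mContextChunkSizeDraft;
            mContextChunkSizeDraft = 0;
        }
        else
        {
            mContextCurrentPositionTarget += mContextChunkSizeTarget;
            mContextChunkSizeTarget = 0;
        }
    }
};
```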


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: da9a29b9-d668-4270-ada9-782d4830d7b0

📥 Commits

Reviewing files that changed from the base of the PR and between e3d4ba7 and 5e7e8a8.

📒 Files selected for processing (1)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Mar 10, 2026

Labels

Community want to contribute PRs initiated from Community

Projects

None yet

Development


2 participants