fix(llm): prevent reranker context overflow on large chunks #234
Open
kevin-courbet wants to merge 1 commit into tobi:main
Conversation
Increase RERANK_CONTEXT_SIZE from 2048 to 8192 and add truncation safety net to prevent crashes when individual chunks exceed the context window. The Qwen3-Reranker model supports larger contexts, and 8192 tokens only uses ~4 GB VRAM with flash attention (vs ~960 MB at 2048), which is a reasonable trade-off for robustness. Additionally, before passing documents to rankAll(), estimate each document's token count (chars/4) and truncate any that would exceed the context size minus query and template overhead. This ensures the reranker never crashes even with unexpectedly large chunks.
Problem
`LlamaRankingContext.rankAll` throws "input lengths exceed context size" when chunks exceed 2048 tokens. This happens with code blocks, non-ASCII text, or any content where the chunk + query + Qwen3 template overhead exceeds `RERANK_CONTEXT_SIZE` (2048). The crash is unrecoverable: the entire rerank call fails, which means search results come back unranked or not at all.
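A minimal sketch of the failure mode, assuming node-llama-cpp's ranking API (the exact `loadModel`/`createRankingContext` option names are assumptions, and the model path is a placeholder):

```ts
import {getLlama} from "node-llama-cpp";

// Load a reranker model and create a ranking context at the old 2048-token size.
const llama = await getLlama();
const model = await llama.loadModel({modelPath: "qwen3-reranker.gguf"}); // placeholder path
const context = await model.createRankingContext({contextSize: 2048}); // assumed option name

// A chunk well past 2048 tokens, e.g. a large generated code block.
const hugeChunk = "function f() {}\n".repeat(2000);

// Throws "input lengths exceed context size": the chunk + query +
// Qwen3 template overhead does not fit in the 2048-token window.
await context.rankAll("how is f defined?", [hugeChunk]);
```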
Fix
Two-part fix:
1. **Increase `RERANK_CONTEXT_SIZE` from 2048 → 8192.** The Qwen3-Reranker model supports larger contexts. At 8192 with flash attention, VRAM usage is ~4 GB per context (vs ~960 MB at 2048), a reasonable trade-off for robustness. Still 5× less than auto (40960).
2. **Add truncation safety net.** Before passing documents to `rankAll()`, estimate each document's token count (chars / 4) and truncate any that would exceed the context size minus query and template overhead, as sketched below. This ensures the reranker never crashes even with unexpectedly large chunks.
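A minimal sketch of the truncation safety net under those assumptions (`truncateForRerank` and the overhead constant are illustrative names, not the actual code in `src/llm.ts`):

```ts
const RERANK_CONTEXT_SIZE = 8192;
const TEMPLATE_OVERHEAD_TOKENS = 512; // assumed budget for the Qwen3 rerank template

// Truncate any document whose estimated token count (chars / 4) would not
// fit in the context window alongside the query and template overhead.
function truncateForRerank(documents: string[], query: string): string[] {
    const estimatedQueryTokens = Math.ceil(query.length / 4);
    const docTokenBudget =
        RERANK_CONTEXT_SIZE - estimatedQueryTokens - TEMPLATE_OVERHEAD_TOKENS;
    const docCharBudget = docTokenBudget * 4; // invert the chars/4 estimate

    return documents.map((doc) =>
        doc.length > docCharBudget ? doc.slice(0, docCharBudget) : doc
    );
}

// Usage: truncate before handing documents to rankAll().
// const safeDocs = truncateForRerank(chunks, query);
// const scores = await rankingContext.rankAll(query, safeDocs);
```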
Changes
`src/llm.ts`: bump `RERANK_CONTEXT_SIZE` to 8192, add pre-`rankAll` truncation logic.
Backward compatible: no API or behavioral changes beyond improved resilience. Slightly higher VRAM usage per rerank context.
This fix was developed with AI assistance (Claude). The problem was discovered and validated in a production OpenClaw deployment using QMD with an RTX 5090.