
fix(llm): prevent reranker context overflow on large chunks #234

Open

kevin-courbet wants to merge 1 commit into tobi:main from kevin-courbet:fix/reranker-context-overflow

Conversation

@kevin-courbet

Problem

LlamaRankingContext.rankAll throws "input lengths exceed context size" when chunks exceed 2048 tokens. This happens with code blocks, non-ASCII text, or any content where the chunk + query + Qwen3 template overhead exceeds RERANK_CONTEXT_SIZE (2048).

The crash is unrecoverable — the entire rerank call fails, which means search results come back unranked or not at all.

Fix

Two-part fix:

  1. Increase RERANK_CONTEXT_SIZE from 2048 to 8192. The Qwen3-Reranker model supports larger contexts. At 8192 with flash attention, VRAM usage is ~4 GB per context (vs ~960 MB at 2048), a reasonable trade-off for robustness, and still one-fifth of the auto setting (40960).

  2. Add a truncation safety net. Before passing documents to rankAll(), estimate each document's token count (chars / 4) and truncate any that would exceed the context size minus the query and template overhead, so the reranker never crashes even on unexpectedly large chunks (sketched below).
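
For illustration, a minimal sketch of that safety net. The helper name `truncateForRerank`, the template overhead value, and the 4-chars-per-token heuristic are assumptions for this example, not the exact code in src/llm.ts:

```typescript
const RERANK_CONTEXT_SIZE = 8192;
const TEMPLATE_OVERHEAD_TOKENS = 64; // rough allowance for the Qwen3 rerank template (assumed value)

/** Rough token estimate: ~4 characters per token. */
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

/** Truncate documents so query + template + document fits the rerank context. */
function truncateForRerank(query: string, documents: string[]): string[] {
  const budgetTokens =
    RERANK_CONTEXT_SIZE - estimateTokens(query) - TEMPLATE_OVERHEAD_TOKENS;
  const budgetChars = budgetTokens * 4;
  return documents.map((doc) =>
    doc.length > budgetChars ? doc.slice(0, budgetChars) : doc
  );
}

// Usage before reranking with the LlamaRankingContext mentioned above:
// const safeDocs = truncateForRerank(query, documents);
// const scores = await rankingContext.rankAll(query, safeDocs);
```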

Changes

  • src/llm.ts: bump RERANK_CONTEXT_SIZE to 8192, add pre-rankAll truncation logic

Backward compatible — no API or behavioral changes beyond improved resilience. Slightly higher VRAM usage per rerank context.


This fix was developed with AI assistance (Claude). The problem was discovered and validated in a production OpenClaw deployment using QMD with an RTX 5090.

Increase RERANK_CONTEXT_SIZE from 2048 to 8192 and add truncation safety
net to prevent crashes when individual chunks exceed the context window.

The Qwen3-Reranker model supports larger contexts, and 8192 tokens only
uses ~4 GB VRAM with flash attention (vs ~960 MB at 2048), which is a
reasonable trade-off for robustness.

Additionally, before passing documents to rankAll(), estimate each
document's token count (chars/4) and truncate any that would exceed the
context size minus query and template overhead. This ensures the reranker
never crashes even with unexpectedly large chunks.