
VTCBench: Can Vision-Language Models Understand Long Contexts with Vision-Text Compression?

github.com/Moenupa/VTCBench github.com/bjzhb666/VLMEvalKit

VTCBench is the first comprehensive benchmark specifically designed to evaluate the long-context understanding capabilities of Vision-Language Models (VLMs) within the Vision-Text Compression (VTC) paradigm.

VTC is an emerging paradigm that converts long texts into dense 2D visual representations (images), achieving token compression ratios of 2-10x relative to standard text tokenization. VTCBench rigorously assesses whether VLMs actually understand this compressed information or merely perform surface-level OCR.
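As a rough illustration of the paradigm (not VTCBench's actual rendering pipeline), the sketch below renders text onto a white page with Pillow and estimates $r_\text{VTC}$ as the ratio of text tokens to vision tokens. The tokenizer name and the 28x28-pixel patch size are assumptions; the exact vision-token accounting differs per model.

```python
# Minimal sketch (not the official pipeline): render text to an image and
# estimate the vision-text compression ratio r_VTC = text_tokens / vision_tokens.
# Assumes Pillow and a Hugging Face tokenizer; the patch-based vision-token
# count is a rough approximation that varies by model.
from PIL import Image, ImageDraw, ImageFont
from transformers import AutoTokenizer
import textwrap

def render_text(text: str, width_px: int = 1024, font_size: int = 16) -> Image.Image:
    font = ImageFont.load_default()  # swap in a TrueType font (e.g., Helvetica) for real font control
    chars_per_line = width_px // (font_size // 2)
    lines = textwrap.wrap(text, width=chars_per_line)
    line_height = int(font_size * 1.4)
    img = Image.new("RGB", (width_px, (len(lines) + 1) * line_height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, i * line_height), line, fill="black", font=font)
    return img

def estimate_vtc_ratio(text: str, img: Image.Image, patch: int = 28) -> float:
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder tokenizer
    n_text = len(tok(text)["input_ids"])
    n_vision = (img.width // patch) * (img.height // patch)
    return n_text / max(n_vision, 1)

long_text = "One of the special magic numbers for long-context is: 2026. " * 200
page = render_text(long_text)
print(f"r_VTC is approximately {estimate_vtc_ratio(long_text, page):.2f}")
```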

🚀 Key Features

Three Core Tasks: Retrieval, Reasoning, and Memory

  • VTC-Retrieval: A visual "Needle-In-A-Haystack" (NIAH) test. Requires locating "needles" (key-value pairs) embedded within a large "haystack" of distractors.
  • VTC-Reasoning: Tests associative reasoning with minimized literal overlap between query and key, requiring inference of latent associations.
  • VTC-Memory: Multi-turn conversations testing long-term memory retention.

VTCBench-Wild

A wild variant designed to simulate real-world visual diversity (e.g., varying fonts, backgrounds, and layouts).

Two Evaluation Settings

  • Predefined VTC Ratio: Fixes the compression ratio in advance (e.g., $r_\text{VTC}=2.0$) so that models are compared at a standardized information density.
  • Predefined Rendering: Uses a fixed document format (12-pt Helvetica, 96 DPI) to simulate realistic document processing.
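A minimal sketch of how these two settings might be expressed as configuration. The field names are assumptions made for illustration, not the repository's actual configuration schema.

```python
# Illustrative configuration for the two evaluation settings (hypothetical schema).
from dataclasses import dataclass
from typing import Optional

@dataclass
class RenderConfig:
    font_family: str = "Helvetica"
    font_size_pt: int = 12
    dpi: int = 96

    @property
    def font_size_px(self) -> float:
        # 1 pt = 1/72 inch, so pixels = pt * dpi / 72 (12 pt at 96 DPI = 16 px).
        return self.font_size_pt * self.dpi / 72

@dataclass
class EvalSetting:
    # Exactly one field is set:
    #   target_vtc_ratio -> "Predefined VTC Ratio": rendering is tuned until
    #                       r_VTC = text_tokens / vision_tokens hits the target.
    #   render           -> "Predefined Rendering": the page format is fixed and
    #                       r_VTC is whatever that format yields.
    target_vtc_ratio: Optional[float] = None
    render: Optional[RenderConfig] = None

fixed_ratio = EvalSetting(target_vtc_ratio=2.0)
fixed_render = EvalSetting(render=RenderConfig())
print(fixed_render.render.font_size_px)  # 16.0
```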

Extensive Model Coverage

Benchmarks 13 leading models, including GPT-5, Gemini-2.5 Pro, Gemma, Glyph, and the Qwen2.5, Qwen3, and InternVL3.5 series, among others.

Easily extensible to new models via our server-client evaluation framework.
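For illustration only, the sketch below queries a VLM served behind an OpenAI-compatible endpoint (e.g., via vLLM) with a rendered page and a question. The endpoint, model name, and file name are placeholders; the actual server-client protocol used by VTCBench is described in the Usage Guide.

```python
# Generic sketch of querying a served VLM with a rendered page plus a question.
# NOT VTCBench's server-client interface -- see the Usage Guide for the real one.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

with open("page_000.png", "rb") as f:  # placeholder rendered page
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "What is the special magic number for long-context?"},
        ],
    }],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```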

📊 Benchmark Tasks

VTC-Retrieval (NIAH)
  • Task categories: Lexical Matching, Multi-Hop Tracing, Aggregation
  • Context: dynamic query/key-value pairs (types: word-word, word-number, uuid-number) embedded in essays (see visual example), e.g.:
    "... One of the special magic numbers for long-context is: 2026. ... One of the special magic numbers for distracting-information is: 2025. ..."
  • Evaluation, QA variant:
    Q: What's the special magic number for long-context?
    A: 2026.
  • Evaluation, completion variant:
    Prompt: One of the special magic numbers for long-context is:
    Completion: 2026.

VTC-Reasoning (NIAH)
  • Task categories: Associative Reasoning, Question-Answering
  • Context: dynamic query/key-value pairs (type: event/action-person) embedded in books (see visual example), e.g.:
    "... There was a vegan guest, named Katie. ..."
  • Evaluation, one-hop reasoning:
    Q: Which character cannot eat fish-based meals?
    A: Katie.
  • Evaluation, two-hop reasoning:
    Q: Which character cannot eat Brandade meals?
    A: Katie.

VTC-Memory (QA)
  • Task categories: Memory, Question-Answering
  • Context: fully static, no dynamic query/key-value; multi-turn conversations (see visual example), e.g.:
    "Caroline: Researching adoption agencies—it's been a dream to have a family and give a loving home to kids who need it."
  • Evaluation:
    Q: What did Caroline research?
    A: Adoption agencies.

VTCBench-Wild
  • Task categories: all of the above
  • A more challenging variant of the above tasks, introducing visual diversity to simulate real-world document conditions.
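To make the VTC-Retrieval format concrete, here is a hypothetical generator (not the official data pipeline) that drops a needle and a distractor into a haystack of filler sentences and emits both evaluation variants.

```python
# Illustrative VTC-Retrieval-style sample generator: a "needle" key-value
# sentence is inserted into a haystack of filler text, together with its
# QA and completion evaluation variants. Hypothetical helper, for illustration.
import random

NEEDLE = "One of the special magic numbers for {key} is: {value}."

def make_retrieval_sample(haystack_sents, key="long-context", value="2026",
                          distractor=("distracting-information", "2025")):
    sents = list(haystack_sents)
    # Insert the distractor and the needle at random positions in the haystack.
    sents.insert(random.randrange(len(sents) + 1),
                 NEEDLE.format(key=distractor[0], value=distractor[1]))
    sents.insert(random.randrange(len(sents) + 1),
                 NEEDLE.format(key=key, value=value))
    context = " ".join(sents)
    qa = {"question": f"What's the special magic number for {key}?",
          "answer": value}
    completion = {"prompt": f"One of the special magic numbers for {key} is:",
                  "completion": value}
    return {"context": context, "qa": qa, "completion": completion}

sample = make_retrieval_sample([f"Essay sentence number {i}." for i in range(50)])
print(sample["qa"])
```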

📈 Main Findings

(Figure: VTCBench main results)

  • Perception ≠ Comprehension: While many VLMs excel at OCR and simple retrieval, their performance collapses on reasoning and memory tasks compared to text-only LLMs.
  • Length Fragility: VLM performance degrades significantly as the context length increases (e.g., from 1k up to 32k tokens).
  • Parameter Sensitivity: VTC performance is highly sensitive to font size and the spatial positioning of information.

🛠 Usage & Data

Please refer to the Usage Guide for instructions on how to use VTCBench.

📄 Citation

@misc{zhao2025vtcbenchvisionlanguagemodelsunderstand,
      title={VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?},
      author={Hongbo Zhao and Meng Wang and Fei Zhu and Wenzhuo Liu and Bolin Ni and Fanhu Zeng and Gaofeng Meng and Zhaoxiang Zhang},
      year={2025},
      eprint={2512.15649},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.15649},
}