The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite decoding textual information well (e.g., via OCR), most VLMs exhibit surprisingly poor long-context understanding of VTC-compressed information, failing to capture long-range associations or dependencies in the context. This study provides a deeper understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
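To make the compression arithmetic concrete, the sketch below renders long text onto a page image and compares the text-token count with the vision-token count that a patch-based VLM encoder would produce. The page size, patch size, font handling, and the 4-characters-per-token heuristic are illustrative assumptions for this sketch only, not the settings used by DeepSeek-OCR, Glyph, or the benchmark itself.

```python
# Minimal sketch of vision-text compression (VTC): render text as a 2D page
# image, then estimate the token compression ratio as
# (text tokens) / (vision tokens for one page).
# All constants below (page size, patch size, chars-per-token) are assumptions.
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_page(text: str, width: int = 896, height: int = 896,
                line_height: int = 14) -> Image.Image:
    """Render text onto a single page image (the dense 2D visual representation)."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # a real pipeline would use a scalable font
    chars_per_line = width // (line_height // 2)  # rough fit for the page width
    y = 0
    for line in textwrap.wrap(text, width=chars_per_line):
        if y + line_height > height:
            break  # single page in this sketch; real systems paginate
        draw.text((0, y), line, fill="black", font=font)
        y += line_height
    return img

def compression_ratio(text: str, patch: int = 28) -> float:
    """Estimate compression: text tokens vs. vision tokens for one rendered page."""
    text_tokens = max(1, len(text) // 4)  # ~4 characters per text token (heuristic)
    page = render_page(text)
    vision_tokens = (page.width // patch) * (page.height // patch)
    return text_tokens / vision_tokens

if __name__ == "__main__":
    sample = "Long-context document text for the compression estimate. " * 300
    print(f"approx. compression ratio: {compression_ratio(sample):.1f}x")
```

Under these assumptions a 896x896 page with 28-pixel patches yields 1024 vision tokens, so a page holding roughly 12k-15k characters of text lands in the low end of the 3x-20x range quoted above; denser rendering (smaller fonts, larger patches) pushes the ratio higher.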