Attention Grounded Enhancement for Visual Document Retrieval

Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels that do not reveal which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, which hinders their ability to handle non-extractive queries. To alleviate this problem, we propose an \textbf{A}ttention-\textbf{G}rounded \textbf{RE}triever \textbf{E}nhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models as proxy local supervision to guide the identification of relevant document regions. During training, AGREE combines these local signals with global signals to jointly optimize the retriever, enabling it to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging ViDoRe V2 benchmark show that AGREE significantly outperforms the global-supervision-only baseline. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: https://anonymous.4open.science/r/AGREE-2025.
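To make the training idea concrete, the following is a minimal sketch of how a global relevance signal and a local attention-derived signal could be combined for a late-interaction retriever. The abstract does not specify the loss functions, so everything here is an assumption for illustration: the global term is an InfoNCE-style loss over MaxSim scores, the local term is a KL divergence between a hand-written stand-in for MLLM attention and the retriever's own distribution over patches, and the names `patch_relevance`, `local_loss`, `joint_loss`, and `local_weight` are hypothetical rather than taken from the AGREE code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def maxsim_score(query_tokens, doc_patches):
    """Late interaction: each query token keeps its best-matching patch,
    and the per-token maxima are summed into one document score."""
    return sum(max(dot(q, p) for p in doc_patches) for q in query_tokens)

def patch_relevance(query_tokens, doc_patches):
    """Assumed retriever-side distribution over patches: softmax of each
    patch's total similarity to all query tokens."""
    totals = [sum(dot(q, p) for q in query_tokens) for p in doc_patches]
    return softmax(totals)

def global_loss(query_tokens, docs, positive_index):
    """Assumed global signal: InfoNCE-style loss over candidate documents."""
    scores = [maxsim_score(query_tokens, d) for d in docs]
    return -math.log(softmax(scores)[positive_index])

def local_loss(query_tokens, doc_patches, attention_target):
    """Assumed local signal: KL(target || predicted) over patches, where the
    target stands in for normalized MLLM cross-modal attention."""
    predicted = patch_relevance(query_tokens, doc_patches)
    return sum(t * math.log(t / p)
               for t, p in zip(attention_target, predicted) if t > 0)

def joint_loss(query_tokens, docs, positive_index, attention_target,
               local_weight=0.5):
    """Weighted sum of the global and local terms (the weight is hypothetical)."""
    g = global_loss(query_tokens, docs, positive_index)
    l = local_loss(query_tokens, docs[positive_index], attention_target)
    return g + local_weight * l

# Toy data: 2 query tokens, 3 patches per document, 2-dimensional embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
positive_doc = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
negative_doc = [[0.5, 0.0], [0.0, 0.0], [0.0, 0.5]]
attention = [0.7, 0.3, 0.0]  # hand-written stand-in for MLLM attention

loss = joint_loss(query, [positive_doc, negative_doc], 0, attention)
print(f"joint loss: {loss:.4f}")
```

In this sketch the global term only tells the retriever which document should rank first, while the local term pushes its patch-level similarities toward the regions the attention target highlights. A real implementation would operate on batched tensors and would need an actual procedure for extracting and normalizing attention from the multimodal model.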
