Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model's representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever's query embedding using guidance from a complementary retriever's scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://github.com/IBM/test-time-hybrid-retrieval.
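To make the idea of test-time query refinement concrete, the sketch below illustrates one way guidance from a complementary retriever could be applied. It is a minimal, hypothetical illustration, not the paper's implementation: the function name `guided_query_refinement`, the KL-divergence guidance term, the L2 regularizer toward the original query, and all hyperparameter values are assumptions chosen for clarity; the actual objective and optimization schedule are specified in the paper and released code.

```python
import torch
import torch.nn.functional as F

def guided_query_refinement(q_primary, doc_embs, comp_scores,
                            steps=20, lr=0.05, temp=1.0, reg=0.1):
    """Hypothetical sketch: refine a primary retriever's query embedding at
    test time using a complementary retriever's scores over the same candidates.

    q_primary:   (d,)   query embedding from the primary (vision-centric) retriever
    doc_embs:    (k, d) candidate document embeddings from the primary retriever
    comp_scores: (k,)   scores for the same k candidates from the complementary
                        (lightweight text) retriever
    """
    # Start from the original query and make it trainable.
    q = q_primary.clone().detach().requires_grad_(True)
    # Guidance distribution over candidates from the complementary retriever.
    target = F.softmax(comp_scores / temp, dim=-1)
    optimizer = torch.optim.Adam([q], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        primary_scores = doc_embs @ q                     # dot-product relevance
        log_probs = F.log_softmax(primary_scores / temp, dim=-1)
        # Pull the primary score distribution toward the complementary one,
        # while keeping the refined query close to its original embedding.
        loss = F.kl_div(log_probs, target, reduction="sum") \
               + reg * (q - q_primary).pow(2).sum()
        loss.backward()
        optimizer.step()

    return q.detach()


# Example usage on random tensors: refine the query, then re-score candidates.
q = torch.randn(128)
docs = torch.randn(100, 128)
text_scores = torch.randn(100)
q_refined = guided_query_refinement(q, docs, text_scores)
reranked = (docs @ q_refined).argsort(descending=True)
```

Under these assumptions, the refined query stays in the primary model's embedding space, so the complementary retriever influences ranking without its representations ever being fused or stored alongside the primary index.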