This paper presents Vision-Language Universal Search (VL-UnivSearch), a unified model for multi-modality retrieval. VL-UnivSearch encodes queries and multi-modality sources in a universal embedding space to retrieve relevant candidates and route modalities. To learn an embedding space tailored to multi-modality retrieval, VL-UnivSearch introduces two techniques: 1) universal embedding optimization, which contrastively optimizes the embedding space using modality-balanced hard negatives; and 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. VL-UnivSearch achieves state-of-the-art performance on the multi-modality open-domain question answering benchmark WebQA and outperforms all retrieval baselines on each single-modality task. These results demonstrate that universal multi-modality search can replace the divide-and-conquer pipeline with a unified model while also benefiting per-modality tasks. All source code of this work will be released via GitHub.
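To make the first technique concrete, below is a minimal sketch of a contrastive loss with modality-balanced hard negatives, assuming a shared encoder that produces L2-normalized query and candidate embeddings. The function name, tensor shapes, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def modality_balanced_contrastive_loss(q_emb, pos_emb, text_neg_emb, img_neg_emb,
                                        temperature=0.05):
    """Hypothetical sketch: contrastive loss over one positive and an equal
    number of hard negatives drawn from each modality.

    q_emb:        (B, D) query embeddings
    pos_emb:      (B, D) positive candidate embedding per query
    text_neg_emb: (B, K, D) hard negative text candidates per query
    img_neg_emb:  (B, K, D) hard negative image candidates per query
    """
    # L2-normalize so dot products are cosine similarities.
    q = F.normalize(q_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    # Balance negatives: concatenate K text and K image hard negatives per query.
    negs = F.normalize(torch.cat([text_neg_emb, img_neg_emb], dim=1), dim=-1)  # (B, 2K, D)

    pos_score = (q * pos).sum(dim=-1, keepdim=True)       # (B, 1)
    neg_score = torch.einsum('bd,bkd->bk', q, negs)       # (B, 2K)

    # The positive sits at index 0 of the logits for every query.
    logits = torch.cat([pos_score, neg_score], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```

The key design choice this sketch illustrates is that negatives are sampled in equal numbers from the text and image corpora, so neither modality dominates the gradient signal when the two candidate pools are merged into one embedding space.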