Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image. While offering unmatched retrieval performance, such models 1) are typically pretrained from scratch and are thus less scalable, and 2) suffer from huge retrieval latency and inefficiency issues, which makes them impractical in realistic applications. To address these crucial gaps towards both improved and efficient cross-modal retrieval, we propose a novel fine-tuning framework that turns any pretrained text-image multi-modal model into an efficient retrieval model. The framework is based on a cooperative retrieve-and-rerank approach that combines: 1) twin networks, which separately encode all items of a corpus and thereby enable efficient initial retrieval, and 2) a cross-encoder component for a more nuanced (i.e., smarter) ranking of the small set of retrieved items. We also propose to jointly fine-tune the two components with shared weights, yielding a more parameter-efficient model. Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups demonstrate improved accuracy and huge efficiency benefits over state-of-the-art cross-encoders.
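To make the two-stage idea concrete, below is a minimal sketch of a cooperative retrieve-and-rerank pipeline. The encoder callables (bi_encode_text, bi_encode_image, cross_score) and the cutoff k are hypothetical stand-ins for illustration, not the paper's actual models; in the proposed framework both stages would share the weights of one pretrained text-image model.

```python
# Illustrative sketch of retrieve-and-rerank (assumed interfaces, not the paper's code).
import numpy as np

def retrieve_and_rerank(query_text, corpus_images,
                        bi_encode_text, bi_encode_image, cross_score, k=20):
    """Return corpus indices ranked by the two-stage pipeline."""
    # Stage 1: twin (bi-encoder) networks embed the query and corpus items
    # separately, so corpus embeddings can be precomputed and searched cheaply.
    q = bi_encode_text(query_text)                              # shape: (d,)
    C = np.stack([bi_encode_image(img) for img in corpus_images])  # shape: (N, d)
    sims = C @ q / (np.linalg.norm(C, axis=1) * np.linalg.norm(q) + 1e-9)
    top_k = np.argsort(-sims)[:k]                               # fast initial retrieval

    # Stage 2: a cross-encoder jointly attends over each (query, candidate) pair,
    # a slower but more accurate scorer applied only to the small retrieved set.
    rerank_scores = np.array([cross_score(query_text, corpus_images[i]) for i in top_k])
    return top_k[np.argsort(-rerank_scores)]
```

The design choice this illustrates is the cooperation between the stages: the expensive cross-attention scoring is restricted to the k candidates surfaced by the efficient twin-network retrieval, which is where the latency savings over a full cross-encoder search come from.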