Large-scale single-stream pre-training has shown dramatic performance in image-text retrieval. Regrettably, it suffers from low inference efficiency due to its heavy attention layers. Recently, two-stream methods like CLIP and ALIGN with high inference efficiency have also shown promising performance; however, they only consider instance-level alignment between the two streams (thus there is still room for improvement). To overcome these limitations, we propose a novel COllaborative Two-Stream vision-language pre-training model termed COTS for image-text retrieval by enhancing cross-modal interaction. In addition to instance-level alignment via momentum contrastive learning, we leverage two extra levels of cross-modal interactions in our COTS: (1) Token-level interaction: a masked vision-language modeling (MVLM) learning objective is devised without using a cross-stream network module, where a variational autoencoder is imposed on the visual encoder to generate visual tokens for each image. (2) Task-level interaction: a KL-alignment learning objective is devised between the text-to-image and image-to-text retrieval tasks, where the probability distribution per task is computed with the negative queues in momentum contrastive learning. Under a fair comparison setting, our COTS achieves the highest performance among all two-stream methods and comparable performance (but with 10,800× faster inference) w.r.t. the latest single-stream methods. Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art results on the widely-used MSR-VTT dataset.
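To make the task-level interaction concrete, the KL-alignment objective can be sketched as follows. This is a minimal formulation in our own notation (the similarity function sim, temperature \tau, and queue size K are illustrative symbols and may differ from the paper's exact definitions): given an image feature v_i, its paired text feature t_0, and a momentum queue of K negative text features t_1, \dots, t_K, the image-to-text retrieval distribution is

    p^{i2t}_j(v_i) = \frac{\exp\big(\mathrm{sim}(v_i, t_j)/\tau\big)}{\sum_{k=0}^{K} \exp\big(\mathrm{sim}(v_i, t_k)/\tau\big)}, \qquad j = 0, \dots, K.

The text-to-image distribution p^{t2i}(t_i) is defined symmetrically over the image queue, and the task-level objective aligns the two task distributions with a symmetric KL divergence:

    \mathcal{L}_{\mathrm{KL}} = \tfrac{1}{2}\Big[\mathrm{KL}\big(p^{i2t}(v_i)\,\|\,p^{t2i}(t_i)\big) + \mathrm{KL}\big(p^{t2i}(t_i)\,\|\,p^{i2t}(v_i)\big)\Big].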