Our objective is language-based search of large-scale image and video datasets. For this task, the approach of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive because retrieval scales efficiently to billions of images using approximate nearest-neighbour search. An alternative approach, vision-text transformers with cross-attention, gives considerable improvements in accuracy over joint embeddings, but is often inapplicable in practice for large-scale retrieval because the cross-attention mechanism must be evaluated for every sample at test time. This work combines the best of both worlds. We make the following three contributions. First, we equip transformer-based models with a new fine-grained cross-attention architecture, providing significant improvements in retrieval accuracy whilst preserving scalability. Second, we introduce a generic approach for combining a Fast dual-encoder model with our Slow but accurate transformer-based model via distillation and re-ranking. Finally, we validate our approach on the Flickr30K image dataset, where we show an increase in inference speed of several orders of magnitude while remaining competitive with the state of the art. We also extend our method to the video domain, improving the state of the art on the VATEX dataset.
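As a concrete illustration of the Fast-and-Slow combination described above, the sketch below shows a generic retrieve-then-rerank pipeline: a dual encoder cheaply scores all images (a brute-force stand-in for an approximate nearest-neighbour index), and an expensive cross-attention scorer is applied only to the returned top-k candidates. This is a minimal sketch assuming L2-normalised embeddings; the names `fast_retrieve`, `rerank` and `slow_scorer` are hypothetical and do not correspond to the paper's actual implementation.

```python
import numpy as np

def fast_retrieve(query_emb, image_embs, k=100):
    """Fast stage: dual-encoder retrieval. Scores every image with a
    single dot product against the query embedding; at billion-image
    scale this brute-force scan would be replaced by an approximate
    nearest-neighbour index."""
    scores = image_embs @ query_emb   # (N,) cosine similarities for unit vectors
    return np.argsort(-scores)[:k]    # indices of the k best candidates

def rerank(query_text, candidate_ids, slow_scorer):
    """Slow stage: apply the expensive cross-attention scorer only to
    the k candidates surfaced by the fast stage, then sort by score."""
    scored = [(i, slow_scorer(query_text, i)) for i in candidate_ids]
    return [i for i, _ in sorted(scored, key=lambda pair: -pair[1])]

# Toy usage: 1,000 images with 64-d unit-norm embeddings.
rng = np.random.default_rng(0)
imgs = rng.normal(size=(1000, 64))
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
query = rng.normal(size=64)
query /= np.linalg.norm(query)

candidates = fast_retrieve(query, imgs, k=10)
# A real slow_scorer would run a vision-text transformer; here a dummy
# callable stands in so the pipeline runs end to end.
final_ranking = rerank("a dog on a beach", candidates,
                       lambda text, i: float(imgs[i] @ query))
```

Because the cross-attention model sees only k candidates rather than the full collection, its cost is amortised and the end-to-end latency stays close to that of the dual encoder alone.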