Vision-language transformers (VL transformers) have shown impressive accuracy in cross-modal retrieval. However, most existing VL transformers use an early-interaction dataflow that computes a joint representation for each text-image input. In the retrieval stage, such models must run inference on every candidate text-image combination, which incurs a high computational cost. The goal of this paper is to decompose the early-interaction dataflow inside the pre-trained VL transformer to achieve acceleration while maintaining its outstanding accuracy. To this end, we propose a novel Vision-Language Transformer Decomposing (VLDeformer) approach that converts the VL transformer into separate encoders for individual images and texts through contrastive learning, accelerating retrieval by thousands of times. Meanwhile, we propose composing bi-modal hard negatives for the contrastive learning objective, which enables VLDeformer to retain the outstanding accuracy of the backbone VL transformer. Extensive experiments on the COCO and Flickr30k datasets demonstrate the superior performance of the proposed method. Considering both effectiveness and efficiency, VLDeformer offers a superior choice for cross-modal retrieval at a similar pre-training data scale.
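The sketch below is a minimal illustration, not the paper's implementation, of the two ideas the abstract describes: an in-batch contrastive objective over separately encoded text and image embeddings (hard negatives would enter as additional rows/columns of the same similarity matrix), and the retrieval speed-up that decomposition enables, since gallery image embeddings can be precomputed and each query reduces to one text-encoder pass plus a matrix multiplication. Function names, the temperature value, and the normalization scheme are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, image_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE-style) loss over L2-normalized embeddings.

    Matched text-image pairs lie on the diagonal of the similarity matrix;
    all other pairs in the batch act as negatives. Mined bi-modal hard
    negatives would simply be appended as extra candidates. (Illustrative
    sketch; not the paper's exact objective.)
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: text-to-image and image-to-text retrieval directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def retrieve(query_emb, gallery_emb, k=5):
    """Top-k retrieval with decomposed encoders.

    Gallery embeddings are precomputed once, so each query costs a single
    encoder pass plus one matrix multiplication, instead of one joint
    forward pass per text-image combination.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    gallery_emb = F.normalize(gallery_emb, dim=-1)
    scores = query_emb @ gallery_emb.t()
    return scores.topk(k, dim=-1).indices
```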