Vision-and-language pre-training aims to learn visual and linguistic representations jointly so that they can be transferred to visual-linguistic downstream tasks. However, semantic confusion between language and vision arises during the pre-training stage. Moreover, current pre-trained models tend to consume substantial computational resources when fine-tuned on downstream tasks. In this work, we present a simple but effective approach for learning Contrastive and Adaptive representations of Vision and Language, namely CAVL. Specifically, we introduce a pair-wise contrastive loss that learns alignments between each whole sentence and each image within the same batch during pre-training. At the fine-tuning stage, we introduce two lightweight adaptation networks that reduce the number of trainable parameters and speed up training, saving computational resources. We evaluate CAVL on six major downstream tasks: Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), Natural Language for Visual Reasoning (NLVR), Region-to-Phrase Grounding (RPG), Text-to-Image Retrieval (TIR), and Zero-shot Text-to-Image Retrieval (ZS-TIR). Compared to baselines, we achieve superior performance and reduce fine-tuning time by a large margin (in particular, by 76.17%). Extensive experiments and ablation studies demonstrate the efficiency of the contrastive pre-training and adaptive fine-tuning proposed in CAVL.
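To make the pair-wise contrastive objective concrete, the sketch below shows a symmetric InfoNCE-style loss over a batch of image and sentence embeddings, where matched pairs on the diagonal are positives and all other in-batch pairs are negatives. This is a minimal illustration under common assumptions (L2-normalized embeddings, a fixed temperature); the abstract does not give CAVL's exact formulation, so treat the details as hypothetical.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-sentence contrastive loss over a batch.

    Matched (image, sentence) pairs sit on the diagonal of the
    similarity matrix and serve as positives; every other pair in the
    batch is a negative. Illustrative only: CAVL's exact loss may differ.
    """
    # Normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarities between every image and every sentence.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Align images to sentences and sentences to images symmetrically.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```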
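The abstract also does not specify the architecture of the two lightweight adaptation networks; a standard instantiation of the idea is a bottleneck adapter with a residual connection, trained while the pre-trained backbone stays frozen. The sketch below illustrates that pattern; the module name and bottleneck width are assumptions for illustration, not CAVL's design.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight adaptation module for a frozen pre-trained backbone.

    Only the small down/up projections are trained at fine-tuning time,
    which cuts trainable parameters and speeds up training. Hypothetical
    sketch: the adapters in CAVL may be structured differently.
    """

    def __init__(self, hidden_dim, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x):
        # Residual connection preserves the frozen backbone's features.
        return x + self.up(self.act(self.down(x)))
```

In such a setup, the backbone's parameters are frozen (`requires_grad = False`) and only the adapters are updated, which is what yields the reduction in fine-tuning cost that the abstract reports.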