Vision (image and video)-Language (VL) pre-training is a recent, popular paradigm that has achieved state-of-the-art results on multi-modal tasks such as image retrieval, video retrieval, and visual question answering. These models are trained in an unsupervised way and benefit greatly from the complementary supervision provided by the other modality. In this paper, we explore whether language representations trained with vision supervision perform better than vanilla language representations on Natural Language Understanding and commonsense reasoning benchmarks. We experiment with a diverse set of image-text models, such as ALBEF, BLIP, and METER, and video-text models, such as ALPRO, Frozen-in-Time (FiT), and VIOLET. We compare the language representations of the stand-alone text encoders of these models to the language representations of the text encoders learnt through vision supervision. Our experiments suggest that vanilla language representations show superior performance on most of the tasks. These results shed light on the current drawbacks of vision-language models.
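To make the comparison concrete, the sketch below shows one plausible way to probe the two kinds of language representations; it is not the paper's exact protocol. It assumes the vanilla encoder is a BERT-style model loadable with Hugging Face transformers (here "bert-base-uncased" is used as a stand-in), that the VL model's text tower has been exported to a hypothetical local checkpoint ("path/to/albef_text_encoder"), and that mean pooling plus a linear probe is an acceptable proxy for downstream NLU evaluation.

```python
# Minimal sketch: extract sentence representations from a vanilla text encoder
# and from a (hypothetical) vision-supervised text encoder, for linear probing
# on an NLU or commonsense reasoning benchmark.
import torch
from transformers import AutoTokenizer, AutoModel

def sentence_embeddings(model_name_or_path, sentences):
    """Mean-pool the last hidden states of a text encoder into sentence vectors."""
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModel.from_pretrained(model_name_or_path)
    model.eval()
    with torch.no_grad():
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**batch).last_hidden_state             # (B, T, D)
        mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
        return (hidden * mask).sum(1) / mask.sum(1)           # mean over real tokens

sentences = ["A man is playing a guitar.", "Someone is cooking pasta."]

# Vanilla language representations (many VL text encoders are initialized from BERT).
vanilla_emb = sentence_embeddings("bert-base-uncased", sentences)

# Vision-supervised language representations: the path below is a hypothetical
# checkpoint exported from a VL model's text encoder (e.g. ALBEF, BLIP, ALPRO).
# vl_emb = sentence_embeddings("path/to/albef_text_encoder", sentences)

# Fitting the same lightweight classifier (e.g. scikit-learn LogisticRegression)
# on each set of embeddings then allows a like-for-like comparison of the two
# representations on the target benchmark.
```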