Vision (image and video)-Language (VL) pre-training is a recently popular paradigm that has achieved state-of-the-art results on multi-modal tasks such as image retrieval, video retrieval, and visual question answering. These models are trained in an unsupervised way and benefit greatly from the complementary supervision of the other modality. In this paper, we explore whether language representations trained with vision supervision perform better than vanilla language representations on Natural Language Understanding and commonsense reasoning benchmarks. We experiment with a diverse set of image-text models, such as ALBEF, BLIP, and METER, and video-text models, such as ALPRO, Frozen-in-Time (FiT), and VIOLET. We compare the language representations of the stand-alone text encoders of these models against the language representations of the text encoders learnt through vision supervision. Our experiments suggest that vanilla language representations show superior performance on most of the tasks. These results shed light on the current drawbacks of vision-language models.
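As a rough illustration of this kind of comparison (a minimal sketch, not the paper's actual evaluation pipeline), the snippet below probes sentence representations from a vanilla BERT text encoder and from a vision-supervised text encoder with a linear classifier. The checkpoint path `path/to/vl_text_encoder`, the placeholder data, and the linear-probe setup are all assumptions made for illustration only.

```python
# Sketch: linear-probe comparison of text-encoder representations.
# Assumes a BERT-style text encoder; the VL checkpoint path is hypothetical.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

def embed(model_name_or_path, sentences, device="cpu"):
    """Return [CLS] embeddings from a BERT-style text encoder."""
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModel.from_pretrained(model_name_or_path).to(device).eval()
    with torch.no_grad():
        batch = tokenizer(sentences, padding=True, truncation=True,
                          return_tensors="pt").to(device)
        out = model(**batch)
    return out.last_hidden_state[:, 0, :].cpu().numpy()  # [CLS] token

# Placeholder probing data (stand-in for an NLU / commonsense benchmark).
train_sents, train_labels = ["a positive example", "a negative example"], [1, 0]
test_sents, test_labels = ["another positive one", "another negative one"], [1, 0]

for name in ["bert-base-uncased",          # vanilla language representations
             "path/to/vl_text_encoder"]:   # hypothetical vision-supervised encoder
    clf = LogisticRegression(max_iter=1000).fit(embed(name, train_sents), train_labels)
    acc = clf.score(embed(name, test_sents), test_labels)
    print(f"{name}: linear-probe accuracy = {acc:.3f}")
```

A frozen-encoder linear probe like this isolates the quality of the pre-trained representations themselves, rather than the models' fine-tuning ability.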