Linguistic representations derived from text alone have been criticized for their lack of grounding, i.e., connecting words to their meanings in the physical world. Vision-and-Language (VL) models, trained jointly on text and image or video data, have been offered as a response to such criticisms. However, while VL pretraining has shown success on multimodal tasks such as visual question answering, it is not yet known how the internal linguistic representations themselves compare to their text-only counterparts. This paper compares the semantic representations learned via VL vs. text-only pretraining for two recent VL models using a suite of analyses (clustering, probing, and performance on a commonsense question answering task) in a language-only setting. We find that the multimodal models fail to significantly outperform the text-only variants, suggesting that future work is required if multimodal pretraining is to be pursued as a means of improving NLP in general.
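As a rough, hypothetical illustration of the kind of probing analysis mentioned above (not the authors' actual setup), the sketch below fits a simple linear probe on frozen sentence representations and compares held-out accuracy for two encoders. The embeddings here are synthetic placeholders standing in for features extracted from a text-only vs. a VL-pretrained model.

```python
# Minimal probing sketch (illustrative only; not the paper's exact protocol).
# Assumes frozen sentence embeddings from a text-only and a VL-pretrained
# encoder have already been extracted; random arrays stand in for them here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def probe_accuracy(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on frozen embeddings and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# Synthetic stand-ins: 1000 sentences, 768-dim embeddings, one binary property.
labels = rng.integers(0, 2, size=1000)
text_only_emb = rng.normal(size=(1000, 768))   # placeholder for text-only encoder features
multimodal_emb = rng.normal(size=(1000, 768))  # placeholder for VL-pretrained encoder features

print("text-only probe accuracy: ", probe_accuracy(text_only_emb, labels))
print("multimodal probe accuracy:", probe_accuracy(multimodal_emb, labels))
```

Comparing probe accuracies on the same labeled property is one common way to ask whether one set of frozen representations encodes linguistic information more accessibly than another, which is the spirit of the probing comparison the abstract refers to.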