The success of large-scale contextual language models has attracted great interest in probing what is encoded in their representations. In this work, we consider a new question: to what extent are contextual representations of concrete nouns aligned with corresponding visual representations? We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Moreover, they are effective in retrieving specific instances of image patches; textual context plays an important role in this process. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans. We hope our analyses inspire future research in understanding and improving the visual capabilities of language models.
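To make the probing setup concrete, here is a minimal sketch of a probe that scores matching against non-matching image patches given a frozen contextual text embedding. This is an illustration under assumed conventions, not the paper's actual implementation: the class name `VisualProbe`, the embedding dimensions, and the choice of a single linear projection with a contrastive cross-entropy objective are all hypothetical.

```python
import torch
import torch.nn as nn

class VisualProbe(nn.Module):
    """Projects a frozen text representation into a visual embedding space
    so that matching image patches score higher than non-matching ones."""

    def __init__(self, text_dim: int = 768, visual_dim: int = 2048):
        super().__init__()
        # A lightweight projection keeps the burden on the language
        # representation itself rather than on the probe's capacity.
        self.proj = nn.Linear(text_dim, visual_dim)

    def forward(self, text_emb: torch.Tensor, patch_embs: torch.Tensor) -> torch.Tensor:
        # text_emb:   (batch, text_dim) contextual embedding of a concrete noun
        # patch_embs: (batch, n_patches, visual_dim) candidate patch features,
        #             with index 0 as the matching patch, the rest distractors
        query = self.proj(text_emb)                         # (batch, visual_dim)
        scores = torch.einsum("bd,bnd->bn", query, patch_embs)
        return scores                                       # higher = better match

# Contrastive training step: the matching patch (index 0) should out-score
# the distractors. The random tensors stand in for real extracted features.
probe = VisualProbe()
loss_fn = nn.CrossEntropyLoss()
text_emb = torch.randn(4, 768)          # stand-in for contextual noun embeddings
patch_embs = torch.randn(4, 16, 2048)   # stand-in for image patch features
scores = probe(text_emb, patch_embs)
loss = loss_fn(scores, torch.zeros(4, dtype=torch.long))  # target: index 0
loss.backward()
```

Keeping the probe this small is a common design choice in probing work: if a single linear map suffices to retrieve the correct patches, the alignment must already be present in the text-only representations rather than learned by the probe.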