Large-scale pretrained language models (LMs) are said to ``lack the ability to connect [their] utterances to the world'' (Bender and Koller, 2020). If so, we would expect LM representations to be unrelated to representations in computer vision models. To investigate this, we present an empirical evaluation across three different LMs (BERT, GPT2, and OPT) and three computer vision models (VMs, including ResNet, SegFormer, and MAE). Our experiments show that LMs converge towards representations that are partially isomorphic to those of VMs, with dispersion and polysemy both factoring into the alignability of vision and language spaces. We discuss the implications of this finding.
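One common way to probe whether two representation spaces are partially isomorphic is to fit a linear (e.g., orthogonal Procrustes) map from one space onto the other and inspect the residual error. The sketch below illustrates this idea on synthetic data; it is not the paper's actual evaluation protocol, and the toy matrices stand in for real LM and VM embeddings, which are assumptions here.

```python
# Hedged sketch (not the paper's method): testing how well one representation
# space maps linearly onto another via orthogonal Procrustes alignment,
# one standard proxy for partial isomorphism between embedding spaces.
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for concept representations from a language model and a
# vision model (n concepts, d dimensions). A real experiment would use
# pooled BERT/GPT2/OPT embeddings and ResNet/SegFormer/MAE features.
n, d = 100, 64
lm_reps = rng.standard_normal((n, d))
# Construct the "VM" space as a rotated, noisy copy of the LM space,
# so the two spaces share structure by design.
rotation, _ = np.linalg.qr(rng.standard_normal((d, d)))
vm_reps = lm_reps @ rotation + 0.1 * rng.standard_normal((n, d))

def procrustes_fit(X, Y):
    """Best orthogonal map W minimizing ||XW - Y||_F (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

W = procrustes_fit(lm_reps, vm_reps)
residual = np.linalg.norm(lm_reps @ W - vm_reps) / np.linalg.norm(vm_reps)
print(f"relative alignment error: {residual:.3f}")  # low error suggests near-isomorphic spaces
```

In this toy setup the residual is small because the spaces were built to be alignable; for genuinely unrelated spaces, the residual stays close to 1 even after the best orthogonal fit.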