Language models trained on billions of tokens have recently led to unprecedented results on many NLP tasks. This success raises the question of whether, in principle, a system can ever "understand" raw text without access to some form of grounding. We formally investigate the abilities of ungrounded systems to acquire meaning. Our analysis focuses on the role of "assertions": contexts within raw text that provide indirect clues about underlying semantics. We study whether assertions enable a system to emulate representations preserving semantic relations like equivalence. We find that assertions enable semantic emulation if all expressions in the language are referentially transparent. However, if the language uses non-transparent patterns like variable binding, we show that emulation can become an uncomputable problem. Finally, we discuss differences between our formal model and natural language, exploring how our results generalize to a modal setting and other semantic relations. Together, our results suggest that assertions in code or language do not provide sufficient signal to fully emulate semantic representations. We formalize ways in which ungrounded language models appear to be fundamentally limited in their ability to "understand".
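To make the notion of an "assertion" context concrete, here is a minimal, hypothetical sketch (not the paper's formal model). It assumes a toy corpus of Python-style assert statements over arithmetic expression strings; the names `corpus` and `observed_equalities` are illustrative only. The point is that an ungrounded learner sees only the surface strings, yet the assertion pattern still provides indirect evidence about which expressions are semantically equivalent.

```python
# Toy illustration of "assertion" contexts as indirect semantic signal.
# An ungrounded system observes only these strings; it never evaluates them.
corpus = [
    "assert (1 + 2) == 3",
    "assert (2 + 1) == 3",
    "assert (0 + 3) == 3",
]

def observed_equalities(lines):
    """Collect pairs of expression strings that the corpus asserts are equal."""
    pairs = []
    for line in lines:
        body = line.removeprefix("assert ")      # drop the assertion keyword
        left, right = body.split(" == ")         # split into the two compared expressions
        pairs.append((left.strip(), right.strip()))
    return pairs

# From these observations alone, "(1 + 2)", "(2 + 1)", and "(0 + 3)" can be
# grouped into an equivalence class with "3" -- indirect clues of the kind the
# abstract calls "assertions". Whether such signal suffices to fully emulate
# the semantic relation is exactly the question the paper formalizes.
print(observed_equalities(corpus))
```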