Recent advances in large language models (LLMs) have transformed the field of natural language processing (NLP). From GPT-3 to PaLM, the state of the art on natural language tasks is pushed forward with every new large language model. Alongside these natural language abilities, there has been significant interest in understanding whether such models, trained on enormous amounts of data, also exhibit reasoning capabilities. Hence, there has been interest in developing benchmarks for various reasoning tasks, and the preliminary results from testing LLMs on such benchmarks seem mostly positive. However, the current benchmarks are relatively simplistic, and performance on them cannot be used as evidence to support the, oftentimes outlandish, claims being made about LLMs' reasoning capabilities. As of right now, these benchmarks represent only a very limited set of simple reasoning tasks, and we need to look at more sophisticated reasoning problems if we are to measure the true limits of such LLM-based systems. With this motivation, we propose an extensible assessment framework to test the abilities of LLMs on a central aspect of human intelligence: reasoning about actions and change. We provide multiple test cases that are more involved than any of the previously established reasoning benchmarks, and each test case evaluates a different aspect of reasoning about actions and change. Initial evaluation results on the base version of GPT-3 (Davinci) showcase subpar performance on these benchmarks.