Recent advances in large language models (LLMs) have transformed the field of natural language processing (NLP). From GPT-3 to PaLM, state-of-the-art performance on natural language tasks is being pushed forward with every new large language model. Alongside these natural language abilities, there has been significant interest in understanding whether such models exhibit reasoning capabilities, typically assessed with reasoning benchmarks. However, even though results appear positive, these benchmarks are simplistic in nature, and performance on them cannot be used as evidence to support the often outlandish claims made about LLMs' reasoning capabilities. Further, they cover only a very limited set of simple reasoning tasks; we need to look at more sophisticated reasoning problems if we are to measure the true limits of such LLM-based systems. Motivated by this, we propose an extensible assessment framework to test the capabilities of LLMs on reasoning about actions and change, a central aspect of human intelligence. We provide multiple test cases that are more involved than any of the previously established benchmarks, and each test case evaluates a different aspect of reasoning about actions and change. Results on GPT-3 (davinci), Instruct-GPT3 (text-davinci-002), and BLOOM (176B) showcase subpar performance on such reasoning tasks.