Recent advances in large language models (LLMs) have transformed the field of natural language processing (NLP). From GPT-3 to PaLM, each new large language model has pushed state-of-the-art performance on natural language tasks further forward. Alongside these natural language abilities, there has been significant interest in whether such models exhibit reasoning capabilities, typically assessed with reasoning benchmarks. However, even though the results appear positive, these benchmarks are simplistic in nature, and LLM performance on them cannot be used as evidence for the often outlandish claims made about LLMs' reasoning capabilities. Further, these benchmarks represent only a very limited set of simple reasoning tasks, and we need to look at more sophisticated reasoning problems if we are to measure the true limits of such LLM-based systems. Motivated by this, we propose an extensible assessment framework to test the capabilities of LLMs on reasoning about actions and change, a central aspect of human intelligence. We provide multiple test cases that are more involved than any of the previously established benchmarks, each evaluating a different aspect of reasoning about actions and change. Results on GPT-3 (davinci), Instruct-GPT3 (text-davinci-002), and BLOOM (176B) showcase subpar performance on such reasoning tasks.