Large language models (LLMs) have demonstrated strong performance in zero-shot reasoning tasks, including abductive reasoning, as reflected by their high scores on current benchmarks in this area. To truly test the limits of LLMs' abductive reasoning, however, a more challenging benchmark is needed. In this paper, we present such a benchmark, consisting of 191 long-form mystery stories, each approximately 1,200 words long and presented as a detective puzzle. Each puzzle is sourced from the "5 Minute Mystery" platform and includes a multiple-choice question for evaluation. Our results show that state-of-the-art GPT models perform significantly worse than human solvers on this benchmark, reaching 28\% accuracy compared to 47\% for humans. This indicates that a significant gap remains in the abductive reasoning abilities of LLMs and highlights the need for further research in this area. Our work provides a challenging benchmark for future studies on reasoning in language models and contributes to a better understanding of the limits of LLMs' abilities.