Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper we set out to investigate their planning capabilities. We aim to evaluate (1) how good LLMs are by themselves at generating and validating simple plans in commonsense planning tasks (of the type that humans are generally quite good at) and (2) how good LLMs are as a source of heuristic guidance for other agents--either AI planners or human planners--in their planning tasks. To investigate these questions in a systematic rather than anecdotal manner, we start by developing a benchmark suite based on the kinds of domains employed in the International Planning Competition. On this benchmark, we evaluate LLMs in three modes: autonomous, heuristic, and human-in-the-loop. Our results show that LLMs' ability to autonomously generate executable plans is quite meager, averaging only about a 3% success rate. The heuristic and human-in-the-loop modes show slightly more promise. In addition to these results, we also make our benchmark and evaluation tools available to support investigations by the research community.