Many real-world problems require the combined application of multiple reasoning abilities, employing suitable abstractions, commonsense knowledge, and creative synthesis of problem-solving strategies. To help advance AI systems towards such capabilities, we propose a new reasoning challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible. For example, "How much would the sea level rise if all ice in the world melted?" FPs are commonly used in quizzes and interviews to bring out and evaluate the creative reasoning abilities of humans. To do the same for AI systems, we present two datasets: 1) a collection of 1k real-world FPs sourced from quizzes and olympiads; and 2) a bank of 10k synthetic FPs of intermediate complexity to serve as a sandbox for the harder real-world challenge. In addition to question-answer pairs, the datasets contain detailed solutions in the form of an executable program and supporting facts, enabling supervision and evaluation of intermediate steps. We demonstrate that even extensively fine-tuned large-scale language models perform poorly on these datasets, on average making estimates that are off by two orders of magnitude. Our contribution is thus the crystallization of several unsolved AI problems into a single, new challenge that we hope will spur further advances in building systems that can reason.
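To make the "solution as an executable program" idea concrete, here is a hypothetical sketch of what a program-form solution to the sea-level example might look like. The constants and function names are illustrative assumptions, not the datasets' actual annotation format; the rough physical quantities (global land-ice volume, ocean surface area, ice-to-water density ratio) are standard ballpark figures used in Fermi estimation.

```python
# Hypothetical Fermi-style program solution (illustrative; not the datasets' format).
# Question: how much would the sea level rise if all ice in the world melted?

ICE_VOLUME_KM3 = 30e6    # rough global land-ice volume (~30 million km^3)
ICE_TO_WATER = 0.9       # ice is ~90% as dense as liquid water
OCEAN_AREA_KM2 = 360e6   # ocean surface area (~360 million km^2)

def sea_level_rise_m():
    """Estimate the rise in meters: spread the melted water over the oceans."""
    water_volume_km3 = ICE_VOLUME_KM3 * ICE_TO_WATER
    rise_km = water_volume_km3 / OCEAN_AREA_KM2
    return rise_km * 1000  # km -> m

print(round(sea_level_rise_m()))  # -> 75, i.e. tens of meters
```

The estimate (~75 m) agrees in order of magnitude with the commonly cited figure of roughly 70 m, which is the level of precision Fermi problems target: each supporting fact is a rough constant, and the program chains them into an answer whose intermediate steps can be checked individually.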