Reasoning about the behaviour of physical objects is a key capability of agents operating in physical worlds. Humans are highly experienced in physical reasoning, while it remains a major challenge for AI. To facilitate research addressing this problem, several benchmarks have been proposed recently. However, these benchmarks do not enable us to measure an agent's granular physical reasoning capabilities when solving a complex reasoning task. In this paper, we propose a new benchmark for physical reasoning that allows us to test individual physical reasoning capabilities. Inspired by how humans acquire these capabilities, we propose a general hierarchy of physical reasoning capabilities of increasing complexity. Our benchmark tests capabilities according to this hierarchy through generated physical reasoning tasks in the video game Angry Birds. This benchmark enables a comprehensive agent evaluation by measuring an agent's granular physical reasoning capabilities. We conduct an evaluation with human players, learning agents, and heuristic agents and determine their capabilities. Our evaluation shows that learning agents, despite good local generalization ability, still struggle to learn the underlying physical reasoning capabilities and perform worse than current state-of-the-art heuristic agents and humans. We believe that this benchmark will encourage researchers to develop intelligent agents with advanced, human-like physical reasoning capabilities. URL: https://github.com/Cheng-Xue/Hi-Phy