Humans are adept at reasoning about the behavior of physical objects and choosing actions accordingly to accomplish tasks, while this remains a major challenge for AI. To facilitate research on this problem, we propose a new testbed that requires an agent to reason about physical scenarios and act appropriately. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, we identify 15 essential physical scenarios. We create a wide variety of distinct task templates, and we ensure that all task templates within the same scenario can be solved by applying one specific strategic physical rule. This design allows us to evaluate two distinct levels of generalization: local generalization and broad generalization. We conduct an extensive evaluation with human players, learning agents with varying input types and architectures, and heuristic agents with different strategies. Inspired by how human IQ is calculated, we define the physical reasoning quotient (Phy-Q score), which reflects the physical reasoning intelligence of an agent on the physical scenarios we consider. Our evaluation shows that 1) all agents fall far below human performance, and 2) learning agents, even with good local generalization ability, struggle to learn the underlying physical reasoning rules and fail to generalize broadly. We encourage the development of intelligent agents that can reach a human-level Phy-Q score. Website: https://github.com/phy-q/benchmark