Humans are well versed in reasoning about the behaviors of physical objects and in choosing actions accordingly to accomplish tasks, while this remains a major challenge for AI. To facilitate research addressing this problem, we propose a new testbed that requires an agent to reason about physical scenarios and take appropriate actions. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, we identify 15 essential physical scenarios. For each scenario, we create a wide variety of distinct task templates, and we ensure that all task templates within the same scenario can be solved using one specific strategic physical rule. This design allows us to evaluate two distinct levels of generalization: local generalization and broad generalization. We conduct an extensive evaluation with human players, learning agents with varying input types and architectures, and heuristic agents with different strategies. Inspired by how human IQ is calculated, we define the physical reasoning quotient (Phy-Q score), which reflects the physical reasoning intelligence of an agent. Our evaluation shows that 1) all agents are far below human performance, and 2) learning agents, even with good local generalization ability, struggle to learn the underlying physical reasoning rules and fail to generalize broadly. We encourage the development of intelligent agents that can reach a human-level Phy-Q score. Website: https://github.com/phy-q/benchmark