ComPhy: Compositional Physical Reasoning of Objects and Events from Videos

Objects' motions in nature are governed by complex interactions and their properties. While some properties, such as shape and material, can be identified from an object's visual appearance, others, such as mass and electric charge, are not directly visible. The compositionality between visible and hidden properties poses unique challenges for AI models reasoning about the physical world, whereas humans can effortlessly infer hidden properties from limited observations. Existing studies on video reasoning mainly focus on visually observable elements such as object appearance, movement, and contact interaction. In this paper, we take an initial step to highlight the importance of inferring hidden physical properties that are not directly observable from visual appearance, by introducing the Compositional Physical Reasoning (ComPhy) dataset. For a given set of objects, ComPhy includes a few videos of them moving and interacting under different initial conditions. The model is evaluated on its capability to unravel the compositional hidden properties, such as mass and charge, and to use this knowledge to answer a set of questions posed about one of the videos. Evaluation results of several state-of-the-art video reasoning models on ComPhy show unsatisfactory performance, as they fail to capture these hidden properties. We further propose an oracle neural-symbolic framework named Compositional Physics Learner (CPL), which combines visual perception, physical property learning, dynamic prediction, and symbolic execution into a unified framework. CPL can effectively identify objects' physical properties from their interactions and predict their dynamics to answer questions.
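To make the idea of inferring hidden properties from interactions concrete, the following is a minimal sketch and not the CPL method itself. It assumes idealized 2D trajectories with known velocities and accelerations. The function names `mass_ratio_from_collision` and `charge_relation` are hypothetical and introduced only for illustration. The first applies momentum conservation to a single two-body collision; the second classifies a pair of objects as attracting, repelling, or neutral from the direction of one object's acceleration.

```python
import math


def mass_ratio_from_collision(v1_before, v1_after, v2_before, v2_after):
    """Estimate m1 / m2 from one isolated two-body collision (hypothetical helper).

    Momentum conservation gives m1 * dv1 = -m2 * dv2,
    so m1 / m2 = |dv2| / |dv1|.
    """
    dv1 = math.dist(v1_after, v1_before)  # magnitude of object 1's velocity change
    dv2 = math.dist(v2_after, v2_before)  # magnitude of object 2's velocity change
    if dv1 == 0:
        raise ValueError("object 1 did not change velocity; ratio is undefined")
    return dv2 / dv1


def charge_relation(p1, p2, a1, eps=1e-6):
    """Classify a pair from object 1's acceleration (hypothetical helper).

    Projects a1 onto the unit vector pointing from object 1 to object 2.
    A positive projection means object 1 is pulled toward object 2.
    Assumes no other forces act on object 1.
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("objects share the same position")
    along = (a1[0] * dx + a1[1] * dy) / norm
    if along > eps:
        return "attract"  # consistent with opposite charges
    if along < -eps:
        return "repel"    # consistent with like charges
    return "neutral"      # no measurable interaction along the line


# Illustrative numbers: object 1 slows from 3 to 1 while object 2 speeds up from 0 to 4.
ratio = mass_ratio_from_collision((3, 0), (1, 0), (0, 0), (4, 0))
relation = charge_relation((0, 0), (1, 0), (0.5, 0))
```

In this sketch a single collision yields only a relative mass and a single pairwise observation yields only a relative charge sign, which is consistent with the abstract's point that several videos of the same objects under different initial conditions are needed to pin down the hidden properties.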
