Videos often capture objects, their visible properties, their motion, and the interactions between different objects. Objects also have physical properties such as mass, which the imaging pipeline cannot capture directly. However, these properties can be estimated using cues from relative object motion and the dynamics introduced by collisions. In this paper, we introduce CRIPP-VQA, a new video question answering dataset for reasoning about the implicit physical properties of objects in a scene. CRIPP-VQA contains videos of objects in motion, annotated with questions that involve counterfactual reasoning about the effect of actions, questions about planning in order to reach a goal, and descriptive questions about visible properties of objects. The CRIPP-VQA test set enables evaluation under several out-of-distribution settings: videos containing objects whose masses, coefficients of friction, and initial velocities are not observed in the training distribution. Our experiments reveal a surprising and significant performance gap between answering questions about implicit properties of objects (the focus of this paper) and answering questions about explicit properties (the focus of prior work).