While machine learning algorithms excel at many challenging visual tasks, it is unclear whether they can make predictions about commonplace real-world physical events. Here, we present a visual and physical prediction benchmark that precisely measures this capability. By realistically simulating a wide variety of physical phenomena -- rigid and soft-body collisions, stable multi-object configurations, rolling and sliding, projectile motion -- our dataset presents a more comprehensive challenge than existing benchmarks. Moreover, we have collected human responses for our stimuli so that model predictions can be directly compared to human judgments. We compare an array of algorithms -- varying in their architecture, learning objective, input-output structure, and training data -- on their ability to make diverse physical predictions. We find that graph neural networks with access to the physical state best capture human behavior, whereas among models that receive only visual input, those with object-centric representations or pretraining do best but fall far short of human accuracy. This suggests that extracting physically meaningful representations of scenes is the main bottleneck to achieving human-like visual prediction. We thus demonstrate how our benchmark can identify areas for improvement and measure progress on this key aspect of physical understanding.