While current vision algorithms excel at many challenging tasks, it is unclear how well they understand the physical dynamics of real-world environments. Here we introduce Physion, a dataset and benchmark for rigorously evaluating the ability to predict how physical scenarios will evolve over time. Our dataset features realistic simulations of a wide range of physical phenomena, including rigid and soft-body collisions, stable multi-object configurations, rolling, sliding, and projectile motion, thus providing a more comprehensive challenge than previous benchmarks. We used Physion to benchmark a suite of models varying in their architecture, learning objective, input-output structure, and training data. In parallel, we obtained precise measurements of human prediction behavior on the same set of scenarios, allowing us to directly evaluate how well any model could approximate human behavior. We found that vision algorithms that learn object-centric representations generally outperform those that do not, yet still fall far short of human performance. On the other hand, graph neural networks with direct access to physical state information both perform substantially better and make predictions that are more similar to those made by humans. These results suggest that extracting physical representations of scenes is the main bottleneck to achieving human-level and human-like physical understanding in vision algorithms. We have publicly released all data and code to facilitate the use of Physion to benchmark additional models in a fully reproducible manner, enabling systematic evaluation of progress towards vision algorithms that understand physical environments as robustly as people do.