In order to reach human performance on complex visual tasks, artificial systems need to incorporate a significant amount of understanding of the world in terms of macroscopic objects, movements, forces, etc. Inspired by work on intuitive physics in infants, we propose an evaluation benchmark which diagnoses how much a given system understands about physics by testing whether it can tell apart well-matched videos of possible versus impossible events constructed with a game engine. The test requires systems to compute a physical plausibility score over an entire video. It is free of bias and can test a range of basic physical reasoning concepts. We then describe two deep neural network systems aimed at learning intuitive physics in an unsupervised way, using only physically possible videos. The systems are trained with a future semantic mask prediction objective and tested on the possible versus impossible discrimination task. The analysis of their results compared to human data gives novel insights into the potential and limitations of next-frame prediction architectures.
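To make the evaluation protocol concrete, the following is a minimal sketch of how a plausibility score over an entire video could be derived from a next-frame mask predictor and then used to compare a matched possible/impossible pair. Everything here is an illustrative assumption rather than the method of the paper: the names `mask_error`, `plausibility`, `pair_accuracy`, and `identity_predictor` are hypothetical, masks are toy nested lists of class ids, the per-frame error is a simple pixel mismatch rate, and the aggregation over frames (worst-case error) is one possible choice among several.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: a semantic mask is a 2-D grid of class ids,
# and a video is a sequence of such masks.
Mask = List[List[int]]
Video = Sequence[Mask]
Predictor = Callable[[Mask], Mask]


def mask_error(predicted: Mask, observed: Mask) -> float:
    """Fraction of pixels whose predicted class differs from the observed class."""
    total = sum(len(row) for row in observed)
    wrong = sum(
        p != o
        for pred_row, obs_row in zip(predicted, observed)
        for p, o in zip(pred_row, obs_row)
    )
    return wrong / total


def plausibility(video: Video, predict_next: Predictor) -> float:
    """Score a whole video as the negative of its worst per-frame prediction error.

    Taking the maximum over frames is an assumed aggregation; a mean or
    other statistic would also yield a single score per video.
    """
    errors = [
        mask_error(predict_next(video[t]), video[t + 1])
        for t in range(len(video) - 1)
    ]
    return -max(errors)


def pair_accuracy(pairs: Sequence[Tuple[Video, Video]], predict_next: Predictor) -> float:
    """Share of (possible, impossible) pairs where the possible video scores higher."""
    correct = sum(
        plausibility(possible, predict_next) > plausibility(impossible, predict_next)
        for possible, impossible in pairs
    )
    return correct / len(pairs)


def identity_predictor(mask: Mask) -> Mask:
    """Toy stand-in for a trained model: predicts that nothing changes."""
    return mask


# Toy matched pair: a static object versus one that vanishes for a frame.
possible_video = [[[0, 1, 0]], [[0, 1, 0]], [[0, 1, 0]]]
impossible_video = [[[0, 1, 0]], [[0, 0, 0]], [[0, 1, 0]]]

accuracy = pair_accuracy([(possible_video, impossible_video)], identity_predictor)
```

In this toy pair the object in the impossible video vanishes for one frame, so the stand-in predictor incurs a larger error there and the possible video receives the higher score. A trained future-mask predictor would take the place of `identity_predictor`, and comparing scores within matched pairs keeps the evaluation from depending on an absolute score threshold.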