To reach human performance on complex visual tasks, artificial systems need to incorporate a substantial understanding of the world in terms of macroscopic objects, movements, and forces. Inspired by work on intuitive physics in infants, we propose an evaluation framework that diagnoses how much a given system understands about physics by testing whether it can tell apart well-matched videos of possible versus impossible events. The test requires systems to compute a physical plausibility score over an entire video. It is designed to be free of bias and can test a range of specific physical reasoning skills. We then describe the first release of a benchmark dataset aimed at learning intuitive physics in an unsupervised way, using videos constructed with a game engine. We describe two deep neural network baseline systems trained with a future-frame prediction objective and tested on the possible-versus-impossible discrimination task. The analysis of their results, compared to human data, gives novel insights into the potential and limitations of next-frame prediction architectures.
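As a rough illustration of the evaluation idea, the sketch below turns a future-frame predictor's errors into a per-video plausibility score and then checks how often the possible video in a matched pair scores higher than its impossible counterpart. The function names, the use of mean squared error, and the choice of the worst frame as the aggregate are assumptions made for illustration, not details taken from the text above.

```python
import numpy as np


def plausibility_score(predicted, observed):
    """Return one scalar plausibility score for a whole video.

    predicted, observed: arrays of shape (T, H, W) holding the frames
    predicted by a model and the frames actually observed.
    Assumption: the score is the negative of the worst per-frame mean
    squared error, so a single badly predicted frame lowers the score.
    """
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    squared_error = (predicted - observed) ** 2
    # Flatten each frame and average over its pixels: one error per frame.
    per_frame_error = squared_error.reshape(len(observed), -1).mean(axis=1)
    return -per_frame_error.max()


def pairwise_accuracy(possible_scores, impossible_scores):
    """Fraction of matched pairs in which the possible video scores higher.

    Both inputs are sequences of equal length; entry i of each belongs
    to the same matched pair (hypothetical pairing convention).
    """
    possible = np.asarray(possible_scores, dtype=float)
    impossible = np.asarray(impossible_scores, dtype=float)
    return float(np.mean(possible > impossible))
```

Scoring pairs rather than single videos means that only the relative ordering within a matched pair matters, so the score does not need to be calibrated across scenes.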