To reach human performance on complex tasks, artificial systems need to understand physical interactions between objects and predict the future outcomes of a situation. This ability, often referred to as intuitive physics, has recently received attention, and several methods have been proposed to learn these physical rules from video sequences. Yet most of these methods are restricted to cases where no, or only limited, occlusions occur. In this work we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions. In our formulation, object positions are modeled as latent variables, enabling the reconstruction of the scene. We then propose a series of approximations that make this problem tractable. Object proposals are linked across frames using a combination of a recurrent interaction network, which models the physics in object space, and a compositional renderer, which models how objects project onto pixel space. We demonstrate significant improvements over the state of the art on the IntPhys intuitive physics benchmark. We apply our method to a second dataset with increasing levels of occlusion, showing that it realistically predicts segmentation masks up to 30 frames into the future. Finally, we also show results on predicting the motion of objects in real videos.
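To make the role of occlusion concrete, the following is a minimal sketch of depth-ordered mask composition, the kind of object-space to pixel-space mapping a compositional renderer has to capture. It is an illustration under stated assumptions, not the renderer used in this work, which is learned: objects are assumed to be disks, and the names `disk_mask`, `compose`, and the `x`, `y`, `r`, `depth` fields are hypothetical.

```python
import numpy as np

def disk_mask(h, w, cx, cy, r):
    """Boolean mask of a disk centred at (cx, cy) with radius r (assumed shape)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return (xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2

def compose(h, w, objects):
    """Paint objects from farthest to nearest; 0 is background, i + 1 is object i."""
    labels = np.zeros((h, w), dtype=int)
    # Larger depth means farther from the camera, so it is painted first.
    order = sorted(range(len(objects)), key=lambda i: -objects[i]["depth"])
    for i in order:
        o = objects[i]
        labels[disk_mask(h, w, o["x"], o["y"], o["r"])] = i + 1
    return labels

# Hypothetical scene: two overlapping disks at different depths.
objects = [
    {"x": 10, "y": 10, "r": 5, "depth": 2.0},  # farther object
    {"x": 13, "y": 10, "r": 5, "depth": 1.0},  # nearer object, overlaps the first
]
labels = compose(24, 24, objects)
```

In the overlap region the nearer object overwrites the farther one, so part of the farther object's mask is missing from the image even though its position is unchanged. This is why positions are more naturally treated as latent variables than read directly off the observed masks.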