To interact with the world, agents must be able to predict the results of the world's dynamics. A natural approach to learning these dynamics is video prediction, as cameras are ubiquitous and powerful sensors. Direct pixel-to-pixel video prediction is difficult, does not exploit known priors, and does not provide an easy interface for utilizing the learned dynamics. Object-centric video prediction addresses these problems by exploiting the simple prior that the world is made of objects and by providing a more natural interface for control. However, existing object-centric video prediction pipelines require dense object annotations in the training video sequences. In this work, we present Object-centric Prediction without Annotation (OPA), an object-centric video prediction method that leverages priors from powerful computer vision models. We validate our method on a dataset of video sequences of falling stacked objects, and demonstrate how a perception model can be adapted to an environment through end-to-end video prediction training.