We propose a novel framework for object-centric video prediction, i.e., extracting the compositional structure of a video sequence and modeling object dynamics and interactions from visual observations in order to predict future object states, from which subsequent video frames can then be generated. With the goal of learning meaningful spatio-temporal object representations and accurately forecasting object states, we propose two novel object-centric video predictor (OCVP) transformer modules, which decouple the processing of temporal dynamics from the modeling of object interactions, thus improving prediction performance. In our experiments, we show that our object-centric prediction framework equipped with our OCVP predictors outperforms object-agnostic video prediction models on two different datasets, while maintaining consistent and accurate object representations.
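To make the decoupling idea concrete, the following is a minimal sketch of a predictor block that processes temporal dynamics and object interactions in two separate attention stages. It is an illustrative approximation only: the class name `OCVPBlock`, the slot dimensionality, the sequential ordering of the two stages, and the absence of a causal mask are all assumptions, not the authors' implementation.

```python
# Minimal sketch: decoupled temporal / relational attention over object slots.
import torch
import torch.nn as nn


class OCVPBlock(nn.Module):
    """Predictor block that separates temporal from relational attention (illustrative)."""

    def __init__(self, slot_dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Temporal attention: each object attends to its own past states.
        self.temporal_attn = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)
        # Relational attention: objects at the same time step attend to each other.
        self.relational_attn = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(slot_dim),
            nn.Linear(slot_dim, 4 * slot_dim),
            nn.GELU(),
            nn.Linear(4 * slot_dim, slot_dim),
        )

    def forward(self, slots: torch.Tensor) -> torch.Tensor:
        # slots: (batch, time, num_objects, slot_dim)
        b, t, n, d = slots.shape

        # Temporal stage: fold objects into the batch and attend over time.
        # (A causal mask would normally restrict attention to past states.)
        x = slots.permute(0, 2, 1, 3).reshape(b * n, t, d)
        x = x + self.temporal_attn(x, x, x, need_weights=False)[0]
        x = x.reshape(b, n, t, d).permute(0, 2, 1, 3)

        # Relational stage: fold time into the batch and attend over objects.
        y = x.reshape(b * t, n, d)
        y = y + self.relational_attn(y, y, y, need_weights=False)[0]
        y = y.reshape(b, t, n, d)

        return y + self.mlp(y)


if __name__ == "__main__":
    block = OCVPBlock(slot_dim=128)
    slots = torch.randn(2, 5, 6, 128)  # (batch, time, objects, dim)
    print(block(slots).shape)          # torch.Size([2, 5, 6, 128])
```

In this sketch, the temporal stage treats each object slot as an independent sequence, while the relational stage lets slots within a frame exchange information; stacking several such blocks would give a decoupled transformer predictor, in contrast to a vanilla transformer that attends jointly over all slots at all time steps.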