Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively remains a challenge. We address this problem by introducing SlotFormer -- a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, the dynamics model learned by SlotFormer without supervision can be used to improve performance on supervised downstream tasks such as Visual Question Answering (VQA) and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics while retaining high-quality visual generation. Furthermore, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show that SlotFormer can serve as a world model for model-based planning, performing competitively with methods designed specifically for such tasks.
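To make the described design concrete, below is a minimal PyTorch sketch of an autoregressive Transformer rolling out over slot representations. This is not the authors' implementation: the class name, slot dimensions, readout head, and history length are all illustrative assumptions; the slots are assumed to come from a pretrained object-centric encoder (e.g., a Slot Attention-style model).

```python
# Minimal sketch (assumptions, not the paper's code) of autoregressive
# dynamics over slots of shape (B, T, N, D) produced by a pretrained
# object-centric encoder.
import torch
import torch.nn as nn

class SlotDynamicsSketch(nn.Module):
    def __init__(self, slot_dim=128, history_len=6, nhead=8, num_layers=4):
        super().__init__()
        self.history_len = history_len
        # Learned temporal positional encoding, shared by all slots of a frame.
        self.time_pe = nn.Parameter(torch.zeros(1, history_len, 1, slot_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=slot_dim, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(slot_dim, slot_dim)  # next-step slot readout

    def step(self, slot_history):
        # slot_history: (B, T, N, D) -> predicted slots at step T+1: (B, N, D)
        B, T, N, D = slot_history.shape
        x = slot_history + self.time_pe[:, :T]
        x = x.reshape(B, T * N, D)        # flatten (time, slot) into tokens
        x = self.transformer(x)           # joint spatio-temporal reasoning
        x = x.reshape(B, T, N, D)[:, -1]  # read out the last frame's tokens
        return self.head(x)

    @torch.no_grad()
    def rollout(self, slots, num_future):
        # slots: (B, T, N, D) burn-in clip; returns (B, num_future, N, D).
        preds, history = [], slots
        for _ in range(num_future):
            nxt = self.step(history[:, -self.history_len:])
            preds.append(nxt)
            history = torch.cat([history, nxt.unsqueeze(1)], dim=1)
        return torch.stack(preds, dim=1)
```

Under these assumptions, predicted future slots can be decoded back to frames by the pretrained object-centric decoder for video prediction, or consumed directly by downstream VQA and planning modules.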