Object-centric representations are a promising path toward more systematic generalization because they provide flexible abstractions upon which compositional world models can be built. Recent work on simple 2D and 3D datasets has shown that models with object-centric inductive biases can learn to segment and represent meaningful objects from the statistical structure of the data alone, without any supervision. However, such fully unsupervised methods still fail to scale to diverse realistic data, despite the use of increasingly complex inductive biases such as priors for the size of objects or the 3D geometry of the scene. In this paper, we instead take a weakly supervised approach and study how 1) the temporal dynamics of video data, in the form of optical flow, and 2) conditioning the model on simple object location cues enable segmenting and tracking objects in significantly more realistic synthetic data. We introduce a sequential extension to Slot Attention, which we train to predict optical flow for realistic-looking synthetic scenes, and show that conditioning the initial state of this model on a small set of hints, such as the center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation. These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and longer video sequences. We also find that such initial-state conditioning can be used during inference as a flexible interface to query the model for specific objects or parts of objects, which could pave the way for a range of weakly supervised approaches and allow more effective interaction with trained models.
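As a rough illustration of the conditioning idea described above, the following sketch computes a center-of-mass hint per object from first-frame instance masks and maps each hint to an initial slot vector. This is not the paper's implementation: the function names `centers_of_mass` and `init_slots`, the normalization of coordinates to [0, 1], and the use of a single linear projection as the hint encoder are all assumptions made for illustration.

```python
import numpy as np


def centers_of_mass(masks):
    """Return (K, 2) normalized (y, x) centers for K binary masks of shape (K, H, W).

    Coordinates are scaled to [0, 1] (an assumption of this sketch).
    An empty mask yields (0, 0) because its area is clamped to 1.
    """
    _, height, width = masks.shape
    ys, xs = np.meshgrid(
        np.linspace(0.0, 1.0, height),
        np.linspace(0.0, 1.0, width),
        indexing="ij",
    )
    weights = masks.astype(np.float64)
    area = np.maximum(weights.sum(axis=(1, 2)), 1.0)  # avoid division by zero
    cy = (weights * ys).sum(axis=(1, 2)) / area
    cx = (weights * xs).sum(axis=(1, 2)) / area
    return np.stack([cy, cx], axis=-1)


def init_slots(hints, weight, bias):
    """Map (K, 2) hints to (K, D) initial slot states.

    A single linear layer stands in for whatever hint encoder is used
    (hypothetical); weight has shape (2, D) and bias has shape (D,).
    """
    return hints @ weight + bias


# Example: two objects in a 5x5 first frame, slot dimension 8.
rng = np.random.default_rng(0)
masks = np.zeros((2, 5, 5))
masks[0, :, :] = 1.0   # object covering the whole frame
masks[1, 4, 0] = 1.0   # single pixel in the bottom-left corner
hints = centers_of_mass(masks)
slots = init_slots(hints, rng.normal(size=(2, 8)), np.zeros(8))
```

In this sketch each slot starts from the hint of one object, so the slot index identifies which object it is expected to track. Passing a different hint at inference time (for example, a point on a part of an object) would be the analogue of the querying interface mentioned in the abstract.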