Despite their remarkable success, deep learning algorithms still heavily rely on annotated data. On the other hand, unsupervised settings pose many challenges, especially in determining the right inductive bias in diverse scenarios. One scalable solution is to have the model generate its own supervision by leveraging some part of the input data, which is known as self-supervised learning. In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation. In addition to disentangling the notion of objects and the motion dynamics, our compositional structure explicitly handles occlusion and inpaints the inferred objects and background when composing the predicted frame. With the aid of auxiliary loss functions that promote spatially and temporally consistent object representations, our self-supervised framework can be trained without any manual annotation or pretrained network. Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
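To make the compositional frame synthesis described above more concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of one way such a composition step could look: each inferred object contributes an inpainted appearance, a soft alpha mask, and a depth score, and the predicted frame is composed over an inpainted background with occlusion resolved by a soft depth-weighted ordering. All tensor names and shapes are illustrative assumptions.

```python
import torch

def compose_frame(obj_rgb, obj_alpha, obj_depth, background):
    """
    obj_rgb:    (K, 3, H, W) inpainted per-object appearance
    obj_alpha:  (K, 1, H, W) per-object soft masks in [0, 1]
    obj_depth:  (K,)         scalar depth logits (larger = closer to the camera)
    background: (3, H, W)    inpainted background
    Returns the composed frame of shape (3, H, W).
    """
    K = obj_rgb.shape[0]
    # Occlusion handling: pixels claimed by several objects are assigned softly,
    # weighted by mask strength and a soft depth ordering over the K objects.
    weights = obj_alpha * torch.softmax(obj_depth, dim=0).view(K, 1, 1, 1)
    weights = weights / weights.sum(dim=0, keepdim=True).clamp(min=1e-6)
    # Total foreground coverage at each pixel; uncovered pixels show the background.
    coverage = obj_alpha.max(dim=0).values.clamp(max=1.0)   # (1, H, W)
    foreground = (weights * obj_rgb).sum(dim=0)             # (3, H, W)
    return coverage * foreground + (1.0 - coverage) * background

if __name__ == "__main__":
    K, H, W = 3, 64, 64
    frame = compose_frame(
        obj_rgb=torch.rand(K, 3, H, W),
        obj_alpha=torch.rand(K, 1, H, W),
        obj_depth=torch.randn(K),
        background=torch.rand(3, H, W),
    )
    print(frame.shape)  # torch.Size([3, 64, 64])
```

Because every operation in this sketch is differentiable, a reconstruction loss on the composed frame can propagate gradients back to the per-object appearances, masks, and depths, which is what allows such a pipeline to be trained purely from prediction-based self-supervision.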