Contrastive, self-supervised learning of object representations recently emerged as an attractive alternative to reconstruction-based training. Prior approaches focus on contrasting individual object representations (slots) against one another. However, a fundamental problem with this approach is that the overall contrastive loss is the same whether (i) each slot represents a different object or (ii) all slots (re-)represent the same object. Thus, this objective does not inherently push towards the emergence of object-centric representations in the slots. We address this problem by introducing a global, set-based contrastive loss: instead of contrasting individual slot representations against one another, we aggregate the slot representations and contrast the joined sets against one another. Additionally, we introduce attention-based encoders to this contrastive setup, which simplify training and provide interpretable object masks. Our results on two synthetic video datasets suggest that this approach compares favorably against previous contrastive methods in terms of reconstruction, future prediction, and object separation performance.
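To make the set-based objective concrete, the following is a minimal sketch of how such a loss could be implemented. The sum-pooling aggregation, the InfoNCE-style formulation, and all function and variable names (`set_contrastive_loss`, `slots_t`, `slots_pos`) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def set_contrastive_loss(slots_t, slots_pos, temperature=0.1):
    """Sketch of a global, set-based contrastive loss (assumed form).

    slots_t, slots_pos: [batch, num_slots, dim] slot representations of
    an anchor frame and its positive pair (e.g. the predicted next frame).
    Each set of slots is aggregated into a single vector before the
    contrastive comparison, so the loss scores the set as a whole rather
    than comparing individual slots against one another.
    """
    # Permutation-invariant aggregation: sum-pool over the slot axis.
    z_a = slots_t.sum(dim=1)    # [batch, dim]
    z_p = slots_pos.sum(dim=1)  # [batch, dim]

    z_a = F.normalize(z_a, dim=-1)
    z_p = F.normalize(z_p, dim=-1)

    # InfoNCE over the batch: other examples serve as negatives.
    logits = z_a @ z_p.t() / temperature  # [batch, batch]
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)
```

Note that sum-pooling is permutation-invariant, so the comparison does not depend on slot ordering; under this assumed formulation, re-representing the same object in every slot yields a different aggregated vector than representing each object once, which is what distinguishes the two cases the individual-slot objective cannot tell apart.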