This paper proposes a self-supervised objective for learning representations that localize objects under occlusion, a property known as object permanence. A central question is which learning signal to use when an object is totally occluded. Rather than directly supervising the locations of invisible objects, the objective requires neither human annotation nor assumptions about object dynamics. We show that object permanence can emerge by optimizing for temporal coherence of memory: we fit a Markov walk along a space-time graph of memories, where the states at each time step are non-Markovian features from a sequence encoder. This yields a memory representation that stores occluded objects and predicts their motion in order to localize them better. The resulting model outperforms existing approaches on several datasets of increasing complexity and realism while requiring minimal supervision, which makes it broadly applicable.
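To make the idea of a walk along a space-time graph concrete, the following is a minimal sketch and not the paper's implementation. It assumes that each time step holds a set of memory feature vectors (the graph nodes), that one-step transition probabilities are a softmax over cosine similarities between nodes at consecutive time steps, and that a multi-step walk is the product of those one-step matrices. The function names, the temperature value, and the random features standing in for sequence-encoder outputs are all hypothetical. The abstract does not specify the training loss, so none is shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transition(a, b, temperature=0.1):
    """Row-stochastic matrix of walk probabilities from nodes a (N, D) to nodes b (M, D).

    Assumption: affinities are cosine similarities scaled by a temperature.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return softmax(a @ b.T / temperature, axis=1)

def walk(memories, temperature=0.1):
    """Chain one-step transitions across all consecutive time steps."""
    p = np.eye(memories[0].shape[0])
    for a, b in zip(memories[:-1], memories[1:]):
        p = p @ transition(a, b, temperature)
    return p

# Hypothetical stand-in for encoder outputs: 5 time steps, 4 nodes, 8-dim features.
rng = np.random.default_rng(0)
memories = [rng.normal(size=(4, 8)) for _ in range(5)]

p = walk(memories)          # p[i, j]: probability of ending at node j when starting at node i
row_sums = p.sum(axis=1)    # each row is a probability distribution
```

Because every one-step matrix is row-stochastic, their product is too, so each row of `p` can be read as a distribution over where a walker starting at a given node ends up. A training objective would then score this distribution against some target.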