Self-supervised category-agnostic segmentation of real-world images into objects is a challenging open problem in computer vision. Here, we show how to learn static grouping priors from motion self-supervision, building on the cognitive science notion of Spelke Objects: groupings of stuff that moves together. We introduce the Excitatory-Inhibitory Segment Extraction Network (EISEN), which learns from optical flow estimates to extract pairwise affinity graphs for static scenes. EISEN then produces segments from affinities using a novel graph propagation and competition mechanism. Correlations between independent sources of motion (e.g. robot arms) and the objects they move are resolved into separate segments through a bootstrapping training process. We show that EISEN achieves a substantial improvement in the state of the art for self-supervised segmentation on challenging synthetic and real-world robotic image datasets. We also present an ablation analysis illustrating the importance of each element of the EISEN architecture.
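To make the "propagation and competition over an affinity graph" idea concrete, here is a minimal toy sketch: seed activations are propagated excitatorily along pairwise affinities while channels compete inhibitorily, and each node joins the winning channel. This is an illustrative assumption-laden simplification, not the actual EISEN mechanism; the function name, channel count, and update rule are all hypothetical.

```python
import numpy as np

def segments_from_affinity(A, n_channels=4, n_iters=10, seed=0):
    """Toy illustration of affinity-based propagation + competition.

    A          : (N, N) nonnegative pairwise affinity matrix.
    n_channels : number of putative segment channels (hypothetical choice).
    Returns an integer segment label per node.
    NOTE: a didactic sketch only -- not the published EISEN algorithm.
    """
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    # Row-normalize so one propagation step is an affinity-weighted average.
    P = A / A.sum(axis=1, keepdims=True)
    # Random initial activations, one channel per candidate segment.
    H = rng.random((N, n_channels))
    for _ in range(n_iters):
        H = P @ H                                   # excitatory propagation
        H = H - H.mean(axis=1, keepdims=True)       # inhibitory cross-channel competition
        H = np.maximum(H, 0.0)                      # keep only winning evidence
        H = H / (np.linalg.norm(H, axis=0, keepdims=True) + 1e-8)
    # Each node is assigned to the channel that won the competition.
    return H.argmax(axis=1)
```

On a block-diagonal affinity matrix (two groups with high within-group affinity and none across), propagation quickly makes all nodes within a block agree on a channel, mimicking how strongly affine pixels coalesce into one segment.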