Humans naturally decompose their environment into entities at the appropriate level of abstraction to act in the world. Allowing machine learning algorithms to derive this decomposition in an unsupervised way has become an important line of research. However, current methods are restricted to simulated data or require additional information in the form of motion or depth in order to successfully discover objects. In this work, we overcome this limitation by showing that reconstructing features from models trained in a self-supervised manner is a sufficient training signal for object-centric representations to arise in a fully unsupervised way. Our approach, DINOSAUR, significantly outperforms existing object-centric learning models on simulated data and is the first unsupervised object-centric model that scales to real-world datasets such as COCO and PASCAL VOC. DINOSAUR is conceptually simple and shows competitive performance compared to more involved pipelines from the computer vision literature.
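To make the training signal concrete, the following is a minimal sketch of a feature reconstruction loss, not the DINOSAUR implementation. It assumes that each slot produces a per-patch feature reconstruction plus a per-patch mask logit, that the masks are normalized with a softmax across slots, and that the mixed reconstruction is compared to the frozen self-supervised features with a mean squared error. The abstract specifies none of these details; the function and argument names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def feature_reconstruction_loss(
    target: List[Vector],             # [num_patches][feat_dim], frozen encoder features
    slot_recons: List[List[Vector]],  # [num_slots][num_patches][feat_dim]
    slot_logits: List[Vector],        # [num_slots][num_patches], unnormalized masks
) -> float:
    """Mean squared error between mask-weighted slot reconstructions and targets.

    Assumption (not from the abstract): slots compete for each patch through
    a softmax over their mask logits.
    """
    num_slots = len(slot_recons)
    num_patches = len(target)
    feat_dim = len(target[0])
    total = 0.0
    for p in range(num_patches):
        # Normalize the masks across slots for this patch.
        weights = softmax([slot_logits[k][p] for k in range(num_slots)])
        for d in range(feat_dim):
            mixed = sum(weights[k] * slot_recons[k][p][d] for k in range(num_slots))
            total += (mixed - target[p][d]) ** 2
    return total / (num_patches * feat_dim)


# Toy example: two patches, two feature dimensions, two slots.
target = [[1.0, 0.0], [0.0, 1.0]]
slot_recons = [
    [[1.0, 0.0], [0.0, 0.0]],  # slot 0 explains patch 0
    [[0.0, 0.0], [0.0, 1.0]],  # slot 1 explains patch 1
]
sharp_logits = [[50.0, -50.0], [-50.0, 50.0]]  # each slot claims one patch
flat_logits = [[0.0, 0.0], [0.0, 0.0]]         # slots share every patch equally

sharp_loss = feature_reconstruction_loss(target, slot_recons, sharp_logits)
flat_loss = feature_reconstruction_loss(target, slot_recons, flat_logits)
```

In this toy setting the loss is lower when each slot claims the patch it reconstructs well than when the masks are uniform, which illustrates how reconstructing features can push slots toward a partition of the image.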