Progress in self-supervised learning has brought strong general-purpose image representation learning methods. Yet so far, it has mostly focused on image-level learning. As a result, tasks such as unsupervised image segmentation have not benefited from this trend, as they require spatially diverse representations. However, learning dense representations is challenging, since in the unsupervised setting it is unclear how to guide the model toward representations that correspond to various potential object categories. In this paper, we argue that self-supervised learning of object parts is a solution to this issue. Object parts are generalizable: they are a priori independent of any object definition, but can be grouped to form objects a posteriori. To this end, we leverage the recently proposed Vision Transformer's capability of attending to objects and combine it with a spatially dense clustering task for fine-tuning the spatial tokens. Our method surpasses the state-of-the-art on three semantic segmentation benchmarks by margins of 3% to 17%, showing that our representations are versatile under various object definitions. Finally, we extend this to fully unsupervised segmentation, which refrains completely from using label information even at test time, and demonstrate that a simple method for automatically merging discovered object parts based on community detection yields substantial gains.
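To make the dense clustering idea concrete, the following is a minimal sketch, not the paper's training objective: it extracts Vision Transformer patch (spatial) tokens and clusters them so that each cluster id acts as a candidate object part. The choice of `timm` backbone, the use of plain K-means, and the cluster count `K = 8` are illustrative assumptions, not details from the abstract.

```python
# Minimal sketch (illustrative, not the authors' method): cluster ViT
# spatial tokens into candidate object parts.
import torch
import timm
from sklearn.cluster import KMeans

# Assumed backbone; the paper builds on a self-supervised ViT, but any
# timm ViT exposes patch tokens the same way in recent timm versions.
model = timm.create_model("vit_small_patch16_224", pretrained=False)
model.eval()

x = torch.randn(1, 3, 224, 224)              # stand-in for a real image
with torch.no_grad():
    tokens = model.forward_features(x)       # (1, 1 + 14*14, 384): CLS + patches
patch_tokens = tokens[:, 1:, :].squeeze(0)   # drop CLS token -> (196, 384)

# Cluster the dense spatial tokens; each cluster id is a candidate "part".
kmeans = KMeans(n_clusters=8, n_init=10).fit(patch_tokens.numpy())
part_map = kmeans.labels_.reshape(14, 14)    # dense part assignment per patch
print(part_map)
```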
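The abstract's final step merges discovered parts into objects via community detection. The sketch below shows one plausible instantiation under stated assumptions: the co-occurrence statistics between part clusters are hypothetical placeholder values, and the use of networkx's greedy modularity communities is an assumed choice; the paper's exact graph construction may differ.

```python
# Minimal sketch (assumed instantiation): merge part clusters into objects
# via community detection on a part co-occurrence graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical counts of how often two part clusters co-occur in an image.
cooccurrence = {
    (0, 1): 120, (0, 2): 5, (1, 2): 8,
    (2, 3): 90, (3, 4): 75, (0, 4): 3,
}

G = nx.Graph()
for (a, b), w in cooccurrence.items():
    G.add_edge(a, b, weight=w)

# Each detected community is a set of parts treated as one object category.
communities = greedy_modularity_communities(G, weight="weight")
part_to_object = {p: i for i, c in enumerate(communities) for p in c}
print(part_to_object)  # e.g. parts {0, 1} -> object 0, parts {2, 3, 4} -> object 1
```

Because this merging uses only statistics of the discovered parts, no label information is needed at any stage, matching the fully unsupervised setting described above.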