Being able to learn dense semantic representations of images without supervision is an important problem in computer vision. However, despite its significance, this problem remains largely unexplored, with the few exceptions limited to unsupervised semantic segmentation on small-scale datasets with a narrow visual domain. In this paper, we make a first attempt to tackle the problem on datasets that have traditionally been used for supervised semantic segmentation. To achieve this, we introduce a novel two-step framework that adopts a predetermined prior in a contrastive optimization objective to learn pixel embeddings. This marks a significant departure from existing works, which rely on proxy tasks or end-to-end clustering. Additionally, we argue for the importance of a prior that contains information about objects, or their parts, and discuss several ways to obtain such a prior in an unsupervised manner. Extensive experimental evaluation shows that the proposed method offers key advantages over existing works. First, the learned pixel embeddings can be directly clustered into semantic groups using K-Means. Second, the method serves as effective unsupervised pre-training for the semantic segmentation task. In particular, when fine-tuning the learned representations using just 1% of labeled examples on PASCAL, we outperform supervised ImageNet pre-training by 7.1% mIoU. The code is available at https://github.com/wvangansbeke/Unsupervised-Semantic-Segmentation.
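To make the two-step idea more concrete, below is a minimal sketch of how a mask-guided, pixel-level contrastive objective could look in PyTorch. The object mask proposals stand in for the "predetermined prior" mentioned above; the function name, temperature, and exact loss form are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (assumed loss form): pull each pixel toward the mean
# embedding of its own object mask proposal and push it away from the mean
# embeddings of the other proposals.
import torch
import torch.nn.functional as F


def mask_contrast_loss(pixel_emb, masks, temperature=0.5):
    """Contrastive loss over pixel embeddings guided by mask proposals.

    pixel_emb: (D, H, W) pixel embeddings for one image.
    masks:     (K, H, W) binary object mask proposals (the unsupervised prior).
    """
    D, H, W = pixel_emb.shape
    emb = F.normalize(pixel_emb.view(D, -1), dim=0)            # (D, H*W), unit pixel embeddings
    m = masks.view(masks.shape[0], -1).float()                 # (K, H*W)

    # Mean (prototype) embedding per mask proposal, L2-normalized.
    protos = F.normalize(m @ emb.t() / (m.sum(1, keepdim=True) + 1e-6), dim=1)  # (K, D)

    # Similarity of every pixel to every mask prototype.
    logits = (protos @ emb) / temperature                      # (K, H*W)

    # Each pixel's positive is the prototype of the mask it belongs to;
    # pixels not covered by any proposal are ignored.
    target = m.argmax(0)                                       # (H*W,)
    valid = m.sum(0) > 0
    return F.cross_entropy(logits.t()[valid], target[valid])
```

For the first advantage claimed above, evaluation could amount to running K-Means (e.g. sklearn.cluster.KMeans) over the flattened pixel embeddings of a validation set and treating the cluster assignments as semantic groups; this usage is a plausible reading of the abstract, not a description of the released evaluation code.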