Unsupervised semantic segmentation requires assigning a label to every pixel without any human annotation. Despite recent advances in self-supervised representation learning for individual images, unsupervised semantic segmentation with pixel-level representations remains a challenging and underexplored task. In this work, we propose a self-supervised pixel representation learning method for semantic segmentation that uses visual concepts (i.e., groups of pixels with semantic meanings, such as parts, objects, and scenes) extracted from images. To guide self-supervised learning, we leverage three types of relationships between pixels and concepts: the relationships between pixels and local concepts, between local and global concepts, and the co-occurrence of concepts. We evaluate the learned pixel embeddings and visual concepts on three datasets: PASCAL VOC 2012, COCO 2017, and DAVIS 2017. Our results show that the proposed method achieves consistent and substantial improvements over recent unsupervised semantic segmentation approaches, and also demonstrate that visual concepts can reveal insights into image datasets.
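To make the pixel-to-local-concept relationship mentioned above concrete, the sketch below shows one plausible way such a signal could be turned into a training loss: pixels are assigned to local concepts (e.g., by clustering their embeddings within an image), each concept is summarized by the mean embedding of its pixels, and a contrastive-style objective pulls every pixel toward its own concept prototype and away from the other prototypes in the image. This is a minimal illustrative sketch in PyTorch, not the paper's exact objective; the function name, the clustering-based assignment, and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F


def pixel_to_local_concept_loss(pixel_emb, concept_assign, temperature=0.1):
    """Hypothetical pixel-to-local-concept loss (illustrative, not the paper's formulation).

    pixel_emb:      (N, D) pixel embeddings from one image.
    concept_assign: (N,)   integer id of the local concept each pixel belongs to,
                           e.g. obtained by clustering the pixel embeddings.
    """
    pixel_emb = F.normalize(pixel_emb, dim=1)

    # Prototype of each local concept = mean embedding of its member pixels.
    concept_ids = concept_assign.unique()  # sorted unique concept ids
    prototypes = torch.stack(
        [pixel_emb[concept_assign == c].mean(dim=0) for c in concept_ids]
    )
    prototypes = F.normalize(prototypes, dim=1)

    # Similarity of every pixel to every concept prototype, scaled by temperature.
    logits = pixel_emb @ prototypes.t() / temperature  # (N, K)

    # Target class for each pixel = index of its own concept within concept_ids.
    targets = torch.searchsorted(concept_ids, concept_assign)

    return F.cross_entropy(logits, targets)


# Toy usage: 6 pixels with 8-dim embeddings grouped into 2 local concepts.
emb = torch.randn(6, 8, requires_grad=True)
assign = torch.tensor([0, 0, 0, 1, 1, 1])
loss = pixel_to_local_concept_loss(emb, assign)
loss.backward()
```

Analogous losses could in principle be defined for the local-to-global concept relationship and for concept co-occurrence across images, but their exact form in the proposed method is not specified here.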