Unsupervised image semantic segmentation (UISS) aims to match low-level visual features with semantic-level representations without external supervision. In this paper, we examine the critical properties of UISS models from the perspective of feature alignment and feature uniformity, and we compare UISS with image-wise representation learning. Based on this analysis, we argue that existing MI-based methods for UISS suffer from representation collapse. Motivated by this, we propose a robust network called the Semantic Attention Network (SAN), in which a new module, Semantic Attention (SEAT), generates pixel-wise and semantic features dynamically. Experimental results on multiple semantic segmentation benchmarks show that our unsupervised segmentation framework excels at capturing semantic representations, outperforming all non-pretrained methods and even several pretrained ones.