Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation. To solve this task, algorithms must produce features for every pixel that are both semantically meaningful and compact enough to form distinct clusters. Unlike previous works, which achieve this with a single end-to-end framework, we propose to separate feature learning from cluster compactification. Empirically, we show that current unsupervised feature learning frameworks already generate dense features whose correlations are semantically consistent. This observation motivates us to design STEGO ($\textbf{S}$elf-supervised $\textbf{T}$ransformer with $\textbf{E}$nergy-based $\textbf{G}$raph $\textbf{O}$ptimization), a novel framework that distills unsupervised features into high-quality discrete semantic labels. At the core of STEGO is a novel contrastive loss function that encourages features to form compact clusters while preserving their relationships across the corpora. STEGO yields a significant improvement over the prior state of the art on both the CocoStuff ($\textbf{+14 mIoU}$) and Cityscapes ($\textbf{+9 mIoU}$) semantic segmentation challenges.
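To make the notion of "correlations" between dense features concrete, the following is a minimal sketch that compares every per-pixel feature of one image with every per-pixel feature of another. It assumes cosine similarity as the correlation measure, which the abstract does not specify, and the names `normalize` and `feature_correlation` as well as the toy feature values are hypothetical. It illustrates the kind of pairwise signal a distillation step could consume; it is not STEGO's loss function.

```python
import math

def normalize(v):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def feature_correlation(feats_a, feats_b):
    """Cosine similarity between every pixel feature in A and every one in B.

    feats_a, feats_b: lists of per-pixel feature vectors (a flattened H*W grid).
    Returns a len(feats_a) x len(feats_b) matrix with values in [-1, 1].
    Assumption: cosine similarity stands in for the "correlation" of the text.
    """
    a = [normalize(f) for f in feats_a]
    b = [normalize(g) for g in feats_b]
    return [[sum(x * y for x, y in zip(f, g)) for g in b] for f in a]

# Toy dense features (hypothetical values): two pixels in image A, three in B.
feats_a = [[1.0, 0.0], [0.0, 2.0]]
feats_b = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
corr = feature_correlation(feats_a, feats_b)
```

In this toy case, pixels whose features point in the same direction score close to 1 regardless of magnitude, while orthogonal features score 0. With real features from a pretrained backbone, high entries would be expected to mark pixel pairs that belong to the same semantic category.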