Thanks to the impressive progress of large-scale vision-language pretraining, recent recognition models can classify arbitrary objects in a zero-shot, open-set manner with surprisingly high accuracy. However, transferring this success to semantic segmentation is not trivial: this dense prediction task demands not only accurate semantic understanding but also fine shape delineation, whereas existing vision-language models are trained with image-level language descriptions. To bridge this gap, we pursue \textbf{shape-aware} zero-shot semantic segmentation in this study. Inspired by classical spectral methods in the image segmentation literature, we propose to leverage the eigenvectors of Laplacian matrices constructed from self-supervised pixel-wise features to promote shape awareness. Although this simple and effective technique does not use the masks of seen classes at all, we demonstrate that it outperforms a state-of-the-art shape-aware formulation that aligns ground-truth and predicted edges during training. We also delve into the performance gains achieved on different datasets with different backbones and draw several interesting and conclusive observations: the benefit of promoting shape awareness is highly related to mask compactness and language embedding locality. Finally, our method sets new state-of-the-art performance for zero-shot semantic segmentation on both Pascal and COCO, with significant margins. Code and models will be available at https://github.com/Liuxinyv/SAZS.
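To make the spectral idea concrete, the following is a minimal sketch (not the paper's exact construction) of how eigenvectors of a graph Laplacian built from pixel-wise features yield shape-respecting groupings: cosine affinities between features define a weighted graph, and the eigenvector of the unnormalized Laplacian with the second-smallest eigenvalue (the Fiedler vector) splits the pixels into coherent regions. The affinity choice and the synthetic two-cluster features below are illustrative assumptions.

```python
import numpy as np

def laplacian_eigenvectors(feats, k=2):
    """feats: (N, D) pixel-wise features. Returns the k eigenvectors of the
    unnormalized graph Laplacian with the smallest eigenvalues, shape (N, k)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    W = np.clip(f @ f.T, 0.0, None)   # non-negative cosine affinity matrix
    D = np.diag(W.sum(axis=1))        # degree matrix
    L = D - W                         # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)       # eigenvalues returned in ascending order
    return vecs[:, :k]

# Synthetic "pixels": two feature clusters around distinct base directions.
rng = np.random.default_rng(0)
base_a, base_b = np.array([1.0, 0.3]), np.array([0.3, 1.0])
feats = np.vstack([base_a + 0.05 * rng.standard_normal((20, 2)),
                   base_b + 0.05 * rng.standard_normal((20, 2))])

eig = laplacian_eigenvectors(feats, k=2)
# The Fiedler vector (second column) changes sign between the two groups,
# so thresholding it at zero recovers the cluster structure.
labels = (eig[:, 1] > 0).astype(int)
```

In the paper's setting the features would come from a self-supervised backbone rather than synthetic clusters, and more than one non-trivial eigenvector can be kept to capture finer region structure.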