Zero-shot semantic segmentation (ZS3) aims to segment novel categories that have not been seen during training. Existing works formulate ZS3 as a pixel-level zero-shot classification problem and transfer semantic knowledge from seen classes to unseen ones with the help of language models pre-trained only on text. While simple, this pixel-level formulation has limited capability to integrate vision-language models, which are typically pre-trained on image-text pairs and currently show great potential for vision tasks. Inspired by the observation that humans often perform segment-level semantic labeling, we propose to decouple ZS3 into two sub-tasks: 1) a class-agnostic grouping task that groups pixels into segments, and 2) a zero-shot classification task on segments. The former does not involve category information and transfers directly to grouping pixels of unseen classes. The latter operates at the segment level and provides a natural way to leverage large-scale vision-language models pre-trained on image-text pairs (e.g., CLIP) for ZS3. Based on this decoupled formulation, we propose a simple and effective zero-shot semantic segmentation model, called ZegFormer, which outperforms previous methods on standard ZS3 benchmarks by large margins, e.g., 22 points on PASCAL VOC and 3 points on COCO-Stuff in terms of mIoU for unseen classes. Code will be released at https://github.com/dingjiansw101/ZegFormer.
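To make the second sub-task concrete, below is a minimal sketch of segment-level zero-shot classification with CLIP. This is not the ZegFormer implementation: the segment embeddings are random placeholders standing in for the outputs of a class-agnostic grouping model, and the class names are hypothetical. Only the public `clip` package API (`clip.load`, `clip.tokenize`, `encode_text`) is assumed.

```python
# Minimal sketch of zero-shot classification on segments with CLIP.
# NOT the authors' implementation; segment embeddings below are random
# placeholders for features produced by a class-agnostic grouping model.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical class names; at test time this list may include classes
# never seen during segmentation training.
class_names = ["cat", "dog", "grass", "sky"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(prompts).float()        # (num_classes, 512)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Placeholder segment embeddings, assumed aligned to CLIP's embedding space.
num_segments = 10
seg_emb = torch.randn(num_segments, text_emb.shape[-1], device=device)
seg_emb = seg_emb / seg_emb.norm(dim=-1, keepdim=True)

# Cosine similarity between each segment and each class-name embedding;
# the argmax assigns a (possibly unseen) class label to each segment.
logits = seg_emb @ text_emb.t()                          # (num_segments, num_classes)
pred = logits.argmax(dim=-1)
print([class_names[i] for i in pred.tolist()])
```

Because classification happens per segment rather than per pixel, the text encoder only has to match one embedding per region, which is what makes image-text pre-trained models a natural fit here.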