Recently, open-vocabulary image classification via vision-language pre-training has achieved remarkable results: the model can classify arbitrary categories without seeing additional annotated images of those categories. However, it remains unclear how to make open-vocabulary recognition work well on broader vision problems. This paper targets open-vocabulary semantic segmentation by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP. However, semantic segmentation and the CLIP model operate at different visual granularities: semantic segmentation processes pixels, while CLIP performs classification on whole images. To remedy this discrepancy in processing granularity, we forgo the prevalent one-stage FCN-based framework and advocate a two-stage semantic segmentation framework, where the first stage extracts generalizable mask proposals and the second stage leverages an image-level CLIP model to perform open-vocabulary classification on the masked image crops generated in the first stage. Our experimental results show that this two-stage framework achieves superior performance to FCN-based counterparts when trained only on the COCO Stuff dataset and evaluated on other datasets without fine-tuning. Moreover, this simple framework also surpasses the previous state of the art in zero-shot semantic segmentation by a large margin: +29.5 hIoU on the Pascal VOC 2012 dataset and +8.9 hIoU on the COCO Stuff dataset. With its simplicity and strong performance, we hope this framework can serve as a baseline to facilitate future research. The code is made publicly available at~\url{https://github.com/MendelXu/zsseg.baseline}.
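To make the two-stage pipeline concrete, the following is a minimal sketch (not the authors' released code) of the second stage: each class-agnostic mask proposal is turned into a masked image crop and classified against arbitrary category names with an off-the-shelf CLIP model. It assumes the openai \texttt{clip} package; the mask proposals themselves are treated as given inputs from any first-stage generator, which is a hypothetical placeholder here.

\begin{verbatim}
# Minimal sketch of the second stage: classify masked crops with CLIP.
# Assumptions (not from the paper's code): the `clip` package is installed,
# and `masks` are class-agnostic HxW boolean arrays from a first-stage
# proposal network (placeholder; e.g., any MaskFormer-style generator).
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_masked_crops(image, masks, class_names):
    """Assign an open-vocabulary label to each binary mask proposal."""
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(prompts)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)

    labels = []
    for mask in masks:
        # Blank out background pixels, then crop to the mask's bounding box.
        img_np = np.array(image) * mask[..., None].astype(np.uint8)
        ys, xs = np.nonzero(mask)
        crop = Image.fromarray(
            img_np[ys.min():ys.max() + 1, xs.min():xs.max() + 1])
        crop_in = preprocess(crop).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(crop_in)
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
        # Pick the category whose text embedding best matches the crop.
        labels.append(class_names[(img_feat @ text_feat.T).argmax().item()])
    return labels
\end{verbatim}

Because the category set only enters through the tokenized text prompts, the same trained first stage can be reused for any new vocabulary at test time, which is the property the abstract refers to as open-vocabulary classification of the masked crops.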