Recently, zero-shot image classification via vision-language pre-training has achieved remarkable results: a model can classify images of arbitrary categories without seeing any additional annotated images of those categories. However, it remains unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation. In this paper, we target zero-shot semantic segmentation by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP. This is difficult because semantic segmentation and the CLIP model operate at different visual granularities: semantic segmentation processes pixels while CLIP operates on whole images. To remedy this discrepancy in processing granularity, we forgo the prevalent one-stage FCN-based framework and advocate a two-stage semantic segmentation framework, with the first stage extracting generalizable mask proposals and the second stage leveraging the image-level CLIP model to perform zero-shot classification on the masked image crops generated in the first stage. Our experimental results show that this simple framework surpasses the previous state of the art by a large margin: +29.5 hIoU on the Pascal VOC 2012 dataset, and +8.9 hIoU on the COCO Stuff dataset. With its simplicity and strong performance, we hope this framework can serve as a baseline to facilitate future research.
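To make the two-stage pipeline concrete, below is a minimal sketch (not the paper's implementation) of the second-stage step: zero-shot classification of masked image crops with an off-the-shelf CLIP model. It assumes OpenAI's `clip` Python package and class-agnostic binary masks produced by some first stage; the function name `classify_masked_crops` and the prompt template are illustrative choices only.

```python
# Sketch: classify masked image crops with CLIP (second stage of a
# two-stage zero-shot segmentation framework). Masks are assumed to come
# from any class-agnostic proposal generator (first stage).
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_masked_crops(image: Image.Image, masks: list, class_names: list):
    """Assign one of `class_names` to each binary mask proposal via CLIP."""
    # Encode candidate class names once with a simple prompt template.
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(prompts)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)

    img = np.asarray(image)
    labels = []
    for mask in masks:  # each mask: (H, W) boolean array
        ys, xs = np.where(mask)
        if len(ys) == 0:
            labels.append(None)
            continue
        # Crop to the mask's bounding box and blank out background pixels,
        # producing the "masked image crop" fed to CLIP.
        crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
        crop[~mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]] = 0
        crop_t = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(crop_t)
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
        # Cosine similarity against all class prompts; keep the best match.
        labels.append(class_names[(img_feat @ text_feat.T).argmax().item()])
    return labels
```

Because the text encoder never sees segmentation annotations, the same routine classifies both seen and unseen categories; only the list of class names changes at test time.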