Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. Many recent studies leverage pre-trained CLIP models for image-level classification and manipulation. In this paper, we wish to examine the intrinsic potential of CLIP for pixel-level dense prediction, specifically in semantic segmentation. To this end, we show that, with minimal modification, MaskCLIP yields compelling segmentation results on open concepts across various datasets in the absence of annotations and fine-tuning. By adding pseudo labeling and self-training, MaskCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins, e.g., mIoUs of unseen classes on PASCAL VOC/PASCAL Context/COCO Stuff are improved from 35.6/20.7/30.3 to 86.1/66.7/54.7. We also test the robustness of MaskCLIP under input corruption and evaluate its capability in discriminating fine-grained objects and novel concepts. Our findings suggest that MaskCLIP can serve as a new reliable source of supervision for dense prediction tasks to achieve annotation-free segmentation. Source code is available at https://github.com/chongzhou96/MaskCLIP.