CLIP (Contrastive Language-Image Pretraining) is well developed for open-vocabulary zero-shot image-level recognition, but its application to pixel-level tasks remains under-explored: most efforts directly adopt CLIP features without deliberate adaptation. In this work, we first demonstrate the necessity of image-to-pixel CLIP feature adaptation, then propose Multi-View Prompt learning (MVP-SEG) as an effective solution that achieves this adaptation and solves open-vocabulary semantic segmentation. Concretely, MVP-SEG learns multiple prompts trained with our Orthogonal Constraint Loss (OCLoss), under which each prompt is supervised to exploit CLIP features on different object parts, and the segmentation masks generated collaboratively by all prompts promote better segmentation. Moreover, MVP-SEG introduces Global Prompt Refining (GPR) to further suppress class-wise segmentation noise. Experiments show that multi-view prompts learned on seen categories generalize strongly to unseen categories, and that MVP-SEG+, which adds a knowledge-transfer stage, significantly outperforms previous methods on several benchmarks. Qualitative results further confirm that MVP-SEG indeed attends to different local object parts.
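To make the OCLoss idea concrete, below is a minimal sketch of an orthogonal constraint over K learnable prompt embeddings: pairwise cosine similarities between prompts are penalized so that, when minimized jointly with the segmentation loss, each prompt is pushed toward a distinct direction in CLIP's embedding space. The function name, tensor shapes, and exact formulation here are illustrative assumptions, not the paper's implementation.

```python
import torch


def orthogonal_constraint_loss(prompts: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise cosine similarity among K prompt embeddings.

    prompts: (K, D) tensor of learnable prompt vectors (assumed shape).
    Returns the mean squared off-diagonal cosine similarity; minimizing
    it drives the prompts toward mutual orthogonality.
    """
    normed = torch.nn.functional.normalize(prompts, dim=-1)  # unit-norm rows
    gram = normed @ normed.t()                               # (K, K) cosine similarities
    k = prompts.shape[0]
    off_diag = gram - torch.eye(k, device=prompts.device)    # zero out the diagonal
    return (off_diag ** 2).sum() / (k * (k - 1))             # average over prompt pairs


# Usage: four prompts of (assumed) CLIP text-embedding dimension 512.
prompts = torch.randn(4, 512, requires_grad=True)
loss = orthogonal_constraint_loss(prompts)
loss.backward()  # gradients flow into the prompt embeddings
```

In practice a term like this would be weighted and summed with the segmentation supervision on seen categories, so each prompt specializes to a different object part while all prompts are trained end to end.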