Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task in computer vision. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training (CLIP) models to localize different categories with only image-level labels and without any further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel framework called CLIP-ES for WSSS. Our framework improves all three stages of WSSS with special designs for CLIP: 1) We introduce the softmax function into GradCAM and exploit the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and backgrounds. Meanwhile, to take full advantage of CLIP, we re-explore text inputs under the WSSS setting and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion. 2) To simplify the stage of CAM refinement, we propose a real-time class-aware attention-based affinity (CAA) module based on the inherent multi-head self-attention (MHSA) in CLIP-ViTs. 3) When training the final segmentation model with the masks generated by CLIP, we introduce a confidence-guided loss (CGL) to mitigate noise and focus on confident regions. Our proposed framework dramatically reduces the training cost of WSSS and demonstrates CLIP's capability of localizing objects. CLIP-ES achieves state-of-the-art performance on Pascal VOC 2012 and MS COCO 2014 while taking only 10% of the time of previous methods for pseudo mask generation. Code is available at https://github.com/linyq2117/CLIP-ES.
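To make the first design concrete, here is a minimal sketch of the softmax-GradCAM idea: back-propagating the softmax probability of the target class, rather than its raw logit, couples the classes, so gradients on features shared with competing categories cancel and the resulting CAM is less confused by non-target classes. The toy classifier head, tensor shapes, and variable names below are illustrative assumptions, not the released CLIP-ES implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feats = torch.randn(1, 8, 7, 7, requires_grad=True)  # (B, C, H, W) feature maps
head = torch.nn.Linear(8, 3)                         # toy 3-class head (assumption)
logits = head(feats.mean(dim=(2, 3)))                # (B, K) logits via global pooling

target = 1                                           # index of the target class
probs = F.softmax(logits, dim=-1)                    # softmax couples the classes
probs[0, target].backward()                          # grad of probability, not raw logit

weights = feats.grad.mean(dim=(2, 3))                # GradCAM channel weights, (B, C)
cam = F.relu((weights[..., None, None] * feats).sum(dim=1)).detach()
cam = cam / (cam.amax() + 1e-8)                      # normalized CAM, shape (B, H, W)
```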
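And a hedged sketch of the third design, the confidence-guided loss: a per-pixel cross-entropy on the CLIP-generated pseudo masks that ignores pixels whose pseudo label falls below a confidence threshold, so training focuses on confident regions. The hard threshold, the `confidence` input, and all names are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def confidence_guided_loss(logits, pseudo_mask, confidence, thresh=0.95):
    """Sketch of a confidence-guided loss under the assumptions above.

    logits:      (B, K, H, W) segmentation logits
    pseudo_mask: (B, H, W) long tensor of pseudo labels from CLIP-generated masks
    confidence:  (B, H, W) confidence of each pseudo label in [0, 1]
    """
    per_pixel = F.cross_entropy(logits, pseudo_mask, reduction="none")  # (B, H, W)
    keep = (confidence >= thresh).float()        # keep only confident pixels
    return (per_pixel * keep).sum() / keep.sum().clamp(min=1.0)
```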