Recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has led to great promise in zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning of the CLIP module. Here, we present a cost-effective strategy based on text-prompt learning that keeps the entire CLIP module frozen while fully leveraging its rich information. Specifically, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport, allowing each text prompt to efficiently focus on specific semantic attributes. Additionally, we propose Deep Local Feature Alignment (DLFA), which deeply aligns the text prompts with intermediate local features of the frozen image encoder layers and significantly boosts zero-shot segmentation performance. Through extensive experiments on benchmark datasets, we show that our method achieves state-of-the-art (SOTA) performance while using 7× fewer parameters than previous SOTA approaches.
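The optimal-transport matching at the core of ZegOT can be illustrated with a minimal sketch. The abstract does not specify the authors' solver, so the snippet below assumes the standard entropic-regularized formulation solved with Sinkhorn iterations; the cost matrix, embedding shapes, and uniform marginals are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=100):
    """Entropic optimal transport via Sinkhorn iterations (illustrative).

    cost: (M, N) cost matrix between M text-prompt embeddings and
          N image-patch embeddings. Returns an (M, N) transport plan
          with uniform row/column marginals.
    """
    M, N = cost.shape
    K = np.exp(-cost / reg)        # Gibbs kernel
    a = np.full(M, 1.0 / M)        # uniform mass over text prompts
    b = np.full(N, 1.0 / N)        # uniform mass over image patches
    u, v = np.ones(M), np.ones(N)
    for _ in range(n_iters):       # alternating marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy example: cosine-distance cost between random unit embeddings
# (4 hypothetical prompts, 9 hypothetical patches, dimension 16).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))
img = rng.normal(size=(9, 16))
text /= np.linalg.norm(text, axis=1, keepdims=True)
img /= np.linalg.norm(img, axis=1, keepdims=True)
plan = sinkhorn(1.0 - text @ img.T)
```

Because every prompt must ship its full unit of mass across the patches, each row of `plan` concentrates on the patches cheapest for that prompt, which is the mechanism by which each text prompt can specialize on distinct semantic attributes of the image.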