ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation

Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to exploit its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from the image level to the pixel level. Our investigation starts with a straightforward extension as our baseline, which generates semantic masks by comparing the similarity between text embeddings and patch embeddings extracted from CLIP. However, such a paradigm can heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple but effective designs and find that they retain much of CLIP's inherent zero-shot capacity and improve pixel-level generalization. Incorporating these modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. In extensive experiments on three public benchmarks, ZegCLIP outperforms the state-of-the-art methods by a large margin under both the "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP runs about 5 times faster during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.
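To make the baseline idea concrete, here is a minimal sketch of patch-to-text matching: each patch embedding is compared with every class text embedding by cosine similarity, and the patch takes the label of the most similar class. This illustrates only the straightforward baseline, not ZegCLIP itself; the three proposed designs are not described in the abstract and are not reproduced here. The function names, the toy 2-dimensional embeddings, and the 2 x 2 patch grid are all assumptions made for illustration, and no CLIP model is loaded.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def patch_text_segmentation(patch_embeddings, text_embeddings, grid_hw):
    """Label each patch with the class whose text embedding is most similar.

    patch_embeddings: list of N vectors, hypothetical stand-ins for the
        patch tokens of a CLIP image encoder.
    text_embeddings: list of C vectors, one per class prompt.
    grid_hw: (H, W) patch grid layout, with H * W == N.
    Returns an H x W nested list of class indices (a coarse mask).
    """
    height, width = grid_hw
    if height * width != len(patch_embeddings):
        raise ValueError("grid size must match the number of patches")

    patches = [l2_normalize(p) for p in patch_embeddings]
    texts = [l2_normalize(t) for t in text_embeddings]

    labels = []
    for patch in patches:
        # Dot product of unit vectors equals cosine similarity.
        sims = [sum(a * b for a, b in zip(patch, text)) for text in texts]
        labels.append(max(range(len(sims)), key=sims.__getitem__))

    # Reshape the flat label list back into the patch grid.
    return [labels[r * width:(r + 1) * width] for r in range(height)]


# Toy data (made up): four patches on a 2 x 2 grid, two classes.
toy_patches = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
toy_texts = [[1.0, 0.0], [0.0, 1.0]]
mask = patch_text_segmentation(toy_patches, toy_texts, (2, 2))
print(mask)
```

In a real pipeline the patch-level mask would still have to be upsampled to the image resolution, and, as the abstract notes, fine-tuning such a matcher naively on seen classes tends to overfit to them.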
