CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabulary segmentation: locating any segment given an arbitrary text query. In this research, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and to label them automatically using natural language. We propose a novel problem, zero-guidance segmentation, and the first baseline, which leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's vision-language embedding space, translate them into text labels, and merge semantically similar segments. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both of which are useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for evaluating this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: https://zero-guide-seg.github.io/.
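To make the pipeline concrete, the following is a minimal illustrative sketch, not the authors' implementation: over-segmentation is approximated by k-means on DINO patch features, the attention-masking encoder is replaced by a simple bounding-box crop fed to CLIP, and labels come from nearest-neighbor matching against a small candidate vocabulary rather than the paper's open-ended embedding-to-text translation. The helper names (`over_segment`, `encode_segment`, `label_segments`) are hypothetical.

```python
import numpy as np
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from sklearn.cluster import KMeans
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Two pre-trained generalist models, used as-is without fine-tuning.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").to(device).eval()
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

# ImageNet normalization for DINO (CLIP ships its own preprocessing).
dino_preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])


def over_segment(image: Image.Image, k: int = 16) -> np.ndarray:
    """Cluster DINO patch features into k over-segments (k-means stand-in)."""
    x = dino_preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        patches = dino.get_intermediate_layers(x, n=1)[0][0, 1:]  # drop CLS token
    ids = KMeans(n_clusters=k, n_init=10).fit_predict(patches.cpu().numpy())
    return ids.reshape(14, 14)  # 224 / patch size 16 = 14 patches per side


def encode_segment(image: Image.Image, patch_mask: np.ndarray) -> torch.Tensor:
    """Stand-in for the paper's attention masking: crop the segment's bounding
    box and encode the crop with CLIP. The actual method instead masks CLIP's
    internal attention layers to balance local and global context."""
    ys, xs = np.where(patch_mask)
    w, h = image.size
    box = (int(xs.min() * w / 14), int(ys.min() * h / 14),
           int((xs.max() + 1) * w / 14), int((ys.max() + 1) * h / 14))
    x = clip_preprocess(image.crop(box)).unsqueeze(0).to(device)
    with torch.no_grad():
        emb = clip_model.encode_image(x)[0].float()
    return emb / emb.norm()


def label_segments(image: Image.Image, vocab: list[str]):
    """Assign each over-segment the closest label from a candidate vocabulary.
    (The paper translates embeddings into open-ended text instead, and then
    merges semantically similar segments; both steps are omitted here.)"""
    seg_map = over_segment(image)
    with torch.no_grad():
        text_emb = clip_model.encode_text(clip.tokenize(vocab).to(device)).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    labels = {}
    for seg_id in np.unique(seg_map):
        emb = encode_segment(image, seg_map == seg_id)
        labels[int(seg_id)] = vocab[int((emb @ text_emb.T).argmax())]
    return seg_map, labels
```

As a usage example, `label_segments(Image.open("museum.jpg").convert("RGB"), ["a painting", "a person", "a wall", "the floor"])` would return a 14x14 segment map and one candidate label per segment; the real method requires no such vocabulary, which is the point of zero-guidance segmentation.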