CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In our research, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and to label them automatically using natural language. We propose a novel problem, zero-guidance segmentation, and the first baseline that leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's visual-language space, translate them into text labels, and merge semantically similar segments together. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both of which are useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for the evaluation of this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: https://zero-guide-seg.github.io/.
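The following is a minimal sketch of the pipeline described above, not the paper's actual implementation. It assumes a hypothetical input file `example.jpg`, over-segments the image by k-means clustering of DINO patch features, encodes each over-segment with CLIP via crude pixel masking (a simplification; the paper's contribution is an attention-masking technique inside CLIP), maps each segment embedding to a toy hand-written candidate list (a stand-in for the paper's open-ended embedding-to-text step), and merges near-identical segments.

```python
# Sketch only: pixel masking and a fixed candidate list replace the paper's
# attention masking and open-vocabulary text decoding.
import torch
import clip  # https://github.com/openai/CLIP
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

device = "cuda" if torch.cuda.is_available() else "cpu"
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").to(device).eval()
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

image = Image.open("example.jpg").convert("RGB").resize((224, 224))
x = clip_preprocess(image).unsqueeze(0).to(device)  # reuse CLIP normalization for simplicity

with torch.no_grad():
    # 1) Over-segment: cluster DINO patch tokens (14x14 grid for ViT-S/16 at 224px).
    patch_tokens = dino.get_intermediate_layers(x, n=1)[0][:, 1:, :]  # [1, 196, 384]
    feats = patch_tokens[0].cpu().numpy()
    labels = KMeans(n_clusters=8, n_init=10).fit_predict(feats).reshape(14, 14)

    # 2) Encode each over-segment with CLIP by masking out all other pixels.
    seg_embs = []
    for k in range(8):
        mask = torch.from_numpy((labels == k).astype(np.float32))           # [14, 14]
        mask = torch.nn.functional.interpolate(mask[None, None], size=224)  # [1, 1, 224, 224]
        emb = clip_model.encode_image(x * mask.to(device))
        seg_embs.append(emb / emb.norm(dim=-1, keepdim=True))
    seg_embs = torch.cat(seg_embs)  # [8, 512]

    # 3) "Translate" each segment embedding to text (toy candidate list as a stand-in).
    candidates = ["a painting", "a person", "a wall", "the floor", "a window"]
    text_embs = clip_model.encode_text(clip.tokenize(candidates).to(device))
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
    names = [candidates[i] for i in (seg_embs @ text_embs.T).argmax(dim=-1)]

    # 4) Merge over-segments whose CLIP embeddings and labels nearly coincide.
    merged = {}
    sim = seg_embs @ seg_embs.T
    for k in range(8):
        group = next((g for g in merged if sim[k, g] > 0.9 and names[k] == names[g]), k)
        merged.setdefault(group, []).append(k)

print({names[g]: members for g, members in merged.items()})
```

The cluster count (8), the similarity threshold (0.9), and the candidate label list are arbitrary placeholders; the real method requires neither a predefined vocabulary nor manual thresholds tuned on a segmentation dataset.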