In this paper, we tackle a new computer vision task, open-vocabulary panoptic segmentation, which aims to perform panoptic segmentation (background semantic labeling + foreground instance segmentation) for arbitrary categories described by text. We first build a baseline method that utilizes the knowledge in the existing CLIP model without finetuning or distillation. We then develop a new method, MaskCLIP, a Transformer-based approach that uses mask queries with the ViT-based CLIP backbone to perform semantic segmentation and object instance segmentation. Here we design a Relative Mask Attention (RMA) module that incorporates segmentation masks as additional tokens into the ViT CLIP model. MaskCLIP learns to efficiently and effectively utilize pre-trained dense/local CLIP features while avoiding the time-consuming operation of cropping image patches and computing features with an external CLIP image model. We obtain encouraging results for open-vocabulary panoptic segmentation and state-of-the-art results for open-vocabulary semantic segmentation on the ADE20K and PASCAL datasets, and show qualitative illustrations of MaskCLIP with custom categories.
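The abstract only names the Relative Mask Attention (RMA) module without giving its formulation. As a rough illustration of the stated idea, the sketch below shows one plausible form: mask queries cross-attend to frozen CLIP ViT patch tokens, with each query's current mask prediction applied as a soft additive bias on its attention logits. This is a minimal sketch under our own assumptions; the soft-bias formulation, the module interface, and all names here are hypothetical and are not taken from the paper's actual design.

```python
import torch
import torch.nn as nn

class RelativeMaskAttention(nn.Module):
    """Hypothetical sketch of RMA (not the paper's exact design):
    mask queries cross-attend to ViT patch tokens, with each query's
    predicted mask added as a soft bias to its attention logits."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, patch_tokens, mask_logits):
        # queries:      (B, Q, D) mask tokens appended to the ViT sequence
        # patch_tokens: (B, N, D) frozen CLIP ViT patch features
        # mask_logits:  (B, Q, N) current per-query mask prediction over patches
        B, Q, D = queries.shape
        N = patch_tokens.shape[1]
        h = self.num_heads

        q = self.q_proj(queries).view(B, Q, h, -1).transpose(1, 2)       # (B, h, Q, d)
        k = self.k_proj(patch_tokens).view(B, N, h, -1).transpose(1, 2)  # (B, h, N, d)
        v = self.v_proj(patch_tokens).view(B, N, h, -1).transpose(1, 2)  # (B, h, N, d)

        attn = (q @ k.transpose(-2, -1)) * self.scale                    # (B, h, Q, N)
        # "Relative" (soft) masking: bias attention toward each query's own
        # region instead of hard-zeroing attention outside it.
        bias = mask_logits.sigmoid().log().clamp(min=-1e4)               # (B, Q, N)
        attn = (attn + bias.unsqueeze(1)).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Q, D)
        return self.out_proj(out)

# Example: 100 mask queries over the 14x14 patch grid of a ViT-B/16 backbone.
rma = RelativeMaskAttention(dim=768)
out = rma(torch.randn(2, 100, 768),   # mask queries
          torch.randn(2, 196, 768),   # patch tokens
          torch.randn(2, 100, 196))   # mask logits
print(out.shape)  # torch.Size([2, 100, 768])
```

A soft bias rather than a hard mask keeps some attention on out-of-mask patches, which matches the abstract's goal of refining masks using the pre-trained dense CLIP features; again, this design choice is our reading, not a confirmed detail of MaskCLIP.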