In this work, we focus on instance-level open-vocabulary segmentation, aiming to expand a segmenter to novel categories without requiring mask annotations. We investigate a simple yet effective framework that leverages image captions, exploiting the thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained captioning models or relying on massive caption datasets with complex pipelines, we propose an end-to-end solution built on two components: caption grounding and caption generation. Specifically, we devise a joint Caption Grounding and Generation (CGG) framework on top of a Mask Transformer baseline. The framework includes a novel grounding loss that performs both explicit and implicit multi-modal feature alignment. We further design a lightweight caption generation head that provides additional caption supervision. We find that grounding and generation complement each other, significantly enhancing segmentation performance on novel categories. We conduct extensive experiments on the COCO dataset under two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvement on novel classes on the OSPS benchmark under various settings.
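To make the grounding idea concrete, the following is a minimal sketch (not the authors' released code) of one common way such a multi-modal alignment can be realized: a symmetric InfoNCE loss that pulls each image's pooled mask-query embedding toward the pooled text embedding of the object nouns parsed from its caption. The tensor names `query_feats` and `noun_feats`, the pooling granularity, and the `temperature` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def caption_grounding_loss(query_feats, noun_feats, temperature=0.07):
    """Contrastive caption-grounding sketch (assumed formulation).

    query_feats: (B, D) pooled mask-query embeddings, one per image.
    noun_feats:  (B, D) pooled text embeddings of caption object nouns.
    """
    q = F.normalize(query_feats, dim=-1)
    t = F.normalize(noun_feats, dim=-1)
    logits = q @ t.T / temperature                     # (B, B) similarities
    targets = torch.arange(q.size(0), device=q.device) # diagonal = positives
    # Symmetric InfoNCE: match each image to its own caption nouns and
    # each set of caption nouns back to its own image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

Under this reading, the contrastive term supplies the implicit alignment between queries and nouns, while the caption generation head adds an explicit, token-level supervision signal on top of the same features.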