Open-vocabulary instance segmentation aims to segment novel classes without mask annotations. It is an important step toward reducing laborious human supervision. Most existing works first pretrain a model on captioned images covering many novel classes and then finetune it on limited base classes with mask annotations. However, the high-level textual information learned from caption pretraining alone cannot effectively encode the details required for pixel-wise segmentation. To address this, we propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images. Thus, our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model. To account for noise in pseudo masks, we design a robust student model that selectively distills mask knowledge by estimating the mask noise levels, hence mitigating the adverse impact of noisy pseudo masks. Through extensive experiments, we show the effectiveness of our framework, where we significantly improve the mAP score by 4.5% on MS-COCO and 5.1% on the large-scale Open Images & Conceptual Captions datasets compared to the state-of-the-art.
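To make the two components concrete, the following is a minimal sketch, assuming a PyTorch setting; all names (`mask_feats`, `word_embeds`, `noise_level`, and both function signatures) are hypothetical illustrations of the general idea, not the paper's actual implementation. The first function assigns each candidate mask the caption word with the highest cosine similarity to its visual embedding; the second down-weights the distillation loss for masks estimated to be noisy.

```python
import torch
import torch.nn.functional as F

def pseudo_label_masks(mask_feats, word_embeds, novel_class_words):
    """Hypothetical cross-modal pseudo-labeling step: match each mask's
    visual feature to the most similar caption-word embedding.

    mask_feats:  (M, D) visual embeddings of candidate object masks
    word_embeds: (W, D) text embeddings of novel-class words in captions
    """
    mask_feats = F.normalize(mask_feats, dim=-1)
    word_embeds = F.normalize(word_embeds, dim=-1)
    sim = mask_feats @ word_embeds.T          # (M, W) cosine alignment scores
    scores, idx = sim.max(dim=-1)             # best-matching word per mask
    labels = [novel_class_words[i] for i in idx.tolist()]
    return labels, scores                     # pseudo labels + confidence

def noise_aware_distill_loss(student_logits, pseudo_masks, noise_level):
    """Hypothetical robust distillation loss: weight each pseudo mask by an
    estimated noise level in [0, 1] (1 = certainly noisy, 0 = clean).

    student_logits, pseudo_masks: (M, H, W) per-mask predictions and targets
    noise_level:                  (M,) estimated noise per pseudo mask
    """
    per_mask = F.binary_cross_entropy_with_logits(
        student_logits, pseudo_masks, reduction="none").mean(dim=(1, 2))
    weights = 1.0 - noise_level               # trust cleaner masks more
    return (weights * per_mask).mean()
```

Under these assumptions, the confidence scores from matching could also feed the noise estimate, so that weakly aligned pseudo masks contribute less to self-training.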