Existing instance segmentation models learn task-specific information from manual mask annotations of base (training) categories. These mask annotations require tremendous human effort, which limits the scalability of annotating novel (new) categories. To alleviate this problem, Open-Vocabulary (OV) methods leverage large-scale image-caption pairs and vision-language models to learn novel categories. In summary, an OV method learns task-specific information using strong supervision from base annotations and novel-category information using weak supervision from image-caption pairs. This difference between strong and weak supervision leads to overfitting on base categories, resulting in poor generalization to novel categories. In this work, we overcome this issue by learning both base and novel categories from pseudo-mask annotations generated by a vision-language model in a weakly supervised manner, using our proposed Mask-free OVIS pipeline. Our method automatically generates pseudo-mask annotations by leveraging the localization ability of a pre-trained vision-language model for objects present in image-caption pairs. The generated pseudo-mask annotations are then used to supervise an instance segmentation model, freeing the entire pipeline from labor-intensive instance-level annotations and from overfitting to base categories. Our extensive experiments show that our method, trained with only pseudo-masks, significantly improves mAP scores on the MS-COCO and OpenImages datasets compared to recent state-of-the-art methods trained with manual masks. Code and models are available at https://vibashan.github.io/ovis-web/.
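The abstract does not specify how the vision-language model's localization is turned into a mask, so the following is only an illustrative sketch of one simple possibility, not the procedure used in this work. It assumes a hypothetical 2-D heat map (`activation`) that scores how strongly each pixel responds to one object word from the caption, and it uses an assumed `threshold` hyperparameter to binarize that map into a pseudo-mask and a bounding box.

```python
import numpy as np


def pseudo_mask_from_activation(activation, threshold=0.5):
    """Binarize a localization heat map into a pseudo-mask and a box.

    `activation` is a hypothetical per-object heat map (H x W) for one
    caption word; `threshold` is an assumed hyperparameter in (0, 1],
    not a value taken from the paper.
    Returns (mask, box), where box is (x_min, y_min, x_max, y_max),
    or (all-False mask, None) if the map carries no signal.
    """
    act = np.asarray(activation, dtype=np.float64)
    if act.ndim != 2:
        raise ValueError("activation must be a 2-D array")
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must lie in (0, 1]")

    lo, hi = act.min(), act.max()
    if hi == lo:
        # A flat map gives no localization cue, so emit an empty mask.
        return np.zeros(act.shape, dtype=bool), None

    # Min-max normalize to [0, 1], then keep the strongly activated pixels.
    norm = (act - lo) / (hi - lo)
    mask = norm >= threshold

    # Tight bounding box around the retained pixels (never empty here,
    # because the maximum normalizes to exactly 1.0).
    ys, xs = np.nonzero(mask)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return mask, box


# Toy usage: a 4 x 5 map with a 2 x 2 hot region.
heat = np.zeros((4, 5))
heat[1:3, 2:4] = 1.0
mask, box = pseudo_mask_from_activation(heat)
```

In a full pipeline, masks of this kind would stand in for manual annotations when training the instance segmentation model. A real system would likely refine them further, since thresholded heat maps tend to be coarse.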