Recent mask proposal models have significantly improved the performance of zero-shot semantic segmentation. However, the use of a `background' embedding during training in these methods is problematic, as the resulting model tends to overfit and assign all unseen classes to the background class instead of their correct labels. Furthermore, these methods ignore the semantic relationships among text embeddings, which can be highly informative for zero-shot prediction since seen classes may be closely related to unseen classes. To this end, this paper proposes novel class enhancement losses that bypass the use of the background embedding during training, and simultaneously exploit the semantic relationship between text embeddings and mask proposals by ranking their similarity scores. To further capture the relationship between seen and unseen classes, we propose an effective pseudo-label generation pipeline using a pretrained vision-language model. Extensive experiments on several benchmark datasets show that our method achieves the best overall performance for zero-shot semantic segmentation. Our method is flexible, and can also be applied to the challenging open-vocabulary semantic segmentation problem.