We tackle open-world semantic segmentation, which aims to learn to segment arbitrary visual concepts in images using only image-text pairs, without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and adapting the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy: CL considers only image-text-level alignment at training time, while the segmentation task requires region-text-level alignment at test time. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that directly aligns a text with the region it describes, addressing the train-test discrepancy. Our method generates a segmentation mask associated with a given text, extracts a grounded image embedding from the masked region, and aligns it with the text embedding via TCL. The framework addresses the discrepancy by letting the model learn region-text-level alignment instead of image-text-level alignment, and encourages the model to directly improve the quality of the generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol over 8 widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.
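The grounded-embedding step described above (pool dense image features under the text-generated mask, then score the result against the text embedding) can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the authors' implementation: the mask generator, image encoder, and text encoder are taken as given, and `masked_pool` / `cosine_sim` are hypothetical helper names.

```python
import numpy as np

def masked_pool(features, mask, eps=1e-8):
    """Average dense image features (H, W, D) weighted by a soft mask (H, W) in [0, 1].

    Returns a single grounded image embedding of shape (D,) for the masked region.
    """
    w = mask[..., None]                          # broadcast mask over the feature dim
    return (features * w).sum(axis=(0, 1)) / (w.sum() + eps)

def cosine_sim(a, b, eps=1e-8):
    """Cosine similarity between two embeddings, as used for region-text alignment."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Toy example with assumed shapes: a 4x4 feature map with 8-dim embeddings.
rng = np.random.default_rng(0)
dense_feats = rng.normal(size=(4, 4, 8))         # stand-in for dense image embeddings
text_emb = rng.normal(size=(8,))                 # stand-in for a text embedding
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                             # stand-in for a text-generated mask

grounded_emb = masked_pool(dense_feats, mask)    # embedding of the masked region only
score = cosine_sim(grounded_emb, text_emb)       # region-text alignment score
```

In the full framework this score would feed a contrastive loss over a batch of image-text pairs, so that the model is rewarded only when the generated mask covers the region the text actually describes.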