Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled datasets for this task, however, is notoriously costly and labor-intensive. To overcome this issue, we propose a simple yet effective zero-shot referring image segmentation method that leverages the pre-trained cross-modal knowledge of CLIP. To obtain segmentation masks grounded to the input text, we propose a mask-guided visual encoder that captures global and local contextual information of an input image. By utilizing instance masks obtained from off-the-shelf mask proposal techniques, our method produces fine-grained instance-level groundings. We also introduce a global-local text encoder, in which the global feature captures complex sentence-level semantics of the entire input expression while the local feature focuses on the target noun phrase extracted by a dependency parser. In our experiments, the proposed method outperforms several zero-shot baselines for the task, and even a weakly supervised referring expression segmentation method, by substantial margins. Our code is available at https://github.com/Seonghoon-Yu/Zero-shot-RIS.
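To make the described pipeline concrete, the following is a minimal sketch of a zero-shot mask-selection loop built from the components the abstract names: CLIP as the cross-modal backbone, off-the-shelf instance mask proposals, a dependency parser for the target noun phrase, and a global-local text feature. It assumes the OpenAI `clip` package and spaCy; the helper names (`extract_target_noun_phrase`, `select_mask`), the mixing weight `alpha`, and the use of simple image masking in place of the paper's mask-guided visual encoder are illustrative assumptions, not the exact formulation released in the repository above.

```python
import numpy as np
import torch
import clip          # https://github.com/openai/CLIP
import spacy
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone choice is illustrative
nlp = spacy.load("en_core_web_sm")

def extract_target_noun_phrase(expression):
    # Heuristic assumption: take the noun chunk whose head is the parse root,
    # falling back to the first noun chunk of the expression.
    doc = nlp(expression)
    for chunk in doc.noun_chunks:
        if chunk.root.dep_ == "ROOT" or chunk.root.head == chunk.root:
            return chunk.text
    chunks = list(doc.noun_chunks)
    return chunks[0].text if chunks else expression

@torch.no_grad()
def global_local_text_feature(expression, alpha=0.5):
    # Global feature: the full expression; local feature: the target noun phrase.
    phrases = [expression, extract_target_noun_phrase(expression)]
    tokens = clip.tokenize(phrases).to(device)
    feats = model.encode_text(tokens)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return alpha * feats[0] + (1 - alpha) * feats[1]

@torch.no_grad()
def select_mask(image, masks, expression):
    # Score each off-the-shelf mask proposal against the global-local text
    # feature by encoding the masked image with CLIP's visual encoder.
    # (Zeroing out background pixels is a simplification of the paper's
    # mask-guided visual encoder, which also keeps global context.)
    text_feat = global_local_text_feature(expression)
    scores = []
    for mask in masks:  # each mask: HxW boolean numpy array
        masked = image.copy()          # image: HxWx3 uint8 numpy array
        masked[~mask] = 0
        vis = preprocess(Image.fromarray(masked)).unsqueeze(0).to(device)
        feat = model.encode_image(vis)
        feat = feat / feat.norm(dim=-1, keepdim=True)
        scores.append((feat @ text_feat).item())
    return masks[int(np.argmax(scores))]
```

In this sketch the referring expression is grounded by picking the proposal whose CLIP image feature has the highest cosine similarity to the combined text feature; the instance masks themselves would come from any pre-trained proposal network, so no RIS annotations are needed.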