Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled datasets for this task, however, is notoriously costly and labor-intensive. To overcome this issue, we propose a simple yet effective zero-shot referring image segmentation method by leveraging the pre-trained cross-modal knowledge from CLIP. In order to obtain segmentation masks grounded to the input text, we propose a mask-guided visual encoder that captures global and local contextual information of an input image. By utilizing instance masks obtained from off-the-shelf mask proposal techniques, our method is able to produce fine-grained instance-level groundings. We also introduce a global-local text encoder, where the global feature captures complex sentence-level semantics of the entire input expression while the local feature focuses on the target noun phrase extracted by a dependency parser. In our experiments, the proposed method outperforms several zero-shot baselines of the task, and even a weakly supervised referring expression segmentation method, by substantial margins. Our code is available at https://github.com/Seonghoon-Yu/Zero-shot-RIS.
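To make the described pipeline concrete, the following is a minimal Python sketch of the global-local text encoding and mask-proposal scoring, built on the public CLIP and spaCy packages. The noun-phrase heuristic, the fusion weight `alpha`, and the crop-then-mask approximation of the mask-guided visual encoder are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch only: approximates the global-local text encoder and
# mask-proposal scoring described above with off-the-shelf CLIP and spaCy.
import numpy as np
import torch
import clip          # https://github.com/openai/CLIP
import spacy
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
nlp = spacy.load("en_core_web_sm")


def target_noun_phrase(expression: str) -> str:
    """Extract the target noun phrase of the expression via a dependency parse."""
    doc = nlp(expression)
    # Heuristic (assumption): take the noun chunk attached to the sentence root.
    for chunk in doc.noun_chunks:
        if chunk.root.dep_ == "ROOT" or chunk.root.head.dep_ == "ROOT":
            return chunk.text
    return expression  # fall back to the full expression


def global_local_text_feature(expression: str, alpha: float = 0.5) -> torch.Tensor:
    """Blend sentence-level (global) and noun-phrase (local) CLIP text features."""
    phrases = [expression, target_noun_phrase(expression)]
    tokens = clip.tokenize(phrases).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    fused = alpha * feats[0] + (1 - alpha) * feats[1]
    return fused / fused.norm()


def score_mask_proposals(image: Image.Image, masks, text_feat: torch.Tensor):
    """Score each binary mask proposal by CLIP similarity of its masked image.

    `masks` is an iterable of HxW boolean arrays from any off-the-shelf mask
    proposal method; zeroing out the background is a simple stand-in for the
    paper's mask-guided visual encoder.
    """
    img = np.array(image)
    scores = []
    for m in masks:
        masked = img * m[..., None]                      # keep only the proposal region
        crop = preprocess(Image.fromarray(masked)).unsqueeze(0).to(device)
        with torch.no_grad():
            v = model.encode_image(crop)
        v = v / v.norm(dim=-1, keepdim=True)
        scores.append((v @ text_feat.unsqueeze(-1)).item())
    return scores  # the argmax-scoring mask is taken as the grounded segment
```

In this sketch the final prediction is simply the proposal whose masked CLIP image feature best matches the fused text feature; the released code at the URL above should be consulted for the actual global-local visual encoding.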