Referring image segmentation aims to segment the image region of interest according to a given language expression and is a typical multi-modal task. Existing methods adopt either a pixel-classification-based or a learnable-query-based framework for mask generation, both of which are insufficient to handle diverse text-image pairs with a fixed number of parametric prototypes. In this work, we propose an end-to-end transformer-based framework that performs Linguistic query-Guided mask generation, dubbed LGFormer. It treats the linguistic features as a query to generate a specialized prototype for each input image-text pair, thus producing more consistent segmentation results. Moreover, we design several cross-modal interaction modules (\eg, the vision-language bidirectional attention module, VLBA) in both the encoder and the decoder to achieve better cross-modal alignment.
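The abstract names two mechanisms without code: bidirectional vision-language attention and a linguistic query that produces a per-expression mask prototype. Below is a minimal PyTorch sketch of both ideas under our own assumptions about tensor shapes and module layout; the class names `VLBA` and `LinguisticQueryDecoder` are illustrative and do not reflect the authors' implementation.

```python
import torch
import torch.nn as nn

class VLBA(nn.Module):
    """Sketch of a vision-language bidirectional attention block:
    visual tokens attend to word features and vice versa."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, vis, lang):
        # vis: (B, HW, C) flattened visual tokens; lang: (B, T, C) word features
        vis_upd, _ = self.v2l(query=vis, key=lang, value=lang)   # vision attends to language
        lang_upd, _ = self.l2v(query=lang, key=vis, value=vis)   # language attends to vision
        return self.norm_v(vis + vis_upd), self.norm_l(lang + lang_upd)

class LinguisticQueryDecoder(nn.Module):
    """Sketch of linguistic query-guided mask generation: a sentence-level
    query is refined by a transformer decoder over visual tokens, then used
    as a dynamic prototype (mask kernel) via dot product with pixel features."""
    def __init__(self, dim, num_heads=8, num_layers=3):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, lang_query, vis, h, w):
        # lang_query: (B, 1, C) pooled sentence feature; vis: (B, HW, C)
        prototype = self.decoder(tgt=lang_query, memory=vis)        # (B, 1, C)
        mask_logits = torch.einsum("bqc,bnc->bqn", prototype, vis)  # (B, 1, HW)
        return mask_logits.view(-1, 1, h, w)

# Toy usage with random features standing in for backbone outputs.
B, C, H, W = 2, 256, 20, 20
vis, lang = torch.randn(B, H * W, C), torch.randn(B, 12, C)
vis, lang = VLBA(C)(vis, lang)
mask = LinguisticQueryDecoder(C)(lang.mean(dim=1, keepdim=True), vis, H, W)
```

Because the prototype is decoded from the expression itself rather than drawn from a fixed bank of learned queries, one kernel is produced per text-image pair, which is the property the abstract credits for more consistent segmentation.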