Referring image segmentation aims to segment the image region of interest according to a given language expression, a typical multi-modal task. One of the critical challenges of this task is aligning semantic representations across modalities, i.e., vision and language. To achieve this, previous methods perform cross-modal interactions to update visual features but ignore the role of integrating fine-grained visual features into linguistic features. We present AlignFormer, an end-to-end framework for referring image segmentation. AlignFormer treats the linguistic feature as a center embedding and segments the region of interest by grouping pixels around this center embedding. To achieve pixel-text alignment, we design a Vision-Language Bidirectional Attention module (VLBA) and resort to contrastive learning. Concretely, the VLBA enhances visual features by propagating semantic text representations to each pixel and enriches linguistic features by fusing fine-grained image features. Moreover, we introduce a cross-modal instance contrastive loss to alleviate the influence of pixel samples in ambiguous regions and improve the ability to align multi-modal representations. Extensive experiments demonstrate that AlignFormer achieves new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg, surpassing previous methods by large margins.
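To make the two core ideas concrete, the sketch below illustrates (a) bidirectional cross-attention in the spirit of VLBA, where pixels attend to words and words attend to pixels, and (b) segmentation as grouping pixels around a sentence-level center embedding. This is a minimal PyTorch sketch under assumed shapes and a single-layer design; the class and function names, dimensions, and residual wiring are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn


class BidirectionalAttention(nn.Module):
    """Exchanges information between pixel and word features in both directions
    (an assumed stand-in for the VLBA module)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # language -> vision: pixels attend to words (propagate text semantics to each pixel)
        self.vision_from_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # vision -> language: words attend to pixels (fuse fine-grained visual cues)
        self.text_from_vision = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, pixel_feats: torch.Tensor, word_feats: torch.Tensor):
        # pixel_feats: (B, H*W, dim) flattened visual features
        # word_feats:  (B, L, dim)   token-level linguistic features
        enhanced_pixels, _ = self.vision_from_text(pixel_feats, word_feats, word_feats)
        enhanced_words, _ = self.text_from_vision(word_feats, pixel_feats, pixel_feats)
        return pixel_feats + enhanced_pixels, word_feats + enhanced_words


def group_pixels(pixel_feats: torch.Tensor, center: torch.Tensor) -> torch.Tensor:
    """Soft mask from similarity between each pixel and the pooled linguistic
    center embedding; `center` has shape (B, dim)."""
    logits = torch.einsum("bnd,bd->bn", pixel_feats, center)
    return logits.sigmoid()  # (B, H*W) soft segmentation mask

The cross-modal instance contrastive loss described in the abstract would, in this sketch, pull pixel features inside the referred region toward the center embedding and push background pixels away; its exact form is not specified here.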