In this work, we address the challenging task of referring segmentation. The query expression in referring segmentation typically indicates the target object by describing its relationship with others. Therefore, to find the target object among all instances in the image, the model must have a holistic understanding of the whole image. To achieve this, we reformulate referring segmentation as a direct attention problem: finding the region in the image to which the query language expression is most attended. We introduce a transformer with multi-head attention to build a network with an encoder-decoder attention architecture that "queries" the given image with the language expression. Furthermore, we propose a Query Generation Module, which produces multiple sets of queries with different attention weights, representing diversified comprehensions of the language expression from different aspects. At the same time, to select the best among these diversified comprehensions based on visual cues, we further propose a Query Balance Module that adaptively weights the output features of these queries for better mask generation. Without bells and whistles, our approach is lightweight and consistently achieves new state-of-the-art performance on three referring segmentation datasets: RefCOCO, RefCOCO+, and G-Ref. Our code is available at https://github.com/henghuiding/Vision-Language-Transformer.
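As a rough illustration of the idea (not the authors' implementation; all class names, dimensions, and the use of PyTorch's built-in transformer layer are assumptions), a minimal sketch of query generation and query balancing might look like:

```python
import torch
import torch.nn as nn

class QueryGenerationSketch(nn.Module):
    """Sketch: derive multiple query sets from word-level language features,
    each emphasizing the words with a different learned attention pattern."""
    def __init__(self, dim=256, num_queries=16):
        super().__init__()
        # One learnable attention pattern per query over the word features.
        self.word_attn = nn.Linear(dim, num_queries)

    def forward(self, lang_feats):
        # lang_feats: (batch, num_words, dim)
        weights = self.word_attn(lang_feats).softmax(dim=1)   # (B, W, Q)
        # Each query is a differently weighted sum of the word features,
        # i.e., a distinct comprehension of the same expression.
        return torch.einsum('bwq,bwd->bqd', weights, lang_feats)  # (B, Q, dim)

class QueryBalanceSketch(nn.Module):
    """Sketch: score each query's decoder output and fuse them adaptively
    before mask prediction."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, query_feats):
        # query_feats: (B, Q, dim) -- decoder output features, one per query
        gates = self.score(query_feats).softmax(dim=1)        # (B, Q, 1)
        return (gates * query_feats).sum(dim=1)               # (B, dim)

# The generated queries "query" the flattened image features through a
# transformer decoder (cross-attention), sketched with PyTorch's stock layer.
decoder_layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True)
queries = QueryGenerationSketch()(torch.randn(2, 20, 256))    # (2, 16, 256)
img_feats = torch.randn(2, 26 * 26, 256)                      # flattened H*W grid
out = decoder_layer(tgt=queries, memory=img_feats)            # (2, 16, 256)
fused = QueryBalanceSketch()(out)                             # (2, 256)
```

The key design choice this sketch mirrors is that ambiguity in the expression is handled by generating several language queries rather than one, with the balance step letting visual evidence decide which comprehension dominates the final mask features.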