Vision-Language Transformer and Query Generation for Referring Segmentation

In this work, we address the challenging task of referring segmentation. The query expression in referring segmentation typically indicates the target object by describing its relationship with others. Therefore, to find the target one among all instances in the image, the model must have a holistic understanding of the whole image. To achieve this, we reformulate referring segmentation as a direct attention problem: finding the region in the image where the query language expression is most attended to. We introduce transformer and multi-head attention to build a network with an encoder-decoder attention mechanism architecture that "queries" the given image with the language expression. Furthermore, we propose a Query Generation Module, which produces multiple sets of queries with different attention weights that represent the diversified comprehensions of the language expression from different aspects. At the same time, to find the best way from these diversified comprehensions based on visual clues, we further propose a Query Balance Module to adaptively select the output features of these queries for a better mask generation. Without bells and whistles, our approach is light-weight and achieves new state-of-the-art performance consistently on three referring segmentation datasets, RefCOCO, RefCOCO+, and G-Ref. Our code is available at https://github.com/henghuiding/Vision-Language-Transformer.
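The "querying" idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (their model is a transformer with learned multi-head attention); it is a minimal pure-Python illustration under stated assumptions. The function names `attend` and `balanced_response`, the feature values, and the confidence scores are all hypothetical. Each language-derived query attends over flattened image positions, and a set of confidence scores, standing in for the Query Balance Module, blends the per-query attention maps.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, pixel_feats):
    """Scaled dot-product attention of one query vector over all image positions."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, p) / scale for p in pixel_feats])

def balanced_response(queries, confidences, pixel_feats):
    """Blend per-query attention maps with softmax-normalised confidences.

    Assumption: a simplified stand-in for the learned Query Balance Module.
    """
    weights = softmax(confidences)
    maps = [attend(q, pixel_feats) for q in queries]
    return [
        sum(w * m[i] for w, m in zip(weights, maps))
        for i in range(len(pixel_feats))
    ]

# Hypothetical toy data: a 2x2 image flattened to 4 positions, feature dim 2.
pixel_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two hand-set "comprehensions" of the same expression (learned in the real model).
queries = [[2.0, 0.0], [0.0, 2.0]]
# Hand-set confidence per query (predicted from visual cues in the real model).
confidences = [1.5, 0.2]

response = balanced_response(queries, confidences, pixel_feats)
# Index of the position the expression attends to most.
target = max(range(len(response)), key=response.__getitem__)
```

In this sketch the blended response is itself a distribution over positions, so the most-attended position can be read off directly; the actual network instead decodes a segmentation mask from the attended features.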

Related content

Attention mechanism

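The attention mechanism was first proposed in the field of computer vision, but it arguably took off with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which applied attention to an RNN model for image classification. Bahdanau et al. then used an attention-like mechanism in "Neural Machine Translation by Jointly Learning to Align and Translate" [1] to perform translation and alignment jointly in machine translation; their work is generally regarded as the first to apply attention to NLP. Similar attention-based extensions of RNN models were subsequently applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the rough trend of attention research.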
