As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance due to a two-stage setup, or require the design of complex task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture in which the two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries, which are then decoded to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-art methods by a large margin on both REC and RES tasks. We also show that a simple pre-training schedule (on an external dataset) further improves performance. Extensive experiments and ablations illustrate that our model benefits greatly from contextualized information and multi-task training.
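To make the described design concrete, the following is a minimal sketch (not the authors' implementation) of a one-stage multi-task grounding model: visual and lingual tokens are fused in a joint transformer encoder, a decoder produces a contextualized query, and two lightweight heads regress the referred box and predict a coarse segmentation mask. All module names, dimensions, and layer counts here are illustrative assumptions.

```python
# Sketch of a one-stage multi-task visual grounding model (illustrative only).
import torch
import torch.nn as nn


class MultiTaskGroundingSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_enc=6, n_dec=6, mask_size=32):
        super().__init__()
        # Joint visual-lingual encoder: both modalities share one transformer.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_enc)
        # Decoder turns a learned query into a contextualized lingual query.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_dec)
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        # Task heads: normalized box regression (cx, cy, w, h) and a coarse mask.
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 4), nn.Sigmoid())
        self.mask_head = nn.Linear(d_model, mask_size * mask_size)
        self.mask_size = mask_size

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, d), text_tokens: (B, Nt, d)
        fused = self.encoder(torch.cat([visual_tokens, text_tokens], dim=1))
        q = self.query.expand(visual_tokens.size(0), -1, -1)
        ctx = self.decoder(q, fused)             # (B, 1, d) contextualized query
        box = self.box_head(ctx).squeeze(1)      # box for the referred region
        mask = self.mask_head(ctx).view(-1, self.mask_size, self.mask_size)
        return box, mask


# Example usage with random features standing in for backbone / text-encoder outputs.
model = MultiTaskGroundingSketch()
box, mask = model(torch.randn(2, 196, 256), torch.randn(2, 20, 256))
print(box.shape, mask.shape)  # torch.Size([2, 4]) torch.Size([2, 32, 32])
```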