Referring Expression Comprehension (REC) has become one of the most important tasks in visual reasoning, since it is an essential step for many vision-and-language tasks such as visual question answering. However, it has not been widely adopted in downstream tasks because 1) two-stage methods incur heavy computation cost and inevitable error accumulation, and 2) one-stage methods depend on many hyper-parameters (such as anchors) to generate bounding boxes. In this paper, we present a proposal-free one-stage (PFOS) model that regresses the region of interest from the image, based on a textual query, in an end-to-end manner. Instead of following the dominant anchor-proposal fashion, we directly take the dense grid of an image as input to a cross-attention transformer that learns grid-word correspondences. The final bounding box is predicted directly from the image, without the time-consuming anchor-selection process that previous methods suffer from. Our model achieves state-of-the-art performance on four referring expression datasets with higher efficiency, compared to the previous best one-stage and two-stage methods.
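To make the proposal-free idea concrete, the sketch below shows one plausible reading of the described architecture: dense grid features act as queries that cross-attend over word embeddings, and a small head regresses a single normalized box with no anchors. All module names, layer sizes, and the pooling choice are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch, assuming PyTorch and a CNN backbone that has already
# produced a (B, H*W, d) grid of visual tokens. Grid tokens (queries)
# attend over word tokens via cross-attention, and the fused features
# are regressed directly to one (cx, cy, w, h) box -- no anchors.
import torch
import torch.nn as nn

class CrossAttentionGrounder(nn.Module):  # hypothetical name
    def __init__(self, d_model=256, n_heads=8, n_layers=2):
        super().__init__()
        # Cross-attention transformer: tgt=grid tokens, memory=word tokens.
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        # Predict a normalized box directly from the pooled grid features.
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),  # (cx, cy, w, h) in [0, 1]
        )

    def forward(self, grid_feats, word_feats):
        # grid_feats: (B, H*W, d) dense image grid; word_feats: (B, L, d) query.
        fused = self.decoder(tgt=grid_feats, memory=word_feats)
        pooled = fused.mean(dim=1)       # aggregate grid-word evidence
        return self.box_head(pooled)     # one box per image-query pair

model = CrossAttentionGrounder()
grid = torch.randn(2, 16 * 16, 256)      # e.g. a 16x16 feature grid
words = torch.randn(2, 12, 256)          # a 12-token textual query
boxes = model(grid, words)               # (2, 4); trainable with an L1/IoU loss
```

Because the box is regressed directly, inference needs no anchor generation, scoring, or non-maximum suppression, which is where the claimed efficiency gain over anchor-based one-stage methods would come from.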