Visual grounding is the task of localizing the target object referred to by a natural language expression. Existing methods extend generic object detection frameworks to this problem: they ground the expression on visual features extracted from pre-generated proposals or anchors, and fuse these features with text embeddings to locate the target mentioned by the text. However, modeling visual features only at these predefined locations may fail to fully exploit the visual context and the attribute information in the text query, which limits performance. In this paper, we propose a transformer-based framework for accurate visual grounding that establishes text-conditioned discriminative features and performs multi-stage cross-modal reasoning. Specifically, we develop a visual-linguistic verification module that focuses the visual features on regions relevant to the textual description while suppressing unrelated areas. A language-guided feature encoder is also devised to aggregate the visual context of the target object and improve its distinctiveness. To retrieve the target from the encoded visual features, we further propose a multi-stage cross-modal decoder that iteratively speculates on the correlations between the image and the text for accurate target localization. Extensive experiments on five widely used datasets validate the efficacy of the proposed components and demonstrate state-of-the-art performance. Our code is publicly available at https://github.com/yangli18/VLTVG.
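To make the verification idea more concrete, the following is a minimal PyTorch-style sketch, not the released implementation: the module name, feature dimensions, and the similarity-based gating are illustrative assumptions. It shows one way per-location visual features could be modulated by their semantic similarity to the word embeddings of the query, emphasizing text-relevant regions and suppressing the rest.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualLinguisticVerification(nn.Module):
    """Hypothetical sketch of a visual-linguistic verification step:
    project visual and textual features into a joint embedding space,
    score each spatial location by its similarity to the query words,
    and gate the visual features with these scores."""

    def __init__(self, visual_dim=256, text_dim=256, embed_dim=256):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, visual_feats, text_feats):
        # visual_feats: (N, visual_dim) -- N flattened spatial locations
        # text_feats:   (T, text_dim)   -- T word/token embeddings
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)  # (N, D)
        t = F.normalize(self.text_proj(text_feats), dim=-1)      # (T, D)
        sim = v @ t.t()                        # (N, T) location-word similarity
        verification = sim.max(dim=-1).values  # (N,) best-matching word per location
        # emphasize text-relevant locations, suppress unrelated ones
        return visual_feats * verification.sigmoid().unsqueeze(-1)

# usage sketch with assumed feature shapes
vis = torch.randn(400, 256)   # e.g. a 20x20 feature map, flattened
txt = torch.randn(12, 256)    # e.g. 12 word embeddings
out = VisualLinguisticVerification()(vis, txt)  # (400, 256)
```

The gating here is a simplified stand-in for the paper's verification scores; the actual module and the downstream language-guided encoder and multi-stage decoder are described in the paper and repository.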