Visual grounding aims to locate a target object according to a natural language expression. As a multi-modal task, it depends critically on the interaction between textual and visual features. However, previous solutions mainly handle each modality independently before fusing them, and thus do not take full advantage of the relevant textual information while extracting visual features. To better leverage the textual-visual relationship in visual grounding, we propose a Query-conditioned Convolution Module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels. With the proposed QCM, the downstream fusion module receives visual features that are more discriminative and focused on the object described in the expression, leading to more accurate predictions. Extensive experiments on three popular visual grounding datasets demonstrate that our method achieves state-of-the-art performance. Moreover, the query-aware visual features are informative enough that, even when used directly for prediction without further multi-modal fusion, they achieve performance comparable to the latest methods.
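To illustrate the general idea of conditioning convolutional kernels on the query, the sketch below generates per-sample depthwise kernels from a pooled query embedding and applies them to the visual feature map. This is a minimal sketch under our own assumptions (the module name `QueryConditionedConv`, depthwise kernels, and a single linear kernel generator are hypothetical), not the paper's actual QCM implementation.

```python
# Minimal sketch (not the authors' implementation) of a query-conditioned
# convolution: a linear layer maps the query embedding to per-sample depthwise
# kernels, which are applied to the visual feature map via grouped convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryConditionedConv(nn.Module):
    """Generate depthwise conv kernels from a query embedding (hypothetical sketch)."""

    def __init__(self, vis_channels: int, query_dim: int, kernel_size: int = 3):
        super().__init__()
        self.vis_channels = vis_channels
        self.kernel_size = kernel_size
        # Map the query vector to one k x k filter per visual channel.
        self.kernel_gen = nn.Linear(query_dim, vis_channels * kernel_size * kernel_size)

    def forward(self, vis_feat: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C, H, W) visual feature map; query_emb: (B, D) pooled query embedding.
        b, c, h, w = vis_feat.shape
        kernels = self.kernel_gen(query_emb).view(b * c, 1, self.kernel_size, self.kernel_size)
        # Fold the batch into the channel dimension so each sample is filtered
        # by its own query-generated kernels (grouped convolution trick).
        out = F.conv2d(
            vis_feat.reshape(1, b * c, h, w),
            kernels,
            padding=self.kernel_size // 2,
            groups=b * c,
        )
        return out.view(b, c, h, w)


# Usage: query-aware features for a batch of 2 feature maps and pooled query vectors.
vis = torch.randn(2, 256, 20, 20)
qry = torch.randn(2, 512)
qcm = QueryConditionedConv(vis_channels=256, query_dim=512)
print(qcm(vis, qry).shape)  # torch.Size([2, 256, 20, 20])
```

Because the kernels are a function of the query, the resulting feature map already emphasizes regions relevant to the expression before any downstream fusion takes place.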