The prevailing framework for matching multimodal inputs is based on a two-stage process: 1) detecting proposals with an object detector and 2) matching text queries with proposals. Existing two-stage solutions mostly focus on the matching step. In this paper, we argue that these methods overlook an obvious \emph{mismatch} between the roles of proposals in the two stages: they generate proposals solely based on the detection confidence (i.e., query-agnostic), hoping that the proposals contain all instances mentioned in the text query (i.e., query-aware). Due to this mismatch, proposals relevant to the text query may be suppressed during the filtering process, which in turn bounds the matching performance. To this end, we propose VL-NMS, which is the first method to yield query-aware proposals at the first stage. VL-NMS regards all mentioned instances as critical objects, and introduces a lightweight module to predict a score for aligning each proposal with a critical object. These scores guide the non-maximum suppression (NMS) operation to filter out proposals irrelevant to the text query, which increases the recall of critical objects and significantly improves matching performance. Since VL-NMS is agnostic to the matching step, it can be easily integrated into any state-of-the-art two-stage matching method. We validate the effectiveness of VL-NMS on two multimodal matching tasks, namely referring expression grounding and image-text matching. Extensive ablation studies on several baselines and benchmarks consistently demonstrate the superiority of VL-NMS.
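To make the mismatch concrete, the following is a minimal sketch of how a per-proposal relevance score could steer greedy NMS. It is an illustration under stated assumptions, not the implementation from the paper: the abstract does not say how the predicted score is combined with detection confidence, so the multiplicative fusion below is assumed, and the names `query_aware_nms`, `relevance`, and the toy boxes are hypothetical.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box (4,) and many boxes (N, 4), format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Standard greedy NMS; returns kept indices in descending score order."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        # Drop every remaining box that overlaps the kept box too much.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep


def query_aware_nms(boxes, det_scores, relevance, iou_threshold=0.5):
    """Hypothetical query-aware variant: rank by detection score times relevance.

    ASSUMPTION: multiplicative fusion is chosen for illustration only.
    """
    return nms(boxes, det_scores * relevance, iou_threshold)


# Toy example: box 0 and box 1 overlap heavily (IoU is about 0.68).
boxes = np.array([
    [0.0, 0.0, 10.0, 10.0],    # confident detection, not mentioned in the query
    [1.0, 1.0, 11.0, 11.0],    # less confident, but the object the query mentions
    [50.0, 50.0, 60.0, 60.0],  # unrelated, non-overlapping object
])
det_scores = np.array([0.9, 0.6, 0.8])
relevance = np.array([0.1, 0.9, 0.2])  # made-up scores from a hypothetical module

plain_keep = nms(boxes, det_scores)
aware_keep = query_aware_nms(boxes, det_scores, relevance)
print(plain_keep, aware_keep)
```

In this toy case, query-agnostic NMS keeps box 0 and suppresses the mentioned object (box 1), whereas the relevance-weighted ranking keeps box 1 and suppresses box 0 instead. This is the recall effect the abstract describes, shown here only on fabricated inputs.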