In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from text and locate objects in images simultaneously, which is a more practical setting for real applications. As phrase extraction can be regarded as a $1$D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from the image and text for object prediction and phrase mask prediction. Each pair of dual queries is designed to share its positional part while having distinct content parts. This design effectively alleviates the difficulty of modality alignment between image and text (in contrast to a single-query design) and enables the Transformer decoder to leverage phrase mask-guided attention to improve performance. To evaluate PEG, we also propose a new metric, CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in the many-box-to-one-phrase cases of phrase grounding. As a result, our PEG-pretrained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone; for example, it achieves $91.04\%$ and $83.51\%$ recall on RefCOCO testA and testB, respectively. Code will be available at \url{https://github.com/IDEA-Research/DQ-DETR}.
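To make the dual-query design concrete, the following minimal sketch (not the authors' released implementation; names such as \texttt{DualQueries}, \texttt{num\_queries}, and \texttt{d\_model} are hypothetical) illustrates one way a query pair could share a positional embedding while keeping separate content embeddings for the image branch (box prediction) and the text branch (phrase mask prediction):
\begin{verbatim}
# A minimal sketch of the dual-query idea, assuming a DETR-style decoder:
# each query pair shares one positional embedding but keeps two separate
# content embeddings, one probing image features and one probing text features.
import torch
import torch.nn as nn


class DualQueries(nn.Module):
    def __init__(self, num_queries: int = 100, d_model: int = 256):
        super().__init__()
        # One shared positional part per query pair.
        self.pos = nn.Embedding(num_queries, d_model)
        # Two distinct content parts: one for the image branch, one for text.
        self.content_img = nn.Embedding(num_queries, d_model)
        self.content_txt = nn.Embedding(num_queries, d_model)

    def forward(self, batch_size: int):
        # Expand to (batch, num_queries, d_model); the two queries of a pair
        # differ only in their content part.
        pos = self.pos.weight.unsqueeze(0).expand(batch_size, -1, -1)
        q_img = pos + self.content_img.weight.unsqueeze(0).expand(batch_size, -1, -1)
        q_txt = pos + self.content_txt.weight.unsqueeze(0).expand(batch_size, -1, -1)
        return q_img, q_txt


# Usage: image queries would attend to image memory to predict boxes, while
# the paired text queries attend to token features to predict 1D phrase masks.
queries = DualQueries()
q_img, q_txt = queries(batch_size=2)
print(q_img.shape, q_txt.shape)  # torch.Size([2, 100, 256]) for both
\end{verbatim}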