Referring expression grounding is an important and challenging task in computer vision. To avoid the laborious annotation required by conventional referring grounding, unpaired referring grounding has been introduced, where the training data contains only a set of images and queries without correspondences. The few existing solutions to unpaired referring grounding remain preliminary, due to the challenge of learning image-text matching and the lack of top-down guidance with unpaired data. In this paper, we propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges. Specifically, we design a query-aware attention map (QAM) module that introduces a top-down perspective by generating query-specific visual attention maps. A cross-modal object matching (COM) module is further introduced, which exploits the recently emerged image-text matching pretrained model, CLIP, to predict the target objects from a bottom-up perspective. The top-down and bottom-up predictions are then integrated via a similarity fusion (SF) module. We also propose a knowledge adaptation matching (KAM) module that leverages unpaired training data to adapt the pretrained knowledge to the target dataset and task. Experiments show that our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
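To make the bottom-up idea behind the COM module concrete, the sketch below shows one common way CLIP can score region proposals against a referring query: crop each proposal, embed crops and query with CLIP, and rank proposals by cosine similarity. This is a minimal illustration under assumed inputs (a PIL image and a hypothetical list of proposal boxes), not the paper's actual COM implementation.

```python
# Minimal sketch of CLIP-based bottom-up region-query matching.
# Assumes the openai `clip` package is installed; `boxes` is a hypothetical
# list of proposal boxes in (x1, y1, x2, y2) pixel coordinates.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def score_proposals(image: Image.Image, boxes, query: str) -> torch.Tensor:
    """Return one CLIP similarity score per proposal box for the given query."""
    # Crop each proposal region and preprocess it for the CLIP image encoder.
    crops = torch.stack([preprocess(image.crop(box)) for box in boxes]).to(device)
    tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        img_feats = model.encode_image(crops)
        txt_feats = model.encode_text(tokens)
    # Cosine similarity between each cropped region and the query text.
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    return (img_feats @ txt_feats.T).squeeze(-1)

# Example usage with hypothetical proposals:
# image = Image.open("example.jpg")
# boxes = [(10, 20, 120, 200), (150, 40, 300, 260)]
# scores = score_proposals(image, boxes, "the man in the red shirt")
# target_box = boxes[scores.argmax().item()]
```

In the full framework, such bottom-up scores would be combined with the top-down, query-specific attention from the QAM module through the similarity fusion (SF) module rather than used in isolation.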