Weakly supervised visual grounding aims to predict the region in an image that corresponds to a given linguistic query, where the mapping between the query and its target object is not available during training. The state-of-the-art method uses a vision-language pre-training model to obtain Grad-CAM heatmaps that match each query word to image regions, and ranks region proposals with the combined heatmap. In this paper, we propose two simple but effective ways to improve this approach. First, we introduce a target-aware cropping strategy that encourages the model to learn both object-level and scene-level semantic representations. Second, we apply dependency parsing to extract the words related to the target object and give these words greater weight when combining the heatmaps. Our method surpasses the previous state-of-the-art on RefCOCO, RefCOCO+, and RefCOCOg by a notable margin.
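To make the second idea concrete, below is a minimal sketch of extracting target-related words from a referring expression with an off-the-shelf dependency parser. spaCy, the helper name, and the chosen modifier labels are illustrative assumptions, not the paper's actual implementation:

```python
import spacy

# spaCy stands in here for whatever parser the paper uses.
nlp = spacy.load("en_core_web_sm")

def extract_target_words(query: str) -> list[str]:
    """Return the head noun of a referring expression plus the
    modifiers attached to it, e.g. 'cup' and 'red' from
    'the red cup on the left of the table'."""
    doc = nlp(query)
    # In a noun-phrase query the syntactic root is usually the target noun.
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    targets = [root.text]
    # Keep modifiers attached directly to the root noun; this label set
    # is an assumption, chosen to catch common attribute words.
    targets += [tok.text for tok in root.children
                if tok.dep_ in ("amod", "compound", "nummod")]
    return targets

print(extract_target_words("the red cup on the left of the table"))
# -> ['cup', 'red']
```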
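Given such target words, one plausible way to realize the emphasized heatmap combination and the proposal ranking is sketched below with NumPy. The uniform-versus-alpha weighting scheme and the mean-energy box score are assumptions for illustration, not values from the paper:

```python
import numpy as np

def combine_heatmaps(word_heatmaps, words, target_words, alpha=2.0):
    """Weighted average of per-word Grad-CAM heatmaps.
    word_heatmaps: (num_words, H, W) array, one map per query word.
    alpha: hypothetical emphasis factor for target-related words."""
    weights = np.array([alpha if w in target_words else 1.0 for w in words])
    # Contract the word axis: sum_i weights[i] * word_heatmaps[i].
    return np.tensordot(weights, word_heatmaps, axes=1) / weights.sum()

def rank_proposals(heatmap, proposals):
    """Score each box (x1, y1, x2, y2) by the mean heatmap energy inside
    it and return the proposals sorted best-first."""
    scores = [heatmap[y1:y2, x1:x2].mean() for (x1, y1, x2, y2) in proposals]
    order = sorted(range(len(proposals)), key=scores.__getitem__, reverse=True)
    return [proposals[i] for i in order]

# Example usage with random maps standing in for Grad-CAM outputs.
heatmaps = np.random.rand(4, 224, 224)          # one map per query word
words = ["the", "red", "cup", "left"]
combined = combine_heatmaps(heatmaps, words, target_words=["cup", "red"])
print(rank_proposals(combined, [(10, 10, 60, 60), (100, 100, 200, 200)]))
```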