Weakly supervised phrase grounding aims to learn region-phrase correspondences using only image-sentence pairs. The major challenge is that the links between image regions and sentence phrases are missing during training. To address this challenge, we leverage a generic object detector at training time and propose a contrastive learning framework that accounts for both region-phrase and image-sentence matching. Our core innovation is to learn a region-phrase score function, on top of which an image-sentence score function is constructed. Importantly, the region-phrase score function is learned by distilling from soft matching scores between the detected object class names and the candidate phrases within an image-sentence pair, while the image-sentence score function is supervised by ground-truth image-sentence pairs. This design removes the need for object detection at test time, thereby significantly reducing the inference cost. Without bells and whistles, our approach achieves state-of-the-art results on the task of visual phrase grounding, surpassing previous methods that require expensive object detectors at test time.
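To make the two score functions concrete, the following is a minimal NumPy sketch of how a region-phrase score, an image-sentence score built on top of it, and the two training signals could fit together. The abstract does not specify any functional forms, so everything here is an assumption made for illustration: cosine similarity as the region-phrase score, max-over-regions followed by mean-over-phrases as the aggregation, an InfoNCE-style contrastive loss, a cross-entropy distillation loss, the temperature value, and all function names. Random arrays stand in for learned features and for the detector-derived soft matching scores.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def region_phrase_scores(regions, phrases):
    """Assumed form: cosine similarity, shape (n_phrases, n_regions)."""
    return l2_normalize(phrases) @ l2_normalize(regions).T

def image_sentence_score(regions, phrases):
    """Assumed aggregation: best region per phrase, averaged over phrases."""
    return region_phrase_scores(regions, phrases).max(axis=1).mean()

def distillation_loss(scores, soft_targets, temperature=0.1):
    """Cross-entropy between teacher soft targets (a distribution over regions
    for each phrase) and the softmax of the learned region-phrase scores."""
    log_p = np.log(softmax(scores / temperature, axis=1) + 1e-12)
    return float(-(soft_targets * log_p).sum(axis=1).mean())

def contrastive_loss(score_matrix, temperature=0.1):
    """InfoNCE-style loss: score_matrix[i, j] scores image i against sentence j,
    and the diagonal holds the ground-truth image-sentence pairs."""
    log_p = np.log(softmax(score_matrix / temperature, axis=1) + 1e-12)
    return float(-np.mean(np.diag(log_p)))

# Toy data: random features stand in for learned region and phrase embeddings.
rng = np.random.default_rng(0)
batch, n_regions, n_phrases, dim = 3, 5, 2, 8
images = [rng.normal(size=(n_regions, dim)) for _ in range(batch)]
sentences = [rng.normal(size=(n_phrases, dim)) for _ in range(batch)]

# Image-sentence supervision: contrast matched pairs against mismatched ones.
score_matrix = np.array(
    [[image_sentence_score(r, p) for p in sentences] for r in images]
)
loss_img_sent = contrastive_loss(score_matrix)

# Region-phrase supervision: random logits stand in for the soft matching
# scores between detected class names and phrases (training time only).
teacher_logits = rng.normal(size=(n_phrases, n_regions))
soft_targets = softmax(teacher_logits, axis=1)
pair_scores = region_phrase_scores(images[0], sentences[0])
loss_distill = distillation_loss(pair_scores, soft_targets)

# Test time: ground each phrase to its highest-scoring region. No detector
# class names are needed here, only the learned score function.
grounded = pair_scores.argmax(axis=1)
print(loss_img_sent, loss_distill, grounded)
```

In this sketch the teacher signal appears only inside `distillation_loss`, so inference reduces to one similarity computation and an `argmax` per phrase, which is the property the abstract credits for the lower inference cost. How candidate regions are obtained at test time is not stated in the abstract and is left outside the sketch.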