In this paper, we address the problem of referring expression comprehension in videos, which is challenging due to complex expressions and scene dynamics. Unlike previous methods that solve the problem in multiple stages (i.e., tracking and proposal-based matching), we tackle it from a novel perspective, \textbf{co-grounding}, with an elegant one-stage framework. We enhance single-frame grounding accuracy through semantic attention learning and improve cross-frame grounding consistency through co-grounding feature learning. Semantic attention learning explicitly parses referring cues into different attributes to reduce the ambiguity of complex expressions. Co-grounding feature learning boosts visual feature representations by integrating temporal correlation to reduce the ambiguity caused by scene dynamics. Experimental results demonstrate the superiority of our framework on the video grounding datasets VID and LiOTB, generating accurate and stable results across frames. Our model is also applicable to referring expression comprehension in images, as illustrated by the improved performance on the RefCOCO dataset. Our project is available at https://sijiesong.github.io/co-grounding.