We investigate the problem of video Referring Expression Comprehension (REC), which aims to localize the referent object described by a sentence to visual regions in the video frames. Despite recent progress, existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects. To address these issues, we propose a novel Dual Correspondence Network (dubbed DCNet) that explicitly enhances dense associations in both the inter-frame and cross-modal manners. First, we build inter-frame correlations for all instances appearing in the frames. Specifically, we compute the inter-frame patch-wise cosine similarity to estimate the dense alignment and then perform inter-frame contrastive learning to map aligned patches close in the feature space. Second, we build a fine-grained patch-word alignment to associate each patch with certain words. Due to the lack of such detailed annotations, we also estimate the patch-word correspondence through cosine similarity. Extensive experiments demonstrate that our DCNet achieves state-of-the-art performance on both video and image REC benchmarks. Furthermore, we conduct comprehensive ablation studies and thorough analyses to explore the optimal model designs. Notably, our inter-frame and cross-modal contrastive losses are plug-and-play and applicable to any video REC architecture; for example, building on top of Co-grounding, we obtain a 1.48% absolute improvement in Accu.@0.5 on the VID-Sentence dataset.
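The following is a minimal PyTorch-style sketch of the two correspondence signals described above: inter-frame patch contrast with pseudo alignments from patch-wise cosine similarity, and soft patch-word alignment from cross-modal cosine similarity. The function names, feature shapes, and temperature are illustrative assumptions and do not reproduce the paper's actual implementation.

```python
# A minimal sketch (not the authors' code) of the two dense-correspondence ideas,
# assuming per-frame patch features of shape [num_patches, dim] and
# word features of shape [num_words, dim]; all names and sizes are hypothetical.
import torch
import torch.nn.functional as F

def inter_frame_contrastive_loss(patches_a, patches_b, temperature=0.07):
    """Contrast patches of frame A against patches of frame B.

    The pseudo ground-truth alignment is taken as the most similar patch in the
    other frame, estimated via patch-wise cosine similarity.
    """
    a = F.normalize(patches_a, dim=-1)   # [Na, D]
    b = F.normalize(patches_b, dim=-1)   # [Nb, D]
    sim = a @ b.t()                      # [Na, Nb] cosine similarities
    targets = sim.argmax(dim=1)          # pseudo-aligned patch indices
    return F.cross_entropy(sim / temperature, targets)

def patch_word_alignment(patches, words):
    """Estimate a soft patch-word correspondence from cosine similarity."""
    p = F.normalize(patches, dim=-1)     # [Np, D]
    w = F.normalize(words, dim=-1)       # [Nw, D]
    return (p @ w.t()).softmax(dim=-1)   # [Np, Nw] soft alignment

# Toy usage with random features (dimensions are illustrative only).
frame1 = torch.randn(49, 256)
frame2 = torch.randn(49, 256)
sentence = torch.randn(8, 256)
loss = inter_frame_contrastive_loss(frame1, frame2)
align = patch_word_alignment(frame1, sentence)
print(loss.item(), align.shape)
```

Both losses operate only on features, which is consistent with the abstract's claim that they are plug-and-play and can be attached to other video REC architectures.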