Given a collection of untrimmed and unsegmented videos, video corpus moment retrieval (VCMR) is the task of retrieving a temporal moment (i.e., a fraction of a video) that semantically corresponds to a given text query. Since video and text come from two distinct feature spaces, there are two general approaches to VCMR: (i) encode each modality separately, then align the two modality representations for query processing; and (ii) adopt fine-grained cross-modal interaction to learn multi-modal representations for query processing. While the second approach often leads to better retrieval accuracy, the first is far more efficient. In this paper, we propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR. We adopt the first approach and introduce two contrastive learning objectives that refine the video encoder and the text encoder to learn video and text representations separately, but with better cross-modal alignment. Video contrastive learning (VideoCL) maximizes the mutual information between the query and the candidate video at the video level. Frame contrastive learning (FrameCL) highlights, at the frame level, the moment region within a video that corresponds to the query. Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines that adopt cross-modal interaction learning.
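To make the two objectives concrete, the sketch below implements InfoNCE-style contrastive losses in PyTorch. The abstract does not give the exact formulations, so everything here is an assumption for illustration: the cosine similarity with a temperature, the pooling of query and video into single vectors, and the multi-positive treatment of in-moment frames in `frame_cl_loss` are choices made for this sketch, not the paper's definitive losses.

```python
import torch
import torch.nn.functional as F


def video_cl_loss(query_emb, video_embs, pos_idx, temperature=0.07):
    """Video-level contrastive loss (an InfoNCE-style stand-in for VideoCL).

    query_emb:  (d,)   pooled text-query representation
    video_embs: (N, d) pooled representations of N candidate videos
    pos_idx:    index of the video that matches the query
    """
    # Cosine similarity between the query and every candidate video.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), video_embs) / temperature  # (N,)
    # Cross-entropy over candidates is -log softmax probability of the
    # positive video, i.e. the InfoNCE lower bound on query-video
    # mutual information.
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([pos_idx]))


def frame_cl_loss(query_emb, frame_embs, moment_mask, temperature=0.07):
    """Frame-level contrastive loss (a multi-positive stand-in for FrameCL).

    frame_embs:  (T, d) frame representations of one video
    moment_mask: (T,)   bool, True for frames inside the annotated moment
                        (assumed to contain at least one True entry)
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), frame_embs) / temperature  # (T,)
    # -log( sum_{in-moment} exp(s) / sum_{all frames} exp(s) ),
    # computed stably via logsumexp: pull in-moment frames toward the
    # query, push the remaining frames away.
    return torch.logsumexp(sims, dim=0) - torch.logsumexp(sims[moment_mask], dim=0)
```

Treating every frame inside the ground-truth moment as a positive is one simple way to "highlight the moment region"; the actual FrameCL objective may differ in how it samples negatives and aggregates frames.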