Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning-oriented datasets such as MSVD, MSR-VTT and VATEX. A key property of these datasets is that videos are assumed to be temporally pre-trimmed and short, while the provided captions describe the gist of the video content well. Consequently, for a given video-caption pair, the video is assumed to be fully relevant to the caption. In reality, however, queries are not known a priori, so pre-trimmed video clips may not contain sufficient content to fully meet the query. This suggests a gap between the literature and the real world. To fill the gap, we propose in this paper a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR). An untrimmed video is considered partially relevant w.r.t. a given textual query if it contains a moment relevant to the query. PRVR aims to retrieve such partially relevant videos from a large collection of untrimmed videos. PRVR differs from single video moment retrieval and video corpus moment retrieval, as the latter two retrieve moments rather than untrimmed videos. We formulate PRVR as a multiple instance learning (MIL) problem, where a video is simultaneously viewed as a bag of video clips and a bag of video frames. Clips and frames represent video content at different time scales. We propose a Multi-Scale Similarity Learning (MS-SL) network that jointly learns clip-scale and frame-scale similarities for PRVR. Extensive experiments on three datasets (TVR, ActivityNet Captions, and Charades-STA) demonstrate the viability of the proposed method. We also show that our method can be used to improve video corpus moment retrieval.
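The MIL view above can be illustrated with a minimal sketch. This is not the MS-SL network: it assumes precomputed query, clip, and frame embeddings, and it assumes that each bag is scored by its best-matching instance (max-pooling over cosine similarities) and that the two scales are combined by a weighted sum. The abstract does not specify either choice, and the names `bag_similarity`, `video_score`, `rank_videos`, and `clip_weight` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bag_similarity(query, instances):
    """Assumed MIL pooling: a bag is as relevant as its most relevant instance."""
    return max(cosine(query, x) for x in instances)


def video_score(query, clips, frames, clip_weight=0.5):
    """Combine clip-scale and frame-scale similarities (assumed weighted sum)."""
    clip_sim = bag_similarity(query, clips)
    frame_sim = bag_similarity(query, frames)
    return clip_weight * clip_sim + (1.0 - clip_weight) * frame_sim


def rank_videos(query, videos):
    """Rank untrimmed videos by score; `videos` maps a name to (clips, frames)."""
    scores = {name: video_score(query, c, f) for name, (c, f) in videos.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: video "a" contains one moment matching the query, video "b" does not.
query = [1.0, 0.0]
videos = {
    "a": ([[0.0, 1.0], [1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]),
    "b": ([[0.0, 1.0]], [[0.0, 1.0]]),
}
ranking = rank_videos(query, videos)
```

In this toy setup, video "a" ranks first because a single relevant instance is enough to make the whole bag relevant, which is the sense in which an untrimmed video can be partially relevant to a query.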