Video-Text Pre-training (VTP) aims to learn transferable representations for various downstream tasks from large-scale web videos. To date, almost all existing VTP methods are limited to retrieval-based downstream tasks, e.g., video retrieval, whereas their transfer potential on localization-based tasks, e.g., temporal grounding, remains under-explored. In this paper, we experimentally analyze and demonstrate the incompatibility of current VTP methods with localization tasks, and propose a novel Localization-oriented Video-Text Pre-training framework, dubbed LocVTP. Specifically, we perform fine-grained contrastive alignment as a complement to the coarse-grained one via a clip-word correspondence discovery scheme. To further enhance the temporal reasoning ability of the learned features, we propose a context projection head and a temporal-aware contrastive loss to perceive contextual relationships. Extensive experiments on four downstream tasks across six datasets demonstrate that our LocVTP achieves state-of-the-art performance on both retrieval-based and localization-based tasks. Furthermore, we conduct comprehensive ablation studies and thorough analyses to explore the optimal model designs and training strategies.
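As a rough illustration of the fine-grained alignment idea described above (a minimal sketch, not the authors' released implementation), the following PyTorch snippet pairs each word with its most similar clip via a correspondence-discovery step and applies an InfoNCE loss over the matched pairs; the tensor shapes, function name, and temperature value are illustrative assumptions.

```python
# Minimal sketch of fine-grained clip-word contrastive alignment.
# Shapes, names, and the temperature are assumptions for illustration only.
import torch
import torch.nn.functional as F

def fine_grained_contrastive_loss(clip_feats, word_feats, temperature=0.07):
    """clip_feats: (B, Nc, D) per-clip video features
       word_feats: (B, Nw, D) per-word text features"""
    clip_feats = F.normalize(clip_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)

    # Correspondence discovery: for every word, pick its most similar clip.
    sim = torch.einsum('bwd,bcd->bwc', word_feats, clip_feats)   # (B, Nw, Nc)
    matched_clip = sim.max(dim=-1).indices                       # (B, Nw)

    # Gather the matched clip feature for each word.
    idx = matched_clip.unsqueeze(-1).expand(-1, -1, clip_feats.size(-1))
    paired_clips = torch.gather(clip_feats, 1, idx)              # (B, Nw, D)

    # InfoNCE over word / matched-clip pairs, negatives drawn across the batch.
    words = word_feats.reshape(-1, word_feats.size(-1))          # (B*Nw, D)
    clips = paired_clips.reshape(-1, paired_clips.size(-1))      # (B*Nw, D)
    logits = words @ clips.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Example usage with random features (B=4, Nc=8 clips, Nw=12 words, D=256).
loss = fine_grained_contrastive_loss(torch.randn(4, 8, 256), torch.randn(4, 12, 256))
```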