Video-language pre-training models have recently brought significant improvements to various multi-modal downstream tasks. Dominant prior works mainly adopt contrastive learning to achieve global feature alignment across modalities. However, the local associations between videos and texts are left unmodeled, restricting the generality of pre-trained models, especially for tasks that require temporal video boundaries for given query texts. This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment, such that the trained model can accurately perceive temporal boundaries in videos given a text description. Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given a text description, and text localization, which matches the subset of texts with the video features. To produce temporal boundaries, frame features from several videos are manually merged into a long video sequence that interacts with a text sequence. With the localization task, our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances within a single modality. Notably, comprehensive experimental results show that our method significantly improves state-of-the-art performance on various benchmarks, covering text-to-video retrieval, video question answering, video captioning, temporal action localization, and temporal moment retrieval. The code will be released soon.
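To make the moment-retrieval pretext task concrete, the following is a minimal PyTorch sketch of the setup described above: frame features from several clips are concatenated into one long video sequence, fused with a text query, and per-frame start/end boundary logits are predicted. The module name, dimensions, and the single cross-attention layer are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class MomentRetrievalHead(nn.Module):
    """Predicts per-frame start/end boundary logits for a text query.

    A hypothetical head for illustration; the paper's model may differ.
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        # Frame features attend to the text query tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Two logits per frame: one for "start here", one for "end here".
        self.boundary_head = nn.Linear(dim, 2)

    def forward(self, frame_feats: torch.Tensor, text_feats: torch.Tensor):
        # frame_feats: (B, T, dim) long video sequence; text_feats: (B, L, dim)
        fused, _ = self.cross_attn(frame_feats, text_feats, text_feats)
        logits = self.boundary_head(fused)           # (B, T, 2)
        start_logits, end_logits = logits.unbind(-1)
        return start_logits, end_logits              # each (B, T)


# Build one long sequence by concatenating frame features from several
# clips, as the pretext task requires; the query text matches only one
# clip, so its ground-truth boundaries fall inside that clip's span.
clips = [torch.randn(1, t, 512) for t in (8, 12, 6)]  # three short clips
long_video = torch.cat(clips, dim=1)                  # (1, 26, 512)
text = torch.randn(1, 10, 512)                        # encoded query text

head = MomentRetrievalHead()
start_logits, end_logits = head(long_video, text)
start, end = start_logits.argmax(-1), end_logits.argmax(-1)
print(start.item(), end.item())  # predicted temporal boundary indices
```

In training, the start/end logits would be supervised with the known span of the clip that matches the query, which is what forces fine-grained frame-word alignment rather than only a global video-text score.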