Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms for video-language pre-training, and thus do not fully exploit the unique characteristic of video, namely its temporal dimension. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representations. In addition, the inherent temporal relations are captured by aligning video-text pairs as a whole at different time resolutions through a multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label), with improvements of 8.6% and 11.1%, respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
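The abstract does not spell out how the shuffling test is computed. One plausible reading, offered here only as a minimal sketch and not as the authors' protocol, is to score a model on each video with its frames in the original order and again with the frames randomly permuted, then report the gap. The names `shuffle_frames`, `shuffling_gap`, `order_sensitive`, and `order_insensitive` are hypothetical, and integers stand in for frame features.

```python
import random
from typing import Callable, List, Sequence

# Assumption: each "video" is a sequence of integers standing in for frame features.
Frames = Sequence[int]


def shuffle_frames(frames: Frames, rng: random.Random) -> List[int]:
    """Return a copy of the frames in a random temporal order."""
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def shuffling_gap(score_fn: Callable[[Frames], float],
                  videos: Sequence[Frames],
                  seed: int = 0) -> float:
    """Mean score on the original order minus mean score on shuffled order."""
    rng = random.Random(seed)
    original = sum(score_fn(v) for v in videos) / len(videos)
    shuffled = sum(score_fn(shuffle_frames(v, rng)) for v in videos) / len(videos)
    return original - shuffled


def order_sensitive(frames: Frames) -> float:
    """Toy scorer: fraction of adjacent frame pairs in increasing order."""
    pairs = list(zip(frames, frames[1:]))
    return sum(a < b for a, b in pairs) / len(pairs)


def order_insensitive(frames: Frames) -> float:
    """Toy scorer that ignores frame order entirely."""
    return float(sum(frames))


# Eight toy videos of 16 frames each, already in temporal order.
videos = [list(range(16)) for _ in range(8)]
gap_insensitive = shuffling_gap(order_insensitive, videos)
gap_sensitive = shuffling_gap(order_sensitive, videos)
print(gap_insensitive, gap_sensitive)
```

Under this reading, a gap near zero suggests that the scorer (or the dataset it is evaluated on) can be handled without temporal information, while a large gap indicates reliance on frame order.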