Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to the paired video clips in a common feature space. For long videos, given a paragraph of description whose sentences describe different segments of the video, matching all sentence-clip pairs aligns the paragraph and the full video implicitly. However, such a unit-level similarity measure may ignore the global temporal context over a long time span, which inevitably limits the generalization ability. In this paper, we propose a contrastive learning framework, TempCLR, to compare the full video and the paragraph explicitly. As the video/paragraph is formulated as a sequence of clips/sentences, under the constraint of their temporal order, we use dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs as the sequence-level distance. To explore the temporal dynamics, we break the consistency of temporal order by shuffling video clips or sentences according to the temporal granularity. In this way, the learned clip/sentence representations perceive temporal information and thus facilitate sequence alignment. Beyond pre-training on videos and paragraphs, our approach also generalizes to matching between different video instances. We evaluate our approach on video retrieval, action step localization, and few-shot action recognition, and achieve consistent performance gains over all three tasks. Detailed ablation studies are provided to justify the design choices.
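To make the sequence-level distance concrete, the sketch below shows a standard dynamic-time-warping recurrence applied to a sentence-clip cost matrix. It is a minimal illustration, not the paper's released implementation: the names `sent_emb`, `clip_emb`, and `cost`, and the choice of 1 minus cosine similarity as the pairwise cost, are assumptions for the example.

```python
# Minimal sketch: DTW over a sentence-clip cost matrix as a sequence-level distance.
# Assumed setup (not from the paper's code): `cost[i, j]` = 1 - cosine similarity
# between sentence i and clip j, with rows/columns in temporal order.
import numpy as np

def dtw_distance(cost: np.ndarray) -> float:
    """Minimum cumulative cost over sentence-clip pairs under the temporal-order constraint."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Each step moves forward in the sentence sequence, the clip sequence,
            # or both, so the alignment path preserves temporal order.
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # advance sentence only
                                                 acc[i, j - 1],      # advance clip only
                                                 acc[i - 1, j - 1])  # match and advance both
    return float(acc[n, m])

# Hypothetical usage with L2-normalized sentence and clip embeddings.
sent_emb = np.random.randn(4, 512)
sent_emb /= np.linalg.norm(sent_emb, axis=1, keepdims=True)
clip_emb = np.random.randn(6, 512)
clip_emb /= np.linalg.norm(clip_emb, axis=1, keepdims=True)
cost = 1.0 - sent_emb @ clip_emb.T       # 1 - cosine similarity
print(dtw_distance(cost))                # sequence-level distance usable in a contrastive loss
```

In a contrastive setup, this distance would be computed for the matched video-paragraph pair and for negatives (e.g., sequences with shuffled clips or sentences), so that the loss rewards alignments that respect the global temporal order.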