Temporal Contrastive Graph for Self-supervised Video Representation Learning

To fully exploit the temporal diversity and the global-local chronological characteristics of videos for self-supervised video representation learning, this work takes advantage of the temporal structure of videos and proposes a novel self-supervised method named Temporal Contrastive Graph (TCG). In contrast to existing methods that randomly shuffle the frames or snippets within a video, TCG is rooted in a hybrid graph contrastive learning strategy that treats the inter-snippet and intra-snippet temporal relationships as self-supervision signals for temporal representation learning. To increase the temporal diversity of features, TCG integrates prior knowledge about frame and snippet orders into temporal contrastive graphs, i.e., the intra-snippet and inter-snippet temporal contrastive graph modules. By randomly removing edges and masking node features of the intra-snippet or inter-snippet graphs, TCG generates different correlated graph views. Specific contrastive losses are then designed to maximize the agreement between node embeddings in different views. To learn the global context representation and adaptively recalibrate the channel-wise features, we introduce an adaptive video snippet order prediction module, which leverages the relational knowledge among video snippets to predict the actual snippet orders. Experimental results demonstrate the superiority of TCG over state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.
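The view generation and agreement objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`drop_edges`, `mask_features`, `make_view`, `encode`, `node_contrastive_loss`), the drop and mask probabilities, the temperature, the one-step propagation used as an encoder, and the InfoNCE-style form of the loss are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np

def drop_edges(adj, drop_prob, rng):
    """Randomly remove undirected edges from a symmetric 0/1 adjacency matrix."""
    n = adj.shape[0]
    upper = np.triu(adj, k=1)                   # each undirected edge appears once
    keep = rng.random((n, n)) >= drop_prob      # True where the edge survives
    upper = upper * keep
    return upper + upper.T                      # restore symmetry

def mask_features(x, mask_prob, rng):
    """Zero out randomly chosen feature dimensions for every node."""
    keep = rng.random(x.shape[1]) >= mask_prob
    return x * keep

def make_view(adj, x, drop_prob, mask_prob, rng):
    """One correlated graph view: edge removal plus node feature masking."""
    return drop_edges(adj, drop_prob, rng), mask_features(x, mask_prob, rng)

def encode(adj, x):
    """Toy stand-in for a graph encoder (assumption): one propagation step with self-loops."""
    return (adj + np.eye(adj.shape[0])) @ x

def node_contrastive_loss(z1, z2, temperature=0.5):
    """InfoNCE-style loss (assumption): node i in view 1 should agree with node i in view 2."""
    z1 = z1 / (np.linalg.norm(z1, axis=1, keepdims=True) + 1e-8)
    z2 = z2 / (np.linalg.norm(z2, axis=1, keepdims=True) + 1e-8)
    logits = z1 @ z2.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Hypothetical toy input: 4 frames of one snippet linked in chronological order.
rng = np.random.default_rng(0)
adj = np.diag(np.ones(3), k=1)
adj = adj + adj.T
x = rng.normal(size=(4, 8))                     # 8-dimensional node features

adj1, x1 = make_view(adj, x, drop_prob=0.2, mask_prob=0.2, rng=rng)
adj2, x2 = make_view(adj, x, drop_prob=0.2, mask_prob=0.2, rng=rng)
loss = node_contrastive_loss(encode(adj1, x1), encode(adj2, x2))
```

In this sketch the same routine would serve both the intra-snippet graph (nodes are frames) and the inter-snippet graph (nodes are snippets); only the node features and adjacency change.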
