Video self-supervised learning is a challenging task that requires significant expressive power from the model to leverage rich spatial-temporal knowledge and generate effective supervisory signals from large amounts of unlabeled videos. However, existing methods fail to increase the temporal diversity of unlabeled videos and neglect to explicitly model multi-scale temporal dependencies. To overcome these limitations, we take advantage of the multi-scale temporal dependencies within videos and propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL), which jointly models inter-snippet and intra-snippet temporal dependencies for temporal representation learning with a hybrid graph contrastive learning strategy. Specifically, a Spatial-Temporal Knowledge Discovering (STKD) module is first introduced to extract motion-enhanced spatial-temporal representations from videos based on frequency-domain analysis with the discrete cosine transform. To explicitly model the multi-scale temporal dependencies of unlabeled videos, our TCGL integrates prior knowledge about frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG). Then, specific contrastive learning modules are designed to maximize the agreement between nodes in different graph views. To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module, which leverages the relational knowledge among video snippets to learn a global context representation and adaptively recalibrate channel-wise features. Experimental results demonstrate the superiority of our TCGL over state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.
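To make the graph contrastive objective concrete, the following is a minimal PyTorch sketch of an InfoNCE-style loss that maximizes agreement between corresponding nodes in two views of a temporal contrastive graph. The function name `node_contrastive_loss`, the temperature `tau`, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the TCGL authors' code): embeddings of the same
# node in two graph views are pulled together, while all other nodes in the
# opposite view serve as negatives.
import torch
import torch.nn.functional as F

def node_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                          tau: float = 0.5) -> torch.Tensor:
    """InfoNCE-style agreement between corresponding nodes of two graph views.

    z1, z2: (N, D) node embeddings from view 1 and view 2; row i of each
    tensor embeds the same frame/snippet node.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                 # (N, N) scaled cosine similarities
    targets = torch.arange(z1.size(0))      # node i in view 1 matches node i in view 2
    # Symmetrize so both views receive a gradient signal.
    return 0.5 * (F.cross_entropy(sim, targets) +
                  F.cross_entropy(sim.t(), targets))

# Usage: the intra-snippet and inter-snippet graphs would each yield two
# views; here random tensors stand in for the learned node embeddings.
z_a = torch.randn(16, 128)   # e.g., 16 nodes, 128-d embeddings
z_b = torch.randn(16, 128)
loss = node_contrastive_loss(z_a, z_b)
```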