Contrastive learning has nearly closed the gap between supervised and self-supervised learning of image representations. Existing extensions of contrastive learning to video data, however, do not explicitly attempt to capture the internal distinctiveness along the temporal dimension of video clips. We develop a new temporal contrastive learning framework consisting of two novel losses that improves upon existing contrastive self-supervised video representation learning methods. The first loss adds the task of discriminating between non-overlapping clips from the same video, whereas the second loss aims to discriminate between timesteps of the feature map of an input clip in order to increase the temporal diversity of the features. Temporal contrastive learning achieves significant improvements over state-of-the-art results on downstream video understanding tasks such as action recognition, limited-label action classification, and nearest-neighbor video retrieval, across multiple video datasets and 3D CNN architectures. With the commonly used 3D-ResNet-18 architecture, we achieve 82.4% top-1 accuracy on UCF101 (+5.1% over the previous best) and 52.9% on HMDB51 (+5.4%) for action classification, and 56.2% top-1 recall (+11.7%) on UCF101 nearest-neighbor video retrieval.
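To make the first loss concrete, below is a minimal sketch of an InfoNCE-style objective that discriminates between non-overlapping clips from the same video: two augmented views of the same clip form a positive pair, while the other clips from that video serve as negatives. The function name, shapes, and temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(clip_feats_a, clip_feats_b, temperature=0.1):
    """Sketch of a loss discriminating non-overlapping clips of one video.

    clip_feats_a, clip_feats_b: (num_clips, dim) embeddings of two augmented
    views of the same num_clips non-overlapping clips from a single video.
    Matching rows (same clip, different augmentation) are positives; the
    remaining clips of the video act as negatives, forcing the encoder to
    separate different temporal segments of the same video.
    """
    a = F.normalize(clip_feats_a, dim=1)
    b = F.normalize(clip_feats_b, dim=1)
    logits = a @ b.t() / temperature                     # (num_clips, num_clips) similarities
    targets = torch.arange(a.size(0), device=a.device)   # positives lie on the diagonal
    # Symmetric InfoNCE over the two augmented views
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The second loss described above can be sketched analogously by treating the timesteps of a clip's feature map, rather than whole clips, as the units being contrasted.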