Contrastive learning has nearly closed the gap between supervised and self-supervised learning of image representations. Existing extensions of contrastive learning to video data, however, rely on a naive transposition of ideas from image-based methods and do not fully exploit the temporal dimension present in video. We develop a new temporal contrastive learning framework consisting of two novel losses that improve upon existing contrastive self-supervised video representation learning methods. The first loss adds the task of discriminating between non-overlapping clips from the same video; the second loss aims to discriminate between timesteps of the feature map of an input clip in order to increase the temporal diversity of the features. Temporal contrastive learning achieves significant improvements over state-of-the-art results on downstream video understanding tasks such as action recognition, limited-label action classification, and nearest-neighbor video retrieval, across multiple video datasets and 3D CNN architectures. With the commonly used 3D-ResNet-18 architecture, we achieve 82.4% top-1 accuracy on UCF101 (+5.1% over the previous best) and 52.9% on HMDB51 action classification (+5.4%), and 56.2% top-1 recall on UCF101 nearest-neighbor video retrieval (+11.7%).
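To make the two losses concrete, here is a minimal PyTorch-style sketch of how they could be instantiated with an InfoNCE objective. This is an illustration based only on the description above, not the authors' implementation; the function names (`info_nce`, `local_local_loss`, `temporal_diversity_loss`), the temperature `tau`, the `encoder`, and all tensor shapes are hypothetical assumptions.

```python
# Illustrative sketch of the two temporal contrastive losses described in the
# abstract. Not the authors' code: encoder, tau, and shapes are assumptions.
import torch
import torch.nn.functional as F


def info_nce(anchors, positives, tau=0.1):
    """Standard InfoNCE: each anchor's positive is the same-index row of
    `positives`; all other rows in the batch serve as negatives."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)


def local_local_loss(encoder, clip_a, clip_b, tau=0.1):
    """Loss 1 (sketch): two non-overlapping clips from the same video form a
    positive pair; clips from other videos in the batch are negatives."""
    return info_nce(encoder(clip_a), encoder(clip_b), tau)


def temporal_diversity_loss(feat_a, feat_b, tau=0.1):
    """Loss 2 (sketch): given temporal feature maps of shape (B, C, T) from
    two augmented views of the same clip, matching timesteps are positives
    and all other timesteps are negatives, pushing the features at different
    timesteps apart and increasing their temporal diversity."""
    B, C, T = feat_a.shape
    loss = 0.0
    for b in range(B):
        a = F.normalize(feat_a[b].t(), dim=1)     # (T, C), one row per timestep
        p = F.normalize(feat_b[b].t(), dim=1)
        logits = a @ p.t() / tau                  # (T, T) timestep similarities
        targets = torch.arange(T, device=a.device)
        loss = loss + F.cross_entropy(logits, targets)
    return loss / B
```

In this reading, the first loss treats clips from the same video as positives rather than only augmented copies of one clip, while the second loss deliberately makes timesteps of a clip's feature map distinguishable from one another, so the representation cannot collapse the temporal axis.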