Learning to recognize actions from only a handful of labeled videos is a challenging problem due to the scarcity of tediously collected activity labels. We approach this problem by learning a two-pathway temporal contrastive model from unlabeled videos played at two different speeds, leveraging the fact that changing the playback speed of a video does not change the action it depicts. Specifically, we propose to maximize the similarity between encoded representations of the same video at two different speeds, while minimizing the similarity between different videos played at different speeds. In this way, we exploit the rich supervisory signal in terms of `time' that is present in an otherwise unsupervised pool of videos. With this simple yet effective strategy of manipulating video playback rates, we considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods across multiple diverse benchmark datasets and network architectures. Interestingly, our proposed approach also benefits from out-of-domain unlabeled videos, demonstrating generalization and robustness. We further perform rigorous ablations and analysis to validate our approach. Project page: https://cvir.github.io/TCL/.
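The contrastive objective described above can be illustrated with a minimal sketch, assuming an InfoNCE-style formulation in PyTorch: representations of the same video from the fast and slow pathways form positive pairs, and representations of different videos form negatives. The function name, temperature value, and exact loss form are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch (not the authors' released code) of a two-pathway temporal
# contrastive loss: pull together embeddings of the same video at two playback
# speeds, push apart embeddings of different videos.
import torch
import torch.nn.functional as F


def temporal_contrastive_loss(z_fast: torch.Tensor,
                              z_slow: torch.Tensor,
                              temperature: float = 0.5) -> torch.Tensor:
    """z_fast, z_slow: (B, D) encoded representations of the SAME B videos,
    produced by the fast- and slow-speed pathways respectively."""
    z_fast = F.normalize(z_fast, dim=1)
    z_slow = F.normalize(z_slow, dim=1)

    # Cosine-similarity matrix between every fast clip and every slow clip.
    logits = z_fast @ z_slow.t() / temperature  # (B, B)

    # Diagonal entries are positives (same video, two speeds);
    # off-diagonal entries are negatives (different videos, different speeds).
    targets = torch.arange(z_fast.size(0), device=z_fast.device)

    # Symmetrized InfoNCE: fast-to-slow and slow-to-fast directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Random embeddings stand in for encoder outputs of a batch of 8 videos.
    B, D = 8, 128
    print(temporal_contrastive_loss(torch.randn(B, D), torch.randn(B, D)).item())
```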