Learning to recognize actions from only a handful of labeled videos is a challenging problem due to the scarcity of activity labels, which are tedious to collect. We approach this problem by learning a two-pathway temporal contrastive model from unlabeled videos played at two different speeds, leveraging the fact that changing the playback speed of a video does not change the action. Specifically, we propose to maximize the similarity between encoded representations of the same video at two different speeds and to minimize the similarity between different videos played at different speeds. In this way, we exploit the rich supervisory signal of 'time' that is present in an otherwise unsupervised pool of videos. With this simple yet effective strategy of manipulating video playback rates, we considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods across multiple diverse benchmark datasets and network architectures. Interestingly, our proposed approach also benefits from out-of-domain unlabeled videos, demonstrating generalization and robustness. We further perform rigorous ablations and analysis to validate our approach.
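To make the objective concrete, a minimal PyTorch-style sketch of such a speed-based contrastive loss is given below. This is an illustrative assumption of how the described criterion could be implemented, not the authors' released code; the names speed_contrastive_loss, encoder, and temperature are hypothetical.

import torch
import torch.nn.functional as F

def speed_contrastive_loss(z_slow, z_fast, temperature=0.1):
    """NT-Xent-style loss: the i-th slow-speed clip and the i-th fast-speed
    clip of the same video form a positive pair; all other pairings within
    the batch act as negatives.
    z_slow, z_fast: (B, D) encoded representations of the same B videos
    played at two different speeds."""
    z_slow = F.normalize(z_slow, dim=1)
    z_fast = F.normalize(z_fast, dim=1)
    logits = z_slow @ z_fast.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(z_slow.size(0), device=z_slow.device)
    # Symmetric cross-entropy: pull representations of the same video
    # (at different speeds) together, push different videos apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage sketch: two temporal subsamplings of the same unlabeled clip batch.
# fast_clip = clip[:, ::1]   # original playback rate
# slow_clip = clip[:, ::2]   # every other frame, i.e. doubled playback speed
# loss = speed_contrastive_loss(encoder(slow_clip), encoder(fast_clip))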