We present a novel technique for self-supervised video representation learning that: (a) decouples the learning objective into two contrastive subtasks respectively emphasizing spatial and temporal features, and (b) performs contrastive learning hierarchically to encourage multi-scale understanding. Motivated by their effectiveness in supervised learning, we first introduce decoupled spatial-temporal feature learning and hierarchical learning into unsupervised video learning. We show experimentally that augmentations can serve as regularization to guide the network toward desired semantics in contrastive learning, and we propose a way for the model to separately capture spatial and temporal features at multiple scales. We also introduce an approach to overcome the problem of divergent levels of instance invariance across hierarchies by modeling the invariance as loss weights for objective re-weighting. Experiments on the UCF101 and HMDB51 downstream action recognition benchmarks show that our proposed Hierarchically Decoupled Spatial-Temporal Contrast (HDC) substantially improves over directly learning spatial-temporal features as a whole and achieves competitive performance compared with other state-of-the-art unsupervised methods. Code will be made available.
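The objective described above can be illustrated with a minimal sketch: per-level spatial and temporal InfoNCE losses combined with per-level weights that model each hierarchy's degree of instance invariance. The function names, the data layout, and the specific weighting scheme are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss for one anchor: -log softmax of its similarity to the positive."""
    sims = np.array([anchor @ positive] + [anchor @ n for n in negatives]) / tau
    sims -= sims.max()  # numerical stability before exponentiation
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

def hdc_loss(levels, spatial_w, temporal_w):
    """Hypothetical sketch of a hierarchically decoupled contrastive objective.

    `levels` holds, per hierarchy level, (anchor, positive, negatives) tuples for
    the spatial and temporal subtasks; `spatial_w` / `temporal_w` are per-level
    weights standing in for the invariance-based re-weighting in the abstract.
    """
    total = 0.0
    for lvl, ws, wt in zip(levels, spatial_w, temporal_w):
        total += ws * info_nce(*lvl["spatial"]) + wt * info_nce(*lvl["temporal"])
    return total
```

A usage sketch: with random embeddings for two levels, `hdc_loss(levels, [1.0, 0.5], [0.5, 1.0])` returns a single scalar that could be backpropagated in an actual framework; here plain NumPy is used only to make the weighting structure concrete.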