The crux of self-supervised video representation learning is to build general features from unlabeled videos. However, most recent works focus on high-level semantics and neglect lower-level representations and their temporal relationships, both of which are crucial for general video understanding. To address these limitations, this paper proposes a multi-level feature optimization framework that improves the generalization and temporal modeling ability of learned video representations. Concretely, high-level features obtained from naive and prototypical contrastive learning are used to build distribution graphs, which guide low-level and mid-level feature learning. We also devise a simple temporal modeling module that operates on multi-level features to enhance motion pattern learning. Experiments demonstrate that multi-level feature optimization with the graph constraint and temporal modeling can substantially improve the representation ability in video understanding.
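The graph constraint can be illustrated with a minimal sketch. The abstract does not specify how the distribution graphs are built or which loss aligns them, so everything here is an assumption made for illustration: each graph is taken to be a row-wise softmax over pairwise cosine similarities within a batch, and the lower-level graph is pulled toward the high-level graph with a KL divergence. The function names `similarity_graph` and `graph_constraint_loss` and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def similarity_graph(features, temperature=0.1):
    """Assumed graph form: for each sample, a softmax distribution over its
    cosine similarities to every other sample in the batch.
    Requires at least two samples."""
    feats = [l2_normalize(f) for f in features]
    n = len(feats)
    graph = []
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(feats[i], feats[j])) / temperature
            for j in range(n)
            if j != i  # self-similarity is excluded
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        graph.append([e / total for e in exps])
    return graph


def graph_constraint_loss(high_feats, low_feats, temperature=0.1, eps=1e-12):
    """Mean KL(target || student) over rows, where the target graph comes
    from high-level features and the student graph from lower-level ones.
    The KL choice is an assumption of this sketch."""
    target = similarity_graph(high_feats, temperature)
    student = similarity_graph(low_feats, temperature)
    total = 0.0
    for p_row, q_row in zip(target, student):
        total += sum(
            p * math.log((p + eps) / (q + eps)) for p, q in zip(p_row, q_row)
        )
    return total / len(target)


# Toy batch of three clips with 2-D features.
high = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
low_aligned = [[2.0, 0.0], [1.8, 0.2], [0.0, 3.0]]   # same directions as `high`
low_shuffled = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]  # neighborhood structure differs

loss_aligned = graph_constraint_loss(high, low_aligned)
loss_shuffled = graph_constraint_loss(high, low_shuffled)
```

In this toy batch, the lower-level features that preserve the high-level neighborhood structure incur a near-zero loss, while the shuffled ones are penalized. In a real training setup the high-level graph would typically be treated as a fixed target (no gradient), though that too is an assumption rather than something the abstract states.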