Previous work on action representation learning has focused on global representations of short video clips. In contrast, many practical applications, such as video alignment, demand learning dense, frame-wise representations of long videos. In this paper, we introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representations in a self-supervised or weakly-supervised manner, especially for long videos. Specifically, we introduce a simple but effective video encoder that captures both spatial and temporal context by combining convolution and transformer layers. Inspired by recent progress in self-supervised learning, we propose a new sequence contrast loss (SCL) applied to two related views obtained through a series of spatio-temporal data augmentations, in two versions. The self-supervised version optimizes the embedding space by minimizing the KL-divergence between the sequence similarity of the two augmented views and a prior Gaussian distribution over timestamp distance. The weakly-supervised version builds additional sample pairs across videos that share video-level labels using dynamic time warping (DTW). Experiments on the FineGym, PennAction, and Pouring datasets show that our method outperforms the previous state-of-the-art by a large margin on downstream fine-grained action classification, with even faster inference. Surprisingly, although it is not trained on paired videos as in previous works, our self-supervised version also achieves outstanding performance on video alignment and fine-grained frame retrieval.
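The core idea of the sequence contrast loss can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, the temperature parameter `tau`, and the Gaussian width `sigma` are assumptions. It computes, for each frame of one augmented view, a softmax similarity distribution over the frames of the other view, and penalizes its KL-divergence from a Gaussian prior over timestamp distance.

```python
import numpy as np

def sequence_contrast_loss(z1, z2, t1, t2, sigma=1.0, tau=0.1):
    """Sketch of a sequence contrast loss (SCL), under assumed names/defaults.

    z1, z2: (N, D) L2-normalized frame embeddings of two augmented views.
    t1, t2: (N,) frame timestamps of each view.
    Returns the mean per-frame KL(prior || similarity) between the Gaussian
    prior over timestamp distance and the softmax similarity distribution.
    """
    # Scaled cosine similarity between all frame pairs, (N, N).
    sim = z1 @ z2.T / tau
    # Log-softmax over the frames of the second view (numerically stable).
    sim = sim - sim.max(axis=1, keepdims=True)
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Gaussian prior over squared timestamp distance, row-normalized.
    d2 = (t1[:, None] - t2[None, :]) ** 2
    g = np.exp(-d2 / (2 * sigma ** 2))
    prior = g / g.sum(axis=1, keepdims=True)
    # KL(prior || p), averaged over frames.
    eps = 1e-12
    kl = (prior * (np.log(prior + eps) - log_p)).sum(axis=1)
    return float(kl.mean())
```

In the weakly-supervised version, the same loss would be applied to frame pairs matched across different videos of the same action class, with the correspondence obtained by dynamic time warping instead of shared timestamps.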