Video representation learning is a vital problem for video classification. Recently, a promising unsupervised paradigm termed self-supervised learning has emerged, which explores inherent supervisory signals implied in massive data for feature learning by solving auxiliary tasks. However, existing methods in this regard suffer from two limitations when extended to video classification. First, they focus on only a single task, ignoring the complementarity among different task-specific features and thus yielding suboptimal video representations. Second, high computational and memory costs hinder their application in real-world scenarios. In this paper, we propose a graph-based distillation framework to address these problems: (1) We propose a logits graph and a representation graph to transfer knowledge from multiple self-supervised tasks, where the former distills classifier-level knowledge by solving a multi-distribution joint-matching problem, and the latter distills internal feature knowledge from pairwise ensembled representations while tackling the challenge of heterogeneity among different features; (2) The proposal adopts a teacher-student framework that dramatically reduces the redundancy of the knowledge learnt from the teachers, leading to a lighter student model that solves the classification task more efficiently. Experimental results on three video datasets validate that our proposal not only helps learn better video representations but also compresses the model for faster inference.
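To make the classifier-level transfer concrete, the sketch below shows a minimal multi-teacher logits-distillation loss in PyTorch: the student's softened class distribution is matched to each self-supervised teacher's distribution via a weighted KL divergence. This is only an illustrative simplification under assumed per-teacher weights and a shared temperature; the function name, `weights`, and `temperature` are hypothetical, and the paper's logits-graph formulation of multi-distribution joint matching is more involved than this plain weighted sum.

import torch
import torch.nn.functional as F

def multi_teacher_logits_distillation(student_logits, teacher_logits_list,
                                      weights=None, temperature=4.0):
    """Illustrative sketch (not the paper's exact objective): match the
    student's softened class distribution to several teachers' distributions.

    student_logits:      (batch, num_classes) logits from the light student.
    teacher_logits_list: list of (batch, num_classes) logits, one tensor per
                         self-supervised teacher task.
    weights:             assumed per-teacher weights; uniform by default.
    """
    if weights is None:
        weights = [1.0 / len(teacher_logits_list)] * len(teacher_logits_list)
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    loss = student_logits.new_zeros(())
    for w, t_logits in zip(weights, teacher_logits_list):
        p_teacher = F.softmax(t_logits / temperature, dim=1)
        # KL(teacher || student), scaled by T^2 as in standard distillation
        loss = loss + w * F.kl_div(log_p_student, p_teacher,
                                   reduction="batchmean") * temperature ** 2
    return loss

In this reading, each teacher is a network pretrained on one auxiliary task, and the weighted matching is what allows complementary task-specific knowledge to be fused into a single compact student.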