Despite the outstanding success of self-supervised pretraining methods for video representation learning, they generalise poorly when the unlabelled pretraining dataset is small or when the domain gap between the unlabelled source data (pretraining) and the labelled target data (finetuning) is significant. To mitigate these issues, we propose a novel approach that complements self-supervised pretraining with an auxiliary pretraining phase based on knowledge similarity distillation, auxSKD, for better generalisation with a significantly smaller amount of video data, e.g. Kinetics-100 rather than Kinetics-400. Our method deploys a teacher network that iteratively distils its knowledge to the student model by capturing the similarity information between segments of unlabelled video data, while the student model solves a pretext task that exploits this prior knowledge. We also introduce a novel pretext task, Video Segment Pace Prediction (VSPP), which requires the model to predict the playback speed of a randomly selected segment of the input video, yielding more reliable self-supervised representations. Our experiments show superior results to the state of the art on both the UCF101 and HMDB51 datasets in apples-to-apples comparisons when pretraining on Kinetics-100. Additionally, we show that adding auxSKD as an extra pretraining phase to recent state-of-the-art self-supervised methods (i.e. VCOP, VideoPace, and RSPNet) improves their results on UCF101 and HMDB51. Our code is available at https://github.com/Plrbear/auxSKD.
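To make the two components concrete, below is a minimal, hedged sketch of how they could be implemented; the function names, segment counts, pace set, and temperatures are illustrative assumptions, not the paper's exact code.

First, a similarity-based distillation loss in the spirit of auxSKD: the teacher's pairwise similarities between segment embeddings in a batch define a soft target distribution that the student is trained to match.

```python
import torch
import torch.nn.functional as F

def similarity_distillation_loss(student_emb, teacher_emb, tau_s=0.1, tau_t=0.04):
    """student_emb, teacher_emb: (B, D) embeddings of the same video segments.

    Sketch only: the exact similarity measure and temperatures are assumptions.
    """
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb, dim=1)
    B = s.size(0)
    off_diag = ~torch.eye(B, dtype=torch.bool, device=s.device)  # drop self-similarity
    logits_s = (s @ s.t())[off_diag].view(B, B - 1) / tau_s
    logits_t = (t @ t.t())[off_diag].view(B, B - 1) / tau_t
    # Student's similarity distribution is pulled towards the teacher's.
    return F.kl_div(F.log_softmax(logits_s, dim=1),
                    F.softmax(logits_t, dim=1),
                    reduction='batchmean')
```

Second, a possible VSPP-style clip sampler: one randomly chosen segment of the clip is re-sampled at a random playback pace, and the pace index (together with the segment position) provides the self-supervised label for a classification head.

```python
import random
import numpy as np

def vspp_sample(frames, num_segments=4, seg_len=4, paces=(1, 2, 4, 8)):
    """frames: (T, H, W, C) array of decoded video frames. Sketch under assumed defaults."""
    pace_idx = random.randrange(len(paces))
    seg_idx = random.randrange(num_segments)
    start = random.randrange(max(1, len(frames) - num_segments * seg_len * max(paces)))
    clip, cursor = [], start
    for s in range(num_segments):
        step = paces[pace_idx] if s == seg_idx else 1   # only the chosen segment is sped up
        idx = cursor + np.arange(seg_len) * step
        clip.append(frames[np.clip(idx, 0, len(frames) - 1)])
        cursor = int(idx[-1]) + 1
    return np.concatenate(clip), pace_idx, seg_idx      # clip plus pretext-task labels
```

In such a setup, the distillation loss would be used during the auxiliary pretraining phase, while the sampled clip and its pace label would feed a standard cross-entropy objective for the VSPP pretext task.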