We present XKD, a novel self-supervised framework for learning meaningful representations from unlabelled video clips. XKD is trained with two pseudo tasks. First, masked data reconstruction is performed to learn modality-specific representations. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through teacher-student setups to learn complementary information. To identify the most effective information to transfer, and to tackle the domain gap between the audio and visual modalities that could hinder knowledge transfer, we introduce a domain alignment strategy for effective cross-modal distillation. Lastly, to develop a general-purpose solution capable of handling both audio and visual streams, we introduce a modality-agnostic variant of our framework that uses the same backbone for both modalities. Our proposed cross-modal knowledge distillation improves linear-evaluation top-1 accuracy of video action classification by 8.4% on UCF101, 8.1% on HMDB51, 13.8% on Kinetics-Sound, and 14.2% on Kinetics400. Additionally, our modality-agnostic variant shows promising results towards a general-purpose network capable of handling different data streams. The code is released on the project website.
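The cross-modal teacher-student distillation described above can be sketched in a few lines. This is an illustrative sketch only, not the paper's exact objective: the negative cosine-similarity loss, the EMA teacher update, and the toy feature vectors are assumptions chosen because they are common in self-supervised distillation setups.

```python
import math

def cosine_distill_loss(student_feat, teacher_feat):
    # Negative cosine similarity: the student's representation of one
    # modality (e.g. audio) is pulled toward the teacher's representation
    # of the other modality (e.g. video). Loss is an assumed stand-in.
    dot = sum(s * t for s, t in zip(student_feat, teacher_feat))
    s_norm = math.sqrt(sum(s * s for s in student_feat))
    t_norm = math.sqrt(sum(t * t for t in teacher_feat))
    return -dot / (s_norm * t_norm)

def ema_update(teacher_weights, student_weights, momentum=0.99):
    # Teacher weights track the student via an exponential moving
    # average, as in typical teacher-student self-distillation.
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]

# Toy example: an audio student distilled toward a video teacher.
audio_student_feat = [0.2, 0.9, -0.1]
video_teacher_feat = [0.25, 0.8, 0.0]
loss = cosine_distill_loss(audio_student_feat, video_teacher_feat)
# loss lies in [-1, 0]; minimizing it aligns the two representations.
```

In practice, frameworks like XKD run two such setups in parallel (audio-to-video and video-to-audio), so each modality's student learns complementary information from the other modality's teacher.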