Cross-modal distillation has been widely used to transfer knowledge across modalities, enriching the representation of the target unimodal one. Recent studies closely tie the temporal synchronization between vision and sound to the semantic consistency required for cross-modal distillation. However, such synchronization-derived semantic consistency is hard to guarantee in unconstrained videos, due to irrelevant modality noise and differentiated semantic correlation. To this end, we first propose a \textit{Modality Noise Filter} (MNF) module to erase the irrelevant noise in the teacher modality with cross-modal context. After this purification, we then design a \textit{Contrastive Semantic Calibration} (CSC) module that adaptively distills useful knowledge for the target modality by referring to the differentiated sample-wise semantic correlation in a contrastive fashion. Extensive experiments show that our method brings a performance boost over other distillation methods on both the visual action recognition and video retrieval tasks. We also extend our method to the audio tagging task to demonstrate its generalization. The source code is available at \href{https://github.com/GeWu-Lab/cross-modal-distillation}{https://github.com/GeWu-Lab/cross-modal-distillation}.