Having access to multi-modal cues (e.g. vision and audio) empowers some cognitive tasks to be performed faster than learning from a single modality. In this work, we propose to transfer knowledge across heterogeneous modalities, even though these data modalities may not be semantically correlated. Rather than directly aligning the representations of different modalities, we compose audio, image, and video representations across modalities to uncover richer multi-modal knowledge. Our main idea is to learn a compositional embedding that closes the cross-modal semantic gap and captures the task-relevant semantics, which facilitates pulling together cross-modal representations via compositional contrastive learning. We establish a new, comprehensive multi-modal distillation benchmark on three video datasets: UCF101, ActivityNet, and VGGSound. Moreover, we demonstrate that our model significantly outperforms a variety of existing knowledge distillation methods in transferring audio-visual knowledge to improve video representation learning. Code is released at: https://github.com/yanbeic/CCL.
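To make the core idea concrete, below is a minimal sketch of a compositional contrastive objective: two modality embeddings are fused into a compositional embedding, which is then pulled toward the matching sample's representation and pushed away from other samples via an InfoNCE-style loss. The fusion network, temperature, and pairing choice here are illustrative assumptions, not the released CCL implementation.

```python
# Illustrative sketch only; see https://github.com/yanbeic/CCL for the actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionNet(nn.Module):
    """Fuses two modality embeddings into one compositional embedding (assumed MLP fusion)."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, a, b):
        return self.fuse(torch.cat([a, b], dim=-1))

def compositional_contrastive_loss(video, audio, comp_net, temperature=0.1):
    """InfoNCE-style loss: the composed (video, audio) embedding of clip i is
    pulled toward the video embedding of clip i and pushed from other clips."""
    comp = F.normalize(comp_net(video, audio), dim=-1)   # (B, D) compositional embeddings
    target = F.normalize(video, dim=-1)                  # (B, D) target embeddings
    logits = comp @ target.t() / temperature             # (B, B) similarity matrix
    labels = torch.arange(video.size(0), device=video.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Usage with random features standing in for backbone outputs.
B, D = 8, 128
video_feat, audio_feat = torch.randn(B, D), torch.randn(B, D)
comp_net = CompositionNet(D)
loss = compositional_contrastive_loss(video_feat, audio_feat, comp_net)
loss.backward()
```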