The popularity of multimodal sensors and the accessibility of the Internet have brought us a massive amount of unlabeled multimodal data. Since existing datasets and well-trained models are primarily unimodal, the modality gap between a unimodal network and unlabeled multimodal data poses an interesting problem: how to transfer a pre-trained unimodal network to perform the same task on unlabeled multimodal data? In this work, we propose multimodal knowledge expansion (MKE), a knowledge distillation-based framework to effectively utilize multimodal data without requiring labels. In contrast to traditional knowledge distillation, where the student is designed to be lightweight and inferior to the teacher, we observe that a multimodal student model consistently denoises pseudo labels and generalizes better than its teacher. Extensive experiments on four tasks and different modalities verify this finding. Furthermore, we connect the mechanism of MKE to semi-supervised learning and offer both empirical and theoretical explanations to understand the denoising capability of a multimodal student.
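To make the MKE idea concrete, the following is a minimal sketch of one training step: a frozen, pre-trained unimodal teacher pseudo-labels unlabeled data from its own modality, and a multimodal student is fit to those pseudo labels using all modalities. The module names (`teacher`, `student`), the soft-label KL objective, and the function signature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mke_step(teacher, student, optimizer, x_uni, x_multi):
    """One illustrative MKE training step (hypothetical interface).

    teacher: pre-trained unimodal network (kept frozen)
    student: multimodal network being trained
    x_uni:   the single modality the teacher was trained on
    x_multi: the full multimodal input for the same samples
    """
    with torch.no_grad():
        # Teacher produces soft pseudo labels from its own modality.
        pseudo = teacher(x_uni).softmax(dim=-1)
    # Student predicts from all available modalities.
    logits = student(x_multi)
    # Distillation loss against the teacher's pseudo labels
    # (a plain KL objective, assumed here for illustration).
    loss = F.kl_div(logits.log_softmax(dim=-1), pseudo,
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key departure from standard distillation is the direction of capacity: the student consumes strictly more input (all modalities) than its teacher, which is what lets it denoise the teacher's pseudo labels rather than merely compress them.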