Large neural networks are impractical to deploy on mobile devices due to their heavy computational cost and slow inference. Knowledge distillation (KD) reduces model size while retaining performance by transferring knowledge from a large "teacher" model to a smaller "student" model. However, KD on multimodal datasets such as vision-language datasets is relatively unexplored, and digesting such multimodal information is challenging because different modalities present different types of information. In this paper, we propose modality-specific distillation (MSD) to effectively transfer knowledge from a teacher on multimodal datasets. Existing KD approaches can be applied to a multimodal setup, but the student does not have access to the teacher's modality-specific predictions. Our idea is to mimic the teacher's modality-specific predictions by introducing an auxiliary loss term for each modality. Because each modality contributes differently to the final prediction, we also propose weighting approaches for the auxiliary losses, including a meta-learning approach that learns the optimal weights on these loss terms. In our experiments, we demonstrate the effectiveness of MSD and the proposed weighting schemes, showing that they achieve better performance than standard KD.
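As a minimal sketch (the notation below is ours, not taken from the paper), the training objective described above can be viewed as the usual distillation loss augmented with one weighted auxiliary term per modality:

\[
\mathcal{L}_{\text{MSD}} \;=\; \mathcal{L}_{\text{task}} \;+\; \lambda\, \mathcal{L}_{\text{KD}}\big(p_T(x),\, p_S(x)\big) \;+\; \sum_{m} w_m\, \mathcal{L}_{\text{KD}}\big(p_T(x_m),\, p_S(x_m)\big),
\]

where $x_m$ denotes the input restricted to modality $m$ (e.g., vision only or text only), $p_T$ and $p_S$ are the teacher's and student's predictive distributions, $\mathcal{L}_{\text{KD}}$ is a divergence between them such as the KL divergence, and the modality weights $w_m$ are either fixed or learned with the proposed meta-learning scheme.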