Distilled self-supervised models have shown competitive performance and efficiency in recent years. However, little prior work has explored jointly distilling multiple self-supervised speech models. In our work, we performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models, including HuBERT, RobustHuBERT, and WavLM. We applied two different aggregation techniques, layerwise-average and layerwise-concatenation, to the representations of the different teacher models and found that the former was more effective. On top of that, we proposed a multiple-prediction-head method that lets the student model predict different layer outputs of multiple teacher models simultaneously. The experimental results show that our method improves the performance of the distilled models on four downstream speech processing tasks in the hidden-set track of the SUPERB benchmark: Phoneme Recognition, Speaker Identification, Emotion Recognition, and Automatic Speech Recognition.
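The sketch below illustrates the two ideas named above: layerwise-average aggregation of teacher representations, and a student with multiple prediction heads trained against those aggregated targets. It is a minimal illustration, not the authors' implementation; the class and function names (`layerwise_average`, `MultiHeadStudent`, `distillation_loss`), the use of an L1 objective, and the dummy encoder in the usage example are assumptions for demonstration only.

```python
# Minimal sketch (assumptions labeled): layerwise-average teacher aggregation
# and a multi-head student for ensemble knowledge distillation.
import torch
import torch.nn as nn


def layerwise_average(teacher_hiddens):
    """Average the l-th layer representation across teachers.

    teacher_hiddens: list over teachers; each element is a list of per-layer
    tensors of shape (batch, time, dim). Returns one averaged target per layer.
    """
    num_layers = len(teacher_hiddens[0])
    return [
        torch.stack([t[l] for t in teacher_hiddens], dim=0).mean(dim=0)
        for l in range(num_layers)
    ]


class MultiHeadStudent(nn.Module):
    """Shared student encoder with one prediction head per distilled target layer."""

    def __init__(self, encoder, student_dim, num_heads, target_dim):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleList(
            [nn.Linear(student_dim, target_dim) for _ in range(num_heads)]
        )

    def forward(self, x):
        h = self.encoder(x)                      # (batch, time, student_dim)
        return [head(h) for head in self.heads]  # one prediction per target layer


def distillation_loss(predictions, targets):
    """L1 distance between each head's prediction and its aggregated target."""
    return sum(nn.functional.l1_loss(p, t) for p, t in zip(predictions, targets))


# Illustrative usage with dummy shapes and a stand-in encoder.
encoder = nn.Linear(80, 256)          # placeholder for a small Transformer student
student = MultiHeadStudent(encoder, student_dim=256, num_heads=3, target_dim=768)
feats = torch.randn(4, 100, 80)       # (batch, time, feature)
preds = student(feats)

# Two hypothetical teachers, each providing three target layers of dim 768.
teacher_hiddens = [[torch.randn(4, 100, 768) for _ in range(3)] for _ in range(2)]
targets = layerwise_average(teacher_hiddens)
loss = distillation_loss(preds, targets)
```

Under these assumptions, layerwise-concatenation would instead stack teacher representations along the feature dimension (giving targets of dim 2 x 768 here), which is the alternative the abstract reports as less effective.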