Multilingual speech data often suffers from a long-tailed language distribution, resulting in performance degradation. However, multilingual text data is much easier to obtain, yielding a more useful general language model. Hence, we are motivated to distill the rich knowledge embedded in a well-trained teacher text model into the student speech model. We propose a novel method, Distilling a Language model to a Speech model (Distill-L2S), which aligns the latent representations of the two modalities. Their subtle differences are handled by a shrinking mechanism, nearest-neighbor interpolation, and a learnable linear projection layer. We demonstrate the effectiveness of our distillation method by applying it to the multilingual automatic speech recognition (ASR) task. We distill the transformer-based cross-lingual language model (InfoXLM) while fine-tuning the large-scale multilingual ASR model (XLSR-wav2vec 2.0) for each language. We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset, each with less than 100 hours of speech data.