Training on massive speech corpora has driven the recent success of many self-supervised speech models. Through knowledge distillation, these models can further benefit from the knowledge encoded by language models pre-trained on rich text sources. The distillation process, however, is challenging due to the modal disparity between the textual and speech embedding spaces. This paper studies metric-based distillation that aligns the embedding spaces of text and speech using only a small amount of data and without modifying the model structure. Because the semantic and granularity gap between text and speech has been overlooked in the literature, which impairs distillation, we propose Prior-informed Adaptive knowledge Distillation (PAD), which adaptively leverages text/speech units of variable granularity and prior distributions to achieve better global and local alignment between text and speech pre-trained models. We evaluate on three spoken language understanding benchmarks and show that PAD transfers linguistic knowledge more effectively than other metric-based distillation approaches.
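As a rough illustration of metric-based cross-modal alignment (not the exact PAD formulation), the sketch below pairs a contrastive global loss on pooled utterance/sentence embeddings with a soft local loss that regresses each speech frame toward an attention-weighted mixture of text-token embeddings. All module names, tensor dimensions, and the specific loss choices are assumptions made for illustration only.

```python
# A minimal sketch of metric-based cross-modal distillation: a frozen text
# encoder provides target embeddings, and the speech encoder is trained so
# its representations match them at a global (utterance/sentence) and a
# local (frame/token) level. Loss forms and dimensions are illustrative
# assumptions, not the method proposed in the paper.

import torch
import torch.nn.functional as F

def global_alignment_loss(speech_utt, text_sent, temperature=0.07):
    """Contrastive (InfoNCE-style) loss between pooled utterance embeddings
    and pooled sentence embeddings within a batch."""
    s = F.normalize(speech_utt, dim=-1)          # (B, D)
    t = F.normalize(text_sent, dim=-1)           # (B, D)
    logits = s @ t.T / temperature               # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)

def local_alignment_loss(speech_frames, text_tokens):
    """Soft local alignment: each speech frame attends over text tokens and
    is regressed toward its attention-weighted text target (a stand-in for
    granularity-aware matching)."""
    # speech_frames: (B, T_s, D), text_tokens: (B, T_t, D)
    attn = torch.softmax(
        torch.bmm(speech_frames, text_tokens.transpose(1, 2))
        / speech_frames.size(-1) ** 0.5,
        dim=-1,
    )                                            # (B, T_s, T_t)
    aligned_text = torch.bmm(attn, text_tokens)  # (B, T_s, D)
    return F.mse_loss(speech_frames, aligned_text)

# Toy usage with random tensors standing in for encoder outputs.
B, T_s, T_t, D = 4, 50, 12, 256
speech_frames = torch.randn(B, T_s, D, requires_grad=True)   # student outputs
text_tokens = torch.randn(B, T_t, D)                         # frozen teacher outputs
loss = (
    global_alignment_loss(speech_frames.mean(dim=1), text_tokens.mean(dim=1))
    + local_alignment_loss(speech_frames, text_tokens)
)
loss.backward()
print(float(loss))
```

In practice the pooled and frame-level tensors would come from the speech and text encoders rather than random placeholders, and the text side would be kept frozen so only the speech model is updated.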