Large-scale pre-trained language models (PLMs) with powerful language modeling capabilities have been widely used in natural language processing. For automatic speech recognition (ASR), leveraging PLMs to improve performance has also become a promising research direction. However, most previous works may suffer from the inflexible sizes and structures of PLMs, as well as insufficient utilization of the knowledge in PLMs. To alleviate these problems, we propose hierarchical knowledge distillation for continuous integrate-and-fire (CIF) based ASR models. Specifically, we distill knowledge from PLMs into the ASR model by applying cross-modal distillation with a contrastive loss at the acoustic level and distillation with a regression loss at the linguistic level. On the AISHELL-1 dataset, our method achieves a 15% relative error rate reduction over the original CIF-based model and comparable performance (3.8%/4.1% on dev/test) to the state-of-the-art model.
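To make the two distillation levels concrete, the following is a minimal sketch of the corresponding loss terms, not the authors' implementation: an InfoNCE-style contrastive loss aligning CIF acoustic embeddings with PLM token embeddings, and an MSE regression loss matching linguistic-level hidden states. All tensor shapes, function names, and the temperature value are illustrative assumptions.

```python
# Sketch of the two distillation losses (assumed shapes and names, not the paper's code).
import torch
import torch.nn.functional as F

def acoustic_contrastive_loss(cif_emb, plm_emb, temperature=0.1):
    """Cross-modal contrastive (InfoNCE-style) loss at the acoustic level.

    cif_emb: (B, T, D) token-level acoustic embeddings from the CIF module
    plm_emb: (B, T, D) PLM embeddings of the corresponding tokens
    Each acoustic embedding is pulled toward the PLM embedding at the same
    token position and pushed away from all other positions in the batch.
    """
    B, T, D = cif_emb.shape
    a = F.normalize(cif_emb.reshape(B * T, D), dim=-1)
    p = F.normalize(plm_emb.reshape(B * T, D), dim=-1)
    logits = a @ p.t() / temperature              # (B*T, B*T) similarity matrix
    targets = torch.arange(B * T, device=a.device)
    return F.cross_entropy(logits, targets)

def linguistic_regression_loss(asr_hidden, plm_hidden):
    """Distillation with a regression (MSE) loss at the linguistic level,
    matching ASR decoder hidden states to PLM hidden states."""
    return F.mse_loss(asr_hidden, plm_hidden)

# Usage with random tensors standing in for real model outputs:
cif_emb = torch.randn(4, 20, 768)
plm_emb = torch.randn(4, 20, 768)
asr_hidden = torch.randn(4, 20, 768)
plm_hidden = torch.randn(4, 20, 768)
loss = acoustic_contrastive_loss(cif_emb, plm_emb) + linguistic_regression_loss(asr_hidden, plm_hidden)
```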