The remarkable performance of pre-trained language models (LMs) trained with self-supervised learning has led to a major paradigm shift in natural language processing research. In line with this shift, improving speech recognition systems with massive deep-learning-based LMs has become a major topic of speech recognition research. Among the various methods for applying LMs to speech recognition systems, in this paper we focus on a cross-modal knowledge distillation method that transfers knowledge between two deep neural networks of different modalities. We propose an acoustic model structure with multiple auxiliary output layers for cross-modal distillation and demonstrate that the proposed method effectively compensates for the shortcomings of the existing label-interpolation-based distillation method. In addition, we extend the proposed method to a hierarchical distillation method that uses LMs trained on different units (senones, monophones, and subwords) and show the effectiveness of the hierarchical distillation method through an ablation study.
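To make the described architecture concrete, the following is a minimal sketch of one plausible realization: a shared acoustic encoder with a main senone output layer plus auxiliary monophone and subword heads, each trained with a label-interpolation-style objective that mixes hard-label cross-entropy with soft targets supplied by a unit-matched LM teacher. The encoder type, vocabulary sizes, and hyperparameters (`alpha`, `tau`) are illustrative assumptions, not the paper's exact configuration, and the actual training objective of the proposed method may differ.

```python
# Illustrative sketch only (not the authors' released code): an acoustic model
# with multiple auxiliary output heads for hierarchical cross-modal distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAcousticModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=512, n_senones=4000,
                 n_monophones=40, n_subwords=1000):
        super().__init__()
        # Shared acoustic encoder (an LSTM stack is assumed for illustration).
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4, batch_first=True)
        # Main output layer (senone posteriors) plus auxiliary heads at
        # coarser/finer linguistic units for hierarchical distillation.
        self.senone_head = nn.Linear(hidden, n_senones)
        self.monophone_head = nn.Linear(hidden, n_monophones)
        self.subword_head = nn.Linear(hidden, n_subwords)

    def forward(self, feats):
        h, _ = self.encoder(feats)                  # (B, T, hidden)
        return {
            "senone": self.senone_head(h),
            "monophone": self.monophone_head(h),
            "subword": self.subword_head(h),
        }

def distillation_loss(logits, hard_targets, lm_soft_targets, alpha=0.5, tau=1.0):
    """Per-head loss: hard-label cross-entropy interpolated with a KL term
    against soft targets from an LM trained on the matching unit."""
    # logits: (B, T, C); hard_targets: (B, T); lm_soft_targets: (B, T, C) probs.
    ce = F.cross_entropy(logits.transpose(1, 2), hard_targets)
    kld = F.kl_div(F.log_softmax(logits / tau, dim=-1),
                   lm_soft_targets, reduction="batchmean") * tau * tau
    return (1.0 - alpha) * ce + alpha * kld
```

In this sketch the per-head losses would simply be summed (possibly with per-unit weights) to train the shared encoder; how the actual paper weights the senone, monophone, and subword heads is not specified in the abstract.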