Speech recognition on smart devices is challenging owing to their limited memory footprint. Hence, small-sized ASR models are desirable. With the use of popular transducer-based models, it has become practical to deploy streaming speech recognition models on small devices [1]. Recently, the two-pass model [2], which combines RNN-T and LAS modules, has shown exceptional performance for streaming on-device speech recognition. In this work, we propose a simple and effective approach to reduce the size of the two-pass model for memory-constrained devices. We employ a popular knowledge distillation approach in three stages using the Teacher-Student training technique. In the first stage, we use a trained RNN-T model as the teacher and perform knowledge distillation to train the student RNN-T model. The second stage uses the shared encoder and trains a LAS rescorer for the student model using the trained RNN-T+LAS teacher model. Finally, we perform deep finetuning of the student model with the shared RNN-T encoder, RNN-T decoder, and LAS rescorer. Our experimental results on the standard LibriSpeech dataset show that our system can achieve a high compression rate of 55% without significant degradation in WER compared to the two-pass teacher model.
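To make the Teacher-Student training step concrete, the following is a minimal sketch of a frame-level distillation loss between teacher and student output distributions, written in PyTorch. It is an illustration of the generic technique only, not the paper's exact objective; the logit shapes, the temperature parameter, and the function name distillation_loss are assumptions introduced here for clarity.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=1.0):
        # Hypothetical shapes: (batch, time, vocab) posteriors from teacher and student.
        # Soften the teacher distribution and compare it to the student's
        # log-probabilities with a KL divergence, as in standard knowledge distillation.
        t_prob = F.softmax(teacher_logits / temperature, dim=-1)
        s_logprob = F.log_softmax(student_logits / temperature, dim=-1)
        # kl_div expects log-probabilities as input and probabilities as target;
        # the temperature**2 factor keeps gradient magnitudes comparable across temperatures.
        return F.kl_div(s_logprob, t_prob, reduction="batchmean") * temperature ** 2

In such a setup, this loss would typically be combined with the usual transducer (or cross-entropy) loss on ground-truth transcripts, weighted by a tunable interpolation factor.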