The limited memory bandwidth of smart devices motivates the development of smaller Automatic Speech Recognition (ASR) models. One way to obtain a smaller model is to apply model compression techniques. Knowledge distillation (KD) is a popular model compression approach that has been shown to reduce model size with relatively little degradation in model performance. In this approach, knowledge is distilled from a large trained teacher model into a smaller student model. Transducer-based models have recently been shown to perform well on on-device streaming ASR tasks, while conformer models are efficient at handling long-term dependencies. Hence, in this work we employ a streaming transducer architecture with a conformer as the encoder. We propose a multi-stage progressive approach to compress the conformer transducer model using KD. We progressively update our teacher model with the distilled student model in a multi-stage setup. On the standard LibriSpeech dataset, our experiments achieve compression rates greater than 60% without significant degradation in performance compared to the larger teacher model.
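The multi-stage idea can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes that "updating the teacher" means the distilled student of one stage serves as the teacher of the next, that the distillation term is a temperature-softened KL divergence between teacher and student output distributions, and that compression rate means the fractional reduction in parameter count. All function names, the temperature, and the parameter counts used in the usage note are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: a generic KD objective, not necessarily the loss
    used for the conformer transducer in this work.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def progressive_distill(teacher, student_factories, distill):
    """Run one distillation per stage, promoting each student to teacher.

    `distill(teacher, student)` is a hypothetical callback that trains
    the student against the teacher and returns the trained student.
    """
    for make_student in student_factories:
        student = distill(teacher, make_student())
        teacher = student  # the distilled student becomes the next teacher
    return teacher


def stage_sizes(teacher_params, keep_fraction, num_stages):
    """Parameter counts for the teacher and each successive student."""
    sizes = [teacher_params]
    for _ in range(num_stages):
        sizes.append(round(sizes[-1] * keep_fraction))
    return sizes


def compression_rate(original_params, compressed_params):
    """Fraction of parameters removed relative to the original model."""
    return 1.0 - compressed_params / original_params
```

As a purely illustrative usage, `stage_sizes(100_000_000, 0.6, 2)` describes a hypothetical 100M-parameter teacher shrunk by 40% at each of two stages, and `compression_rate` applied to the first and last sizes gives the overall reduction. These numbers are placeholders and are not results from the experiments.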