This work studies knowledge distillation (KD) and addresses its constraints for recurrent neural network transducer (RNN-T) models. In hard distillation, a teacher model transcribes large amounts of unlabelled speech to train a student model. Soft distillation is another popular KD method that distills the output logits of the teacher model. Due to the nature of RNN-T alignments, applying soft distillation between RNN-T architectures with different posterior distributions is challenging. In addition, bad teachers with high word error rates (WERs) reduce the efficacy of KD. We investigate how to effectively distill knowledge from ASR teachers of variable quality, which, to the best of our knowledge, has not been studied before. We show that a sequence-level KD method, full-sum distillation, outperforms other distillation methods for RNN-T models, especially for bad teachers. We also propose a variant of full-sum distillation that distills the sequence-discriminative knowledge of the teacher, leading to further WER improvements. We conduct experiments on public datasets, namely SpeechStew and LibriSpeech, and on in-house production data.
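To make the soft distillation setup above concrete, the snippet below is a minimal, generic sketch of a frame-level distillation loss: the KL divergence between teacher and student output distributions computed from their logits. It is not the paper's implementation; the function and argument names (soft_distillation_loss, temperature) are illustrative assumptions.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def soft_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student), averaged over all frame/label positions.

    teacher_logits, student_logits: arrays of shape [..., vocab_size].
    temperature: softens both distributions before comparison (illustrative).
    """
    t_logp = log_softmax(teacher_logits / temperature)
    s_logp = log_softmax(student_logits / temperature)
    t_p = np.exp(t_logp)
    # Per-position KL divergence, summed over the output vocabulary.
    kl = (t_p * (t_logp - s_logp)).sum(axis=-1)
    return kl.mean()
```

Such a frame-level loss assumes the teacher and student posteriors are defined over comparable alignments, which is exactly the assumption that breaks down between differing RNN-T architectures; this motivates the sequence-level (full-sum) distillation studied in the paper.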