We propose a simple yet effective method to compress an RNN-Transducer (RNN-T) through the well-known knowledge distillation paradigm. We show that the transducer's encoder outputs naturally have high entropy and contain rich information about acoustically similar word-piece confusions. This rich information is suppressed when combined with the lower-entropy decoder outputs to produce the joint network logits. Consequently, we introduce an auxiliary loss to distill the encoder logits from a teacher transducer's encoder, and explore training strategies under which this encoder distillation is effective. We find that tandem training of teacher and student encoders with in-place encoder distillation outperforms the use of a pre-trained, static teacher transducer. We also report an interesting phenomenon, which we refer to as implicit distillation, that occurs when the teacher and student encoders share the same decoder. Our experiments show 5.37-8.4% relative word error rate reductions (WERRs) on in-house test sets, and 5.05-6.18% relative WERRs on LibriSpeech test sets.
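Since the abstract describes an auxiliary loss that distills the teacher encoder's logits into the student encoder, the following is a minimal sketch of one plausible form of such a loss. It assumes both encoders' outputs are projected to the same word-piece vocabulary; the temperature and auxiliary weight are hypothetical hyperparameters, not values taken from the paper.

```python
import torch
import torch.nn.functional as F


def encoder_distillation_loss(student_logits: torch.Tensor,
                              teacher_logits: torch.Tensor,
                              temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between teacher and student encoder output distributions.

    student_logits, teacher_logits: (batch, time, vocab) encoder outputs
    projected to the word-piece vocabulary. `temperature` is illustrative.
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Detach the teacher so gradients only flow into the student encoder.
    teacher_probs = F.softmax(teacher_logits.detach() / t, dim=-1)
    # Scale by t^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)


# Hypothetical usage: add the distillation term to the standard RNN-T loss,
# where `aux_weight` is an assumed tuning knob.
# total_loss = rnnt_loss + aux_weight * encoder_distillation_loss(
#     student_encoder_out, teacher_encoder_out)
```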