On-device speech recognition requires training models of different sizes for deployment on devices with various computational budgets. When building such different models, we can benefit from training them jointly to take advantage of the knowledge shared between them. Joint training is also efficient since it reduces the redundancy in the training procedure's data handling operations. We propose a method for collaboratively training acoustic encoders of different sizes for speech recognition. We use a sequence transducer setup where the different acoustic encoders share common predictor and joiner modules. The acoustic encoders are also trained using co-distillation through an auxiliary task for frame-level chenone prediction, along with the transducer loss. We perform experiments using the LibriSpeech corpus and demonstrate that the collaboratively trained acoustic encoders can provide up to an 11% relative improvement in the word error rate on both test partitions.
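The joint objective outlined above can be sketched in a few lines of numpy. This is only an illustrative assumption of how the pieces might combine: the symmetric-KL form of the co-distillation term, the loss weights, and all function names here are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the class dimension
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def frame_cross_entropy(logits, targets):
    # auxiliary frame-level CE against integer chenone targets,
    # averaged over frames; logits: (T, C), targets: (T,)
    p = softmax(logits)
    rows = np.arange(len(targets))
    return -np.mean(np.log(p[rows, targets] + 1e-12))

def co_distillation(logits_a, logits_b):
    # symmetric KL between the two encoders' frame-level chenone
    # posteriors (an assumed form of the co-distillation term)
    pa, pb = softmax(logits_a), softmax(logits_b)
    kl_ab = np.sum(pa * (np.log(pa + 1e-12) - np.log(pb + 1e-12)), axis=-1)
    kl_ba = np.sum(pb * (np.log(pb + 1e-12) - np.log(pa + 1e-12)), axis=-1)
    return np.mean(kl_ab + kl_ba) / 2.0

def joint_loss(rnnt_large, rnnt_small, aux_logits_large, aux_logits_small,
               chenone_targets, aux_weight=0.1, distill_weight=0.1):
    # transducer losses for both encoders (computed elsewhere, with the
    # shared predictor/joiner) plus auxiliary CE and co-distillation;
    # the weights are illustrative placeholders
    aux = (frame_cross_entropy(aux_logits_large, chenone_targets)
           + frame_cross_entropy(aux_logits_small, chenone_targets))
    distill = co_distillation(aux_logits_large, aux_logits_small)
    return rnnt_large + rnnt_small + aux_weight * aux + distill_weight * distill
```

In this sketch, the two encoders are tied only through the shared transducer modules and the frame-level distillation term, so either encoder can be deployed alone at inference time.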