Adapting end-to-end speech recognition systems to new tasks is known to be challenging. A number of solutions have been proposed that apply external language models with various fusion methods, possibly combined with two-pass decoding. TTS systems have also been used to generate adaptation data for end-to-end models. In this paper we show that RNN-transducer models can be effectively adapted to new domains using only small amounts of textual data. By exploiting the model's inherent structure, in which the prediction network can be interpreted as a language model, we can adapt the model quickly. Adapting the model avoids the need for complicated decoding-time fusions and external language models. With appropriate regularization, the prediction network can be adapted to new domains while still retaining good generalization capabilities. We demonstrate on multiple ASR evaluation tasks that this method yields relative gains of 10-45% in target-task WER. We also share insights into how the RNN-transducer prediction network performs as a language model.
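The core idea — fine-tuning the prediction network on target-domain text as if it were a language model, with a regularizer pulling the weights back toward their pretrained values — can be illustrated with a minimal toy sketch. This is not the paper's implementation: a tiny bigram softmax model stands in for the prediction network, and the vocabulary size, regularization weight, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8                              # toy vocabulary size (assumption)
W0 = rng.normal(0.0, 0.1, (V, V))  # "pretrained" prediction-network weights
W = W0.copy()

# Target-domain text-only adaptation data as (previous, next) token-id bigrams.
data = [(1, 2), (2, 3), (3, 1), (1, 2), (2, 3)]

def ce_loss(W):
    """Mean cross-entropy of the bigram model on the adaptation text."""
    loss = 0.0
    for prev, nxt in data:
        logits = W[prev]
        logp = logits - np.log(np.exp(logits).sum())
        loss -= logp[nxt]
    return loss / len(data)

lam, lr = 0.1, 0.5                 # L2-toward-pretrained weight, step size
init = ce_loss(W)
for _ in range(200):
    grad = np.zeros_like(W)
    for prev, nxt in data:
        p = np.exp(W[prev]); p /= p.sum()
        p[nxt] -= 1.0              # d(cross-entropy)/d(logits) = softmax - onehot
        grad[prev] += p / len(data)
    grad += 2.0 * lam * (W - W0)   # regularizer keeps weights near pretrained values
    W -= lr * grad

adapted = ce_loss(W)
print(adapted < init)              # adaptation lowers loss on target-domain text
```

The regularization term is what preserves generalization: without it, the weights can drift arbitrarily far from `W0` and overfit the small adaptation set.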