End-to-end (E2E) automatic speech recognition (ASR) implicitly learns the token sequence distribution of its paired audio-transcript training data. However, it remains susceptible to domain shifts between training and testing, and domain adaptation is still challenging. To alleviate this problem, this paper designs a replaceable internal language model (RILM) method, which makes it feasible to directly replace the internal language model (LM) of an E2E ASR model with a target-domain LM in the decoding stage when a domain shift is encountered. Furthermore, this paper proposes a residual softmax (R-softmax) for connectionist temporal classification (CTC)-based E2E ASR models, which adapts them to the target domain at inference time without retraining. For E2E ASR models trained on the LibriSpeech corpus, experiments showed that the proposed methods gave a 2.6% absolute WER reduction on Switchboard data and a 1.0% WER reduction on the AESRC2020 corpus while maintaining intra-domain ASR results.
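The abstract does not spell out the scoring rule used when the internal LM is replaced at decoding time. As a rough illustration of the general idea only, a density-ratio-style log-linear fusion can be sketched as below: subtract an estimate of the source-domain internal LM score and add the target-domain LM score when rescoring hypotheses. The function `combined_score`, the weights `lam_int`/`lam_tgt`, and the toy hypothesis scores are all assumptions for illustration, not the paper's actual formulation.

```python
# Minimal sketch (not the paper's exact method) of internal-LM replacement
# during decoding: the internal-LM contribution is subtracted and a
# target-domain LM score is added, density-ratio style.

def combined_score(log_p_e2e: float,    # log p_E2E(y | x) from the E2E model
                   log_p_int: float,    # log p_int(y), estimated internal LM
                   log_p_tgt: float,    # log p_tgt(y), target-domain LM
                   lam_int: float = 0.3,  # subtraction weight (assumed)
                   lam_tgt: float = 0.3   # fusion weight (assumed)
                   ) -> float:
    return log_p_e2e - lam_int * log_p_int + lam_tgt * log_p_tgt

# Toy usage: choose between two beam hypotheses for a conversational
# utterance; all scores below are fabricated for illustration.
hyps = [  # (text, log_p_e2e, log_p_int, log_p_tgt)
    ("uh huh right", -4.1, -6.0, -3.2),
    ("a ha write",   -3.9, -3.5, -7.8),
]
best = max(hyps, key=lambda h: combined_score(h[1], h[2], h[3]))
print(best[0])  # the target-domain LM favours "uh huh right"
```

Under this kind of formulation, swapping domains only changes which external LM supplies `log_p_tgt`, which is consistent with the abstract's claim of adaptation at decoding time without retraining the E2E model.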