End-to-end speech recognition models are improved by incorporating external text sources, typically via fusion with an external language model. Such language models have to be retrained whenever the corpus of interest changes. Furthermore, since they store the entire corpus in their parameters, rare words can be challenging to recall. In this work, we propose augmenting a transducer-based ASR model with a retrieval language model, which directly retrieves plausible completions for a partial ASR hypothesis from an external text corpus. These completions are then integrated into subsequent predictions by an adapter, which is trained once, so that the corpus of interest can be switched without incurring the computational overhead of retraining. Our experiments show that the proposed model significantly improves the performance of a transducer baseline on a pair of question-answering datasets. Further, it outperforms shallow fusion on recognition of named entities by about 7% relative; when the two are combined, the relative improvement increases to 13%.
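The retrieval step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the n-gram prefix index, the toy corpus, and the function names are all assumptions for illustration only; the idea is simply that the last few tokens of a partial hypothesis are used as a key to look up plausible continuations in an external text corpus, which could then be fed to an adapter for rescoring.

```python
from collections import defaultdict

def build_index(corpus, n=2):
    """Index corpus sentences by their word n-grams.
    (Hypothetical helper; the paper's retriever may differ.)"""
    index = defaultdict(list)
    for sent in corpus:
        words = sent.split()
        for i in range(len(words) - n + 1):
            # Map each n-gram to (sentence tokens, position just after it).
            index[tuple(words[i:i + n])].append((words, i + n))
    return index

def retrieve_completions(index, partial, n=2, max_len=3):
    """Retrieve plausible continuations for a partial ASR hypothesis:
    look up its last n words in the index and return the tokens that
    follow each match in the corpus."""
    words = partial.split()
    if len(words) < n:
        return []
    key = tuple(words[-n:])
    completions = []
    for sent_words, pos in index.get(key, []):
        cont = sent_words[pos:pos + max_len]
        if cont:
            completions.append(" ".join(cont))
    return completions

# Toy external corpus; swapping it requires no retraining,
# only rebuilding the index.
corpus = [
    "the eiffel tower is in paris",
    "the eiffel tower was built in 1889",
]
index = build_index(corpus)
print(retrieve_completions(index, "i think the eiffel"))
# → ['tower is in', 'tower was built']
```

Note how rare entities ("eiffel", "1889") are recalled directly from the corpus text rather than from model parameters, which is the motivation for retrieval over a parametric language model.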