An end-to-end (E2E) ASR model implicitly learns a prior Internal Language Model (ILM) from its training transcripts. To fuse an external LM under Bayes' rule, the log likelihood produced by the ILM must be accurately estimated and subtracted from the E2E posterior. In this paper we propose two novel approaches to estimating the ILM of a Listen-Attend-Spell (LAS) model. The first replaces the context vector of the LAS decoder at every time step with a single vector learned from the training transcripts. The second uses a lightweight feed-forward network to map the decoder query vector directly to a context vector, so the context adapts dynamically at each step. In both methods the context vectors are learned by minimizing perplexity on the training transcripts and are independent of the encoder output, so the ILM is estimated accurately. Experiments show that the proposed estimators achieve the lowest ILM perplexity, indicating their efficacy, and that they significantly outperform shallow fusion as well as two previously proposed ILM Estimation (ILME) approaches on several datasets.
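For context, ILME-based fusion typically scores hypotheses during beam search by subtracting the estimated ILM score from the E2E posterior while adding the external LM score. The rule below is the standard ILME fusion objective consistent with the description above, not a formula taken from this abstract; the interpolation weights lambda_LM and lambda_ILM are tuning assumptions:

```latex
\hat{y} = \operatorname*{arg\,max}_{y}\;
  \log P_{\mathrm{E2E}}(y \mid x)
  + \lambda_{\mathrm{LM}} \log P_{\mathrm{LM}}(y)
  - \lambda_{\mathrm{ILM}} \log P_{\mathrm{ILM}}(y)
```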
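A minimal PyTorch sketch of the two proposed ILM estimation variants, under assumptions about dimensions and module names that the abstract does not specify: method 1 replaces the attention context with one learned vector shared across time steps, while method 2 maps the decoder query to a context vector with a small feed-forward net. In either case the encoder output is never consulted, which is why the estimate isolates the ILM.

```python
import torch
import torch.nn as nn


class ILMContextEstimator(nn.Module):
    """Sketch of the two ILM estimation variants for a LAS decoder.

    The module stands in for the attention mechanism: instead of
    attending over encoder states, it returns a context vector that
    depends only on learned parameters (and, in "dynamic" mode, on
    the decoder query). Dimensions are illustrative assumptions.
    """

    def __init__(self, context_dim: int, query_dim: int, mode: str = "static"):
        super().__init__()
        self.mode = mode
        if mode == "static":
            # Method 1: a single context vector shared across all time
            # steps, learned by minimizing perplexity on the transcripts.
            self.context = nn.Parameter(torch.zeros(context_dim))
        elif mode == "dynamic":
            # Method 2: a lightweight feed-forward net maps the decoder
            # query to a context vector at each decoding step.
            self.query_to_context = nn.Sequential(
                nn.Linear(query_dim, context_dim),
                nn.Tanh(),
                nn.Linear(context_dim, context_dim),
            )
        else:
            raise ValueError(f"unknown mode: {mode}")

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, query_dim) decoder state at the current step.
        if self.mode == "static":
            # Broadcast the learned vector over the batch dimension.
            return self.context.expand(query.size(0), -1)
        return self.query_to_context(query)
```

In this sketch, training would minimize the decoder's cross-entropy (equivalently, perplexity) on the training transcripts with this module substituted for attention; at inference, the resulting ILM log probability is subtracted as in the fusion rule above.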