Text-only adaptation of an end-to-end (E2E) model remains a challenging task in automatic speech recognition (ASR). Language model (LM) fusion-based approaches require an additional external LM during inference, which significantly increases the computational cost. To overcome this, we propose internal LM adaptation (ILMA) of the E2E model using text-only data. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the token sequence probability, which is approximated by the E2E model output after zeroing out the encoder contribution. During ILMA, we fine-tune the internal LM, i.e., the E2E components excluding the encoder, to minimize a cross-entropy loss. For ILMA to be effective, it is essential to train the E2E model with an internal LM loss in addition to the standard E2E loss. Furthermore, we propose to regularize ILMA by minimizing the Kullback-Leibler divergence between the output distributions of the adapted and unadapted internal LMs. ILMA is most effective when only the last linear layer of the joint network is updated. ILMA enables fast text-only adaptation of the E2E model without increasing the run-time computational cost. In experiments with transformer transducer models trained on 30K hours of data, ILMA achieves up to 34.9% relative word error rate reduction from the unadapted baseline.
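As a rough illustration of the objective described above, the following minimal sketch computes internal LM logits by zeroing the encoder input of a toy joint network, then combines a cross-entropy term with a Kullback-Leibler penalty toward the unadapted internal LM. Everything here is an assumption made for illustration and not the authors' implementation: the function names, the simplified joint network (`tanh` of the summed encoder and prediction vectors followed by one linear layer `w_out`), the KL direction (unadapted to adapted), and the plain weighted sum controlled by the hypothetical `kl_weight`.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def joint_logits(enc, pred, w_out):
    """Toy joint network (assumed form): logits = w_out @ tanh(enc + pred)."""
    hidden = [math.tanh(e + p) for e, p in zip(enc, pred)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

def internal_lm_logits(pred, w_out):
    """Internal LM estimate: the same joint network with the encoder term zeroed."""
    return joint_logits([0.0] * len(pred), pred, w_out)

def ilma_loss(adapted, unadapted, targets, kl_weight=0.5):
    """Mean cross-entropy on the adaptation text plus a weighted KL penalty.

    adapted / unadapted: per-position internal LM logits (lists of lists).
    targets: per-position token ids from the adaptation text.
    The KL direction and the weighting scheme are assumptions.
    """
    ce, kl = 0.0, 0.0
    for a, u, t in zip(adapted, unadapted, targets):
        log_pa, log_pu = log_softmax(a), log_softmax(u)
        ce -= log_pa[t]
        # KL(unadapted || adapted) at this position
        kl += sum(math.exp(lu) * (lu - la) for lu, la in zip(log_pu, log_pa))
    return (ce + kl_weight * kl) / len(targets)

# Hypothetical toy data: two prediction-network states, a three-token vocabulary.
pred_states = [[0.2, -0.1], [0.5, 0.3]]
w_base = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]     # unadapted last linear layer
w_adapted = [[1.2, 0.0], [0.0, 0.9], [0.5, 0.5]]  # only this layer differs
targets = [0, 2]

base = [internal_lm_logits(p, w_base) for p in pred_states]
adapted = [internal_lm_logits(p, w_adapted) for p in pred_states]
loss = ilma_loss(adapted, base, targets)
```

In this sketch only `w_out` differs between the adapted and unadapted models, mirroring the setting in which just the last linear layer of the joint network is updated. A real system would obtain gradients of this loss with an automatic differentiation framework.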