Utilizing text-only data via an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that the RNN-T posterior should first subtract the implicitly learned internal language model (ILM) prior in order to integrate the ELM. While recent studies suggest that RNN-T learns only some low-order language model information, the DR method uses a well-trained neural language model with full context, which may be inappropriate for estimating the ILM and can deteriorate integration performance. Building on the DR method, we propose a low-order density ratio method (LODR) that replaces the language model used for ILM estimation with a low-order weak language model. Extensive empirical experiments are conducted in both in-domain and cross-domain scenarios on the English LibriSpeech & Tedlium-2 and the Chinese WenetSpeech & AISHELL-1 datasets. The results show that LODR consistently outperforms SF in all tasks, while performing generally close to ILME and better than DR in most tests.
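To make the comparison concrete, the following is a minimal sketch of the log-linear score each method maximizes over hypotheses y given acoustics x; the weight symbols \(\lambda\) and the label \(P_{\text{SLM}}\) for the source-domain LM are illustrative notation, not necessarily the paper's own:

\[
\begin{aligned}
\text{SF:}   \quad & \log P_{\text{RNN-T}}(y \mid x) + \lambda_e \log P_{\text{ELM}}(y) \\
\text{DR:}   \quad & \log P_{\text{RNN-T}}(y \mid x) - \lambda_s \log P_{\text{SLM}}(y) + \lambda_e \log P_{\text{ELM}}(y) \\
\text{ILME:} \quad & \log P_{\text{RNN-T}}(y \mid x) - \lambda_i \log P_{\text{ILM}}(y) + \lambda_e \log P_{\text{ELM}}(y) \\
\text{LODR:} \quad & \log P_{\text{RNN-T}}(y \mid x) - \lambda_b \log P_{\text{LO}}(y) + \lambda_e \log P_{\text{ELM}}(y)
\end{aligned}
\]

Here \(P_{\text{SLM}}\) is a full-context neural LM trained on source-domain text, \(P_{\text{ILM}}\) is the ILM estimate obtained from the transducer itself, and \(P_{\text{LO}}\) is the low-order weak LM of LODR (e.g., a bi-gram trained on the training transcripts). In all three subtraction-based methods, removing the estimated ILM prior in the log domain corresponds to applying Bayes' rule so that the ELM can serve as the new prior.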