Utilizing text-only data with an external language model (LM) in end-to-end RNN-Transducer (RNN-T) speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal LM (ILM) estimation (ILME) has been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that the RNN-T posterior should first subtract the implicitly learned ILM prior in order to integrate the external LM. While recent studies suggest that RNN-T learns only some low-order language model information, the DR method uses a well-trained ILM. We hypothesize that this setting is inappropriate and may deteriorate the performance of the DR method, and propose a low-order density ratio method (LODR) that trains a low-order, weak ILM for DR. Extensive empirical experiments are conducted in both in-domain and cross-domain scenarios on the English LibriSpeech & Tedlium-2 and Chinese WenetSpeech & AISHELL-1 datasets. LODR consistently outperforms SF in all tasks, while performing generally close to ILME and better than DR in most tests.
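As a hedged illustration of the score combination alluded to above (the symbols and interpolation weights below are ours, not taken from the paper): with an external LM $P_{\mathrm{ELM}}$ and weight $\lambda_e$, shallow fusion decodes with

$$\mathrm{score}_{\mathrm{SF}}(y) = \log P_{\mathrm{RNN\text{-}T}}(y \mid x) + \lambda_e \log P_{\mathrm{ELM}}(y),$$

whereas DR/ILME-style methods additionally subtract an ILM estimate with weight $\lambda_i$,

$$\mathrm{score}_{\mathrm{DR}}(y) = \log P_{\mathrm{RNN\text{-}T}}(y \mid x) - \lambda_i \log P_{\mathrm{ILM}}(y) + \lambda_e \log P_{\mathrm{ELM}}(y),$$

and LODR, as described above, would instantiate $P_{\mathrm{ILM}}$ with a low-order, weakly trained LM rather than a well-trained one.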