End-to-end (E2E) models have become increasingly popular in ASR tasks because of their performance and practical advantages. These E2E models directly approximate the posterior distribution of output tokens given the acoustic inputs. Consequently, E2E systems implicitly define a language model (LM) over the output tokens, which makes the exploitation of independently trained language models less straightforward than in conventional ASR systems. This in turn makes it difficult to dynamically adapt an E2E ASR system to contextual profiles so that it better recognizes special words such as named entities. In this work, we propose a contextual density ratio approach for both training a context-aware E2E model and adapting the language model to named entities. We apply this technique to an E2E ASR system that transcribes doctor-patient conversations, in order to better adapt the E2E system to the names occurring in those conversations. Our proposed technique achieves a relative improvement of up to 46.5% on names over an E2E baseline, without degrading the overall recognition accuracy on the whole test set. Moreover, it also surpasses a contextual shallow fusion baseline by 22.1% relative.
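The density ratio idea can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual implementation: during decoding, the implicit source-domain LM of the E2E model is divided out (subtracted in log space) and a context LM covering the named entities is added in. The function name, weights, and toy scores below are all assumptions for illustration.

```python
def density_ratio_score(log_p_e2e, log_p_context_lm, log_p_source_lm,
                        lam_context=0.3, lam_source=0.3):
    """Hypothetical density-ratio fusion score for one hypothesis.

    log_p_e2e:        log-probability from the E2E model (implicitly
                      contains the source-domain LM)
    log_p_context_lm: log-probability under a context LM adapted to
                      named entities
    log_p_source_lm:  log-probability under the source-domain LM,
                      subtracted to cancel the E2E model's implicit LM
    """
    return (log_p_e2e
            + lam_context * log_p_context_lm
            - lam_source * log_p_source_lm)

# Toy rescoring of two competing hypotheses for the same audio:
# one contains a name favored by the context LM, the other a generic word.
hyp_name = density_ratio_score(-4.0, log_p_context_lm=-2.0, log_p_source_lm=-6.0)
hyp_generic = density_ratio_score(-3.5, log_p_context_lm=-7.0, log_p_source_lm=-3.0)
print("name wins" if hyp_name > hyp_generic else "generic wins")
```

With these toy numbers the name hypothesis wins even though its raw E2E score is lower, which is the intended effect of biasing decoding toward contextual entities.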