Language modeling (LM) for automatic speech recognition (ASR) does not usually incorporate utterance-level contextual information. For some domains, such as voice assistants, however, additional context, such as the time at which an utterance was spoken, provides a rich input signal. We introduce an attention mechanism for training neural speech recognition language models on both text and non-linguistic contextual data. When applied to a large de-identified dataset of utterances collected by a popular voice assistant platform, our method reduces perplexity by 7.0% relative over a standard LM that does not incorporate contextual information. When evaluated on utterances extracted from the long tail of the dataset, our method improves perplexity by 9.0% relative over a standard LM and by over 2.8% relative when compared to a state-of-the-art model for contextual LM.
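To make the approach concrete, below is a minimal sketch of how attention over non-linguistic context might be wired into a neural LM. It assumes a dot-product attention over embedded context features (e.g., time-of-day buckets) combined with an LSTM backbone; all class and parameter names here are hypothetical illustrations, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ContextAttentionLM(nn.Module):
    """Sketch: LSTM language model that attends over embeddings of
    non-linguistic context features (e.g., time-of-day, weekday)."""

    def __init__(self, vocab_size, n_ctx_values, d_model=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # one embedding per discrete context value (assumed discretized context)
        self.ctx_emb = nn.Embedding(n_ctx_values, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.query = nn.Linear(d_model, d_model)  # hidden state -> attention query
        self.out = nn.Linear(2 * d_model, vocab_size)

    def forward(self, tokens, ctx_ids):
        # tokens: (B, T) token ids; ctx_ids: (B, C) ids of context features
        h, _ = self.lstm(self.tok_emb(tokens))            # (B, T, D)
        ctx = self.ctx_emb(ctx_ids)                       # (B, C, D)
        # each step's hidden state queries the set of context embeddings
        scores = torch.einsum("btd,bcd->btc", self.query(h), ctx)
        weights = scores.softmax(dim=-1)                  # (B, T, C)
        ctx_vec = torch.einsum("btc,bcd->btd", weights, ctx)
        # fuse attended context with the hidden state before prediction
        return self.out(torch.cat([h, ctx_vec], dim=-1))  # next-token logits

# Usage sketch: 3 context features per utterance, each drawn from 24 values.
model = ContextAttentionLM(vocab_size=10_000, n_ctx_values=24)
logits = model(torch.randint(0, 10_000, (2, 12)), torch.randint(0, 24, (2, 3)))
```

The key design choice this illustrates is that the attention weights are recomputed at every time step, so the model can emphasize different context features depending on the words decoded so far, rather than conditioning on a single fixed context vector.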