Goal-oriented conversational interfaces are designed to accomplish specific tasks; their interactions typically span multiple turns and adhere to a pre-defined structure and goal. However, conventional neural language models (NLMs) in Automatic Speech Recognition (ASR) systems are mostly trained sentence-wise, with limited context. In this paper, we explore different ways to incorporate context into an LSTM-based NLM in order to model long-range dependencies and improve speech recognition. Specifically, we carry context over across multiple turns and use lexical contextual cues such as the system dialog act from Natural Language Understanding (NLU) models and the user-provided structure of the chatbot. We also propose a new architecture that utilizes context embeddings derived from BERT on sample utterances provided at inference time. Our experiments show a 7% relative word error rate (WER) reduction over non-contextual utterance-level NLM rescorers on goal-oriented audio datasets.
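To make the rescoring setup concrete, the sketch below shows how an n-best list from a first-pass ASR system can be reranked with a score from a context-aware language model. The toy `toy_contextual_lm` function, the interpolation weight `lam`, and the example hypotheses are illustrative assumptions, not the paper's actual LSTM NLM; they only mimic how context carry-over favors hypotheses consistent with earlier turns.

```python
# Minimal sketch of n-best rescoring with a contextual LM score.
# Assumption: each hypothesis is a dict with "text" and a first-pass
# acoustic/AM score; the real system would use an LSTM NLM conditioned
# on dialog context (prior turns, dialog act, BERT context embeddings).

def rescore(nbest, lm_score, lam=0.5):
    """Return the hypothesis maximizing am_score + lam * lm_score(text)."""
    return max(nbest, key=lambda h: h["am_score"] + lam * lm_score(h["text"]))

def toy_contextual_lm(context_words):
    # Toy stand-in for a contextual NLM: rewards hypotheses that overlap
    # with prior-turn context, mimicking context carry-over across turns.
    ctx = set(context_words)
    def score(text):
        words = text.split()
        return sum(1.0 for w in words if w in ctx) / max(len(words), 1)
    return score

# Hypothetical n-best list: the acoustically preferred hypothesis contains
# a recognition error ("fight"), which the contextual score corrects.
nbest = [
    {"text": "book a flight to boston", "am_score": -10.2},
    {"text": "book a fight to boston",  "am_score": -10.0},
]
lm = toy_contextual_lm(["flight", "boston", "travel"])
best = rescore(nbest, lm, lam=2.0)
# best["text"] == "book a flight to boston"
```

In the paper's setting, the toy LM score would be replaced by the probability assigned by the contextual LSTM NLM, and the interpolation weight would be tuned on a development set.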