Conversational speech is typically characterized by loose syntactic structure at the utterance level, yet it exhibits topical coherence across consecutive utterances. Prior work has shown that capturing longer context with a recurrent neural network or long short-term memory language model (LM) may suffer from recency bias while neglecting long-range context. To capture long-term semantic interactions among words and across utterances, we propose several conversation-history fusion methods for language modeling in automatic speech recognition (ASR) of conversational speech. Furthermore, we introduce a novel audio-fusion mechanism that fuses and exploits, in a cooperative manner, the acoustic embeddings of the current utterance and the semantic content of its conversation history. To flesh out our ideas, we frame the ASR N-best hypothesis rescoring task as a prediction problem, leveraging BERT, an iconic pre-trained LM, as the backbone for selecting the oracle hypothesis from a given N-best hypothesis list. Empirical experiments conducted on the AMI benchmark dataset demonstrate the feasibility and efficacy of our methods in comparison with several state-of-the-art methods. The proposed methods not only achieve a significant reduction in inference time but also improve ASR performance on conversational speech.
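The rescoring framing described above can be sketched minimally as follows. This is an illustrative sketch, not the paper's implementation: `semantic_score` is a hypothetical stand-in for a fine-tuned BERT scorer (a real system would encode the history and hypothesis jointly, e.g. as "[CLS] history [SEP] hypothesis [SEP]", and predict a fitness score), and the interpolation weight `alpha` is an assumed hyperparameter.

```python
def semantic_score(history: str, hypothesis: str) -> float:
    # Placeholder for a BERT-based score of how well `hypothesis` fits the
    # conversation `history`; stubbed with word overlap so the sketch runs
    # self-contained.
    shared = set(history.lower().split()) & set(hypothesis.lower().split())
    return len(shared) / max(len(hypothesis.split()), 1)

def rescore_nbest(history: str, nbest: list[tuple[str, float]], alpha: float = 0.7) -> list[str]:
    """Interpolate first-pass ASR scores with history-aware semantic scores
    and return the N-best hypotheses re-ranked by the combined score."""
    rescored = [
        (alpha * asr_score + (1 - alpha) * semantic_score(history, hyp), hyp)
        for hyp, asr_score in nbest
    ]
    return [hyp for _, hyp in sorted(rescored, reverse=True)]

# Toy example: the semantically coherent hypothesis is promoted over the
# acoustically higher-scoring homophone confusion.
history = "let us schedule the design review for friday"
nbest = [
    ("let us schedule the design review for friday", 0.60),
    ("lettuce schedule the design revue for friday", 0.62),
]
best = rescore_nbest(history, nbest)[0]
```

In this toy run, the history-aware term outweighs the small first-pass score gap, so the coherent hypothesis is selected.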