Conversational speech typically exhibits loose syntactic structure at the utterance level while maintaining topical coherence across consecutive utterances. Prior work has shown that capturing longer-span context with a recurrent neural network or long short-term memory language model (LM) may suffer from a recency bias that effectively excludes long-range context. To capture the long-term semantic interactions among words and across utterances, we propose several conversation-history fusion methods for language modeling in automatic speech recognition (ASR) of conversational speech. Furthermore, we introduce a novel audio-fusion mechanism that fuses and exploits, in a cooperative way, the acoustic embeddings of the current utterance and the semantic content of its corresponding conversation history. To flesh out our ideas, we frame ASR N-best hypothesis rescoring as a prediction problem, leveraging BERT, an iconic pre-trained LM, as the backbone model for selecting the oracle hypothesis from a given N-best list. Empirical experiments conducted on the AMI benchmark dataset demonstrate the feasibility and efficacy of our methods in comparison with several state-of-the-art baselines.
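The rescoring setup described above can be sketched as a minimal second-pass loop: interpolate each hypothesis's first-pass ASR score with a second-pass LM score and keep the argmax. Here `lm_score` is a hypothetical stand-in; in the paper's actual method a fine-tuned BERT conditioned on conversation history (and fused acoustic embeddings) would supply this score.

```python
import math

def lm_score(hypothesis: str) -> float:
    # Hypothetical placeholder for a BERT-derived score; a real system
    # would feed the hypothesis plus its conversation history to the
    # fine-tuned LM and read off a log-probability-like score.
    return -len(hypothesis.split()) * math.log(2.0)

def rescore_nbest(nbest, lm_weight: float = 0.5):
    """Select the best hypothesis from an N-best list by linearly
    interpolating the first-pass ASR score with the LM score.

    nbest: list of (hypothesis_text, asr_log_score) pairs.
    """
    def combined(item):
        text, asr_score = item
        return (1.0 - lm_weight) * asr_score + lm_weight * lm_score(text)
    return max(nbest, key=combined)

# Toy N-best list: the LM scorer ties here, so the first-pass
# score decides and the higher-scoring hypothesis wins.
best = rescore_nbest([("hello world", -3.2), ("hello word", -3.5)])
```

The interpolation weight `lm_weight` plays the usual role of balancing trust in the first-pass acoustic/LM scores against the second-pass rescorer, and would be tuned on held-out data.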