Transformer-based models have demonstrated their effectiveness in automatic speech recognition (ASR) tasks and have even shown superior performance over the conventional hybrid framework. The core idea of Transformers is to capture the long-range global context within an utterance through self-attention layers. However, for scenarios such as conversational speech, utterance-level modeling neglects contextual dependencies that span across utterances. In this paper, we propose to explicitly model inter-sentential information in a Transformer-based end-to-end architecture for conversational speech recognition. Specifically, for the encoder network, we capture the context of preceding speech and incorporate this historical information into the current input through a context-aware residual attention mechanism. For the decoder, the prediction of the current utterance is also conditioned on historical linguistic information through a conditional decoder framework. We demonstrate the effectiveness of the proposed method on several open-source dialogue corpora, where it consistently improves performance over utterance-level Transformer-based ASR models.
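The context-aware residual attention idea can be sketched as follows. This is a minimal illustration only, assuming a single attention head with no learned projections (the paper's actual mechanism operates inside a trained Transformer encoder): current-utterance frames act as queries over context vectors summarizing previous speech, and the attended context is fused back residually.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_aware_residual_attention(current, context):
    """Attend from current-utterance frames (queries) over historical
    context vectors (keys/values), then add the attended context to the
    current representation via a residual connection.

    current: (T_cur, d) frames of the current utterance
    context: (T_ctx, d) summary vectors of previous utterances
    """
    d = current.shape[-1]
    scores = current @ context.T / np.sqrt(d)   # (T_cur, T_ctx)
    weights = softmax(scores, axis=-1)          # attention over history
    attended = weights @ context                # (T_cur, d)
    return current + attended                   # residual fusion

# Hypothetical usage with random features
rng = np.random.default_rng(0)
cur = rng.standard_normal((5, 8))   # 5 frames of the current utterance
ctx = rng.standard_normal((3, 8))   # 3 context vectors from past speech
out = context_aware_residual_attention(cur, ctx)
```

The residual form means that when the historical context is uninformative, the output stays close to the original utterance-level representation, so cross-utterance conditioning degrades gracefully.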