Conversational automatic speech recognition (ASR) is the task of recognizing conversational speech involving multiple speakers. Unlike sentence-level ASR, conversational ASR can naturally take advantage of conversation-specific characteristics such as role preference and topical coherence. This paper proposes a conversational ASR model that explicitly learns conversation-level characteristics under the prevalent end-to-end neural framework. The highlights of the proposed model are twofold. First, a latent variational module (LVM) is attached to a Conformer-based encoder-decoder ASR backbone to learn role preference and topical coherence. Second, a topic model is adopted to bias the outputs of the decoder toward words in the predicted topics. Experiments on two Mandarin conversational ASR tasks show that the proposed model achieves up to a 12% relative reduction in character error rate (CER).
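To make the two highlighted components more concrete, the following is a minimal sketch, not the authors' implementation: a variational latent module that encodes conversation-level context into a latent vector via the reparameterization trick, and an output layer that adds a bias to decoder logits for vocabulary items in the predicted topic. All module and parameter names (e.g. `ctx_dim`, `topic_vocab_mask`, `bias_scale`) are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of an LVM and a topic-biased output layer for an
# encoder-decoder ASR model; names and shapes are assumptions.
import torch
import torch.nn as nn

class LatentVariationalModule(nn.Module):
    """Encodes conversation-level context (e.g. role / topic cues) into a
    latent vector z using the reparameterization trick."""
    def __init__(self, ctx_dim: int, latent_dim: int):
        super().__init__()
        self.to_mu = nn.Linear(ctx_dim, latent_dim)
        self.to_logvar = nn.Linear(ctx_dim, latent_dim)

    def forward(self, ctx: torch.Tensor):
        mu, logvar = self.to_mu(ctx), self.to_logvar(ctx)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # sample z
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl.mean()  # z is fused with decoder states; KL term joins the loss

class TopicBiasedOutput(nn.Module):
    """Biases decoder logits toward vocabulary items in the predicted topic."""
    def __init__(self, hidden_dim: int, vocab_size: int, bias_scale: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)
        self.bias_scale = bias_scale

    def forward(self, dec_state: torch.Tensor, topic_vocab_mask: torch.Tensor):
        # topic_vocab_mask: (vocab_size,) tensor, 1.0 for words in the predicted topic
        logits = self.proj(dec_state)
        return logits + self.bias_scale * topic_vocab_mask
```

In this reading, the LVM supplies a conversation-level latent vector to the decoder (trained with an auxiliary KL term), while the topic-bias layer simply shifts the decoder's output distribution toward topic-consistent words; how the paper actually fuses these signals with the Conformer backbone is not specified in the abstract.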