This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on two recent technologies that were developed independently: array-geometry-agnostic continuous speech separation, or VarArray, and streaming multi-talker ASR based on token-level serialized output training (t-SOT). To combine the best of both technologies, we design a new t-SOT-based ASR model that generates a serialized multi-talker transcription from the two separated speech signals produced by VarArray. We also propose a pre-training scheme for such an ASR model, in which we simulate VarArray's output signals from monaural single-talker ASR training data. Conversation transcription experiments on the AMI meeting corpus show that a system based on the proposed framework significantly outperforms conventional ones. Our system achieves state-of-the-art word error rates of 13.7% and 15.5% on the AMI development and evaluation sets, respectively, in the multiple-distant-microphone setting while retaining streaming inference capability.