This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize ``who spoke what'' with low latency even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT), which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities, we propose an encoder-decoder based speaker embedding extractor that can estimate a speaker representation for each recognized token, not only from non-overlapping speech but also from overlapping speech. The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with multi-talker transcription at low latency. We evaluate the proposed model on a joint task of ASR and SID/SD using the LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable, or sometimes even superior, results to the state-of-the-art offline SA-ASR model.
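To make the idea concrete, the following is a minimal, hypothetical PyTorch sketch of token-synchronous speaker-embedding extraction. It is not the paper's implementation: the streaming transducer is omitted, the cross-attention runs over all frames at once (a non-streaming simplification), and all module names, dimensions, and the \texttt{TVectorExtractor} and \texttt{speaker\_id} helpers are illustrative assumptions. It shows only the core mechanism suggested by the abstract: for each recognized token, the corresponding ASR decoder state queries frame-level speaker features to yield one embedding (a t-vector), which can then be matched against enrolled speaker profiles by cosine similarity for SID.

\begin{verbatim}
# Hypothetical sketch -- not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TVectorExtractor(nn.Module):
    """One speaker embedding (t-vector) per recognized ASR token."""
    def __init__(self, d_model=256, n_heads=4, emb_dim=128):
        super().__init__()
        # Frame-level speaker encoder (assumed architecture and size).
        self.frame_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2)
        # Decoder side: each token's ASR decoder state queries the frames.
        self.cross_attn = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, emb_dim)

    def forward(self, speech_feats, token_states):
        # speech_feats: (B, T, d_model) acoustic frame features
        # token_states: (B, N, d_model) one ASR decoder state per token
        #               emitted in the t-SOT serialized output stream
        frames = self.frame_encoder(speech_feats)
        attended, _ = self.cross_attn(token_states, frames, frames)
        return F.normalize(self.proj(attended), dim=-1)  # (B, N, emb_dim)

def speaker_id(tvecs, profiles):
    # profiles: (S, emb_dim) L2-normalized enrollment embeddings.
    # Assign each token to the closest enrolled speaker (cosine).
    return (tvecs @ profiles.T).argmax(dim=-1)  # (B, N) speaker index

# Usage with dummy tensors: 12 recognized tokens, 5 enrolled speakers.
extractor = TVectorExtractor()
tvecs = extractor(torch.randn(1, 200, 256), torch.randn(1, 12, 256))
profiles = F.normalize(torch.randn(5, 128), dim=-1)
print(speaker_id(tvecs, profiles).shape)  # torch.Size([1, 12])
\end{verbatim}

In the paper's streaming setting, the frame features and decoder states would instead be produced incrementally, so each t-vector becomes available as soon as its token is recognized; for SD rather than SID, the per-token t-vectors would be clustered rather than matched against enrolled profiles.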