Speaker change detection (SCD) improves the readability of automatic speech recognition (ASR) output by breaking the recognized word sequence into paragraphs at speaker change points. Existing SCD solutions either require an additional ensemble step to reconcile time-based decisions with the recognized word sequence, or integrate ASR and SCD tightly, which limits the best achievable performance of both tasks. To address these issues, we propose a novel framework for the SCD task in which an additional SCD module is built on top of an existing Transformer Transducer ASR (TT-ASR) network. We explore two variants of the SCD network in this framework; both naturally estimate a speaker change probability for each word while allowing ASR and SCD to use independent optimization schemes for the best performance. Experiments show that our methods significantly improve the F1 score on the LibriCSS and Microsoft call center data sets without ASR degradation, compared with a joint SCD and ASR baseline.
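To make the word-level formulation concrete, the following is a minimal sketch of how per-word speaker change probabilities could be turned into paragraphs and scored with F1. It is not the proposed SCD network itself. The function names, the 0.5 threshold, the convention that a probability marks the first word of a new speaker turn, and the exact-match F1 (no tolerance window) are all assumptions made for illustration.

```python
from typing import List, Sequence, Set


def split_into_paragraphs(
    words: Sequence[str],
    change_probs: Sequence[float],
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> List[str]:
    """Start a new paragraph at every word whose change probability
    reaches the threshold.

    Assumption: change_probs[i] is the probability that words[i] is the
    first word spoken by a new speaker.
    """
    if len(words) != len(change_probs):
        raise ValueError("words and change_probs must have the same length")

    paragraphs: List[List[str]] = []
    for word, prob in zip(words, change_probs):
        # The very first word always opens a paragraph.
        if not paragraphs or prob >= threshold:
            paragraphs.append([])
        paragraphs[-1].append(word)
    return [" ".join(p) for p in paragraphs]


def change_point_f1(predicted: Set[int], reference: Set[int]) -> float:
    """Exact-match F1 over word indices marked as speaker change points.

    Assumption: a predicted change counts as correct only if it falls on
    exactly the same word index as a reference change.
    """
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    words = ["hello", "there", "hi", "how", "are", "you"]
    probs = [0.1, 0.2, 0.9, 0.1, 0.05, 0.3]  # made-up values
    for paragraph in split_into_paragraphs(words, probs):
        print(paragraph)
    print(change_point_f1({2, 5}, {2}))
```

Because the decision is attached to each recognized word, no separate step is needed to align time-based change points with the transcript, which is the motivation stated above for estimating the probability per word.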