Streaming recognition and segmentation of multi-party conversations with overlapping speech is crucial for the next generation of voice assistant applications. In this work we address challenges discovered in previous work on the multi-turn recurrent neural network transducer (MT-RNN-T) with a novel approach, the separator-transducer-segmenter (STS), which enables tighter integration of speech separation, recognition, and segmentation in a single model. First, we propose a new segmentation modeling strategy based on start-of-turn and end-of-turn tokens that improves segmentation without degrading recognition accuracy. Second, we further improve both speech recognition and segmentation accuracy through an emission regularization method, FastEmit, and through multi-task training with speech activity information as an additional training signal. Third, we experiment with an end-of-turn emission latency penalty to improve end-point detection for each speaker turn. Finally, we establish a novel framework for segmentation analysis of multi-party conversations based on emission latency metrics. With our best model, we report a 4.6% absolute improvement in turn counting accuracy and a 17% relative word error rate (WER) improvement on the LibriCSS dataset compared to previously published work.