In recent years, speaker diarization has attracted widespread attention. To achieve better performance, some studies propose to diarize speech in multiple stages. Although these methods can bring additional benefits, most of them are quite complex. Motivated by spelling correction in automatic speech recognition (ASR), in this paper we propose an end-to-end error correction framework, termed DiaCorrect, to refine the initial diarization results in a simple but efficient way. By exploiting the acoustic interactions between the input mixture and its corresponding speaker activity, DiaCorrect can automatically adapt the initial speaker activity to minimize diarization errors. Without bells and whistles, experiments on LibriSpeech-based 2-speaker meeting-like data show that the self-attentive end-to-end neural diarization (SA-EEND) baseline with DiaCorrect reduces its diarization error rate (DER) by about 62.4% relative, from 12.31% to 4.63%. Our source code is available online at https://github.com/jyhan03/diacorrect.
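The reported improvement is a relative reduction, not an absolute one. As a quick check of the arithmetic, the sketch below recomputes it from the two DER values quoted in the abstract; the function name `relative_reduction` is our own and is not part of the DiaCorrect code base.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the reduction from `baseline` to `improved` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# DER values (in percent) as quoted in the abstract.
baseline_der = 12.31   # SA-EEND baseline
corrected_der = 4.63   # SA-EEND followed by DiaCorrect

absolute_drop = baseline_der - corrected_der            # in percentage points
relative_drop = relative_reduction(baseline_der, corrected_der)

print(f"absolute: {absolute_drop:.2f} points, relative: {relative_drop:.1f}%")
```

The absolute drop is 7.68 percentage points, which corresponds to a relative reduction of roughly 62.4% of the baseline DER.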