This paper proposes an online end-to-end speaker diarization method that handles overlapping speech and a variable number of speakers. The end-to-end neural speaker diarization (EEND) model has already achieved significant improvements over conventional clustering-based methods. However, the original EEND has two limitations: (i) it does not perform well in online scenarios; (ii) the number of speakers must be fixed in advance. This paper addresses both problems by applying a modified extension of the speaker-tracing buffer method that deals with variable numbers of speakers. Experiments on the CALLHOME and DIHARD II datasets show that the proposed online method achieves performance comparable to the offline EEND method. Compared with the state-of-the-art online method based on a fully supervised approach (UIS-RNN), the proposed method shows better performance on the DIHARD II dataset.