Recently, we proposed a novel speaker diarization method called End-to-End-Neural-Diarization-vector clustering (EEND-vector clustering) that integrates clustering-based and end-to-end neural network-based diarization approaches into one framework. The proposed method combines the advantages of both frameworks, i.e., the high diarization performance and handling of overlapped speech offered by EEND, and the robust handling of long recordings with an arbitrary number of speakers offered by clustering-based approaches. However, the method has so far been evaluated only on simulated 2-speaker meeting-like data. This paper (1) reports recent advances we made to this framework, including newly introduced robust constrained clustering algorithms, and (2) experimentally shows that the method can now significantly outperform competitive diarization methods, such as Encoder-Decoder Attractor (EDA)-EEND, on CALLHOME data, which comprises real conversational speech including overlapped speech and an arbitrary number of speakers. By further analyzing the experimental results, this paper also discusses the pros and cons of the proposed method and reveals potential for further improvement.