Speaker embedding extractors (EEs), which map input audio to a speaker-discriminative latent space, are of paramount importance in speaker diarisation. However, adopting EEs for diarisation raises several challenges, of which we tackle two. First, evaluation is not straightforward because the features required for good performance differ between speaker verification and diarisation. We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance. Second, embedding extractors have not been exposed to utterances that contain multiple speakers. Such inputs are inevitable in speaker diarisation because of overlapped speech and speaker changes, and they degrade performance. To mitigate the first problem, we generate speaker verification evaluation protocols that better mimic the diarisation scenario. To alleviate the second problem, we propose two data augmentation techniques that make embedding extractors aware of overlapped speech and speaker change inputs. One technique generates overlapped speech segments, and the other generates segments in which two speakers speak sequentially. Extensive experimental results using three state-of-the-art speaker embedding extractors demonstrate that both proposed approaches are effective.
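The two augmentation techniques named in the abstract could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `overlap_segments` and `speaker_change_segment`, the energy-ratio parameter `ratio_db`, and the fixed change point `change_at` are all hypothetical, and the abstract does not say how mixing levels, change points, or training labels are chosen.

```python
import numpy as np


def overlap_segments(main, other, ratio_db=5.0):
    """Overlapped-speech sketch: add `other` on top of `main`.

    Both inputs are 1-D float waveforms from two different speakers.
    `ratio_db` (an assumed parameter) is the energy ratio of `main`
    to the scaled `other`, in decibels.
    """
    n = min(len(main), len(other))
    main, other = main[:n], other[:n]
    # Small constant avoids division by zero on silent input.
    p_main = np.mean(main ** 2) + 1e-12
    p_other = np.mean(other ** 2) + 1e-12
    gain = np.sqrt(p_main / (p_other * 10 ** (ratio_db / 10)))
    return main + gain * other


def speaker_change_segment(first, second, length, change_at):
    """Speaker-change sketch: `first` speaks, then `second` takes over.

    The output has `length` samples; the first `change_at` samples come
    from `first` and the remainder from `second`.
    """
    if not 0 < change_at < length:
        raise ValueError("change_at must lie strictly inside the segment")
    if len(first) < change_at or len(second) < length - change_at:
        raise ValueError("input waveforms are too short")
    return np.concatenate([first[:change_at], second[:length - change_at]])
```

In a training pipeline, segments produced this way would be mixed in with ordinary single-speaker segments. How the resulting segments are labelled is not stated in the abstract and is deliberately left out of the sketch.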