Speaker diarization algorithms address the "who spoke when" problem in audio recordings. Algorithms trained end-to-end have proven superior to classical modular-cascaded systems in constrained scenarios with a small number of speakers. However, their performance on in-the-wild recordings containing more speakers with shorter utterance lengths remains to be investigated. In this paper, we address this gap, showing that an attractor-based end-to-end system can also perform remarkably well in the latter scenario when first pre-trained on a carefully designed simulated dataset that matches the distribution of in-the-wild recordings. We also propose to use an attention mechanism to increase the network's capacity to decode more speaker attractors, and to jointly train the attractors on a speaker recognition task to improve the speaker attractor representation. Even though the model we propose is audio-only, we find that it significantly outperforms both audio-only and audio-visual baselines on the AVA-AVD benchmark dataset, achieving state-of-the-art results with an absolute reduction in diarization error of 23.3%.
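The attention-based attractor decoding mentioned above can be sketched roughly as follows. This is a minimal NumPy illustration under assumptions, not the paper's implementation: attractors are modeled as learnable queries that cross-attend over frame embeddings, and per-frame speaker activities are taken as sigmoid-scaled frame-attractor dot products; the query initialization, scaling, and posterior computation are all simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_attractors(frame_emb, queries):
    """Cross-attention: each attractor query attends over all frame embeddings.

    frame_emb: (T, D) encoder outputs; queries: (S, D) learnable attractor queries.
    Returns (S, D) speaker attractors (attention-weighted averages of frames).
    """
    scores = queries @ frame_emb.T / np.sqrt(frame_emb.shape[1])  # (S, T)
    weights = softmax(scores, axis=-1)                            # attend over time
    return weights @ frame_emb                                    # (S, D)

def diarization_posteriors(frame_emb, attractors):
    """Frame-vs-attractor similarities, squashed to per-frame speaker activities."""
    return sigmoid(frame_emb @ attractors.T)  # (T, S), each entry in (0, 1)

# Toy example: 200 frames, 64-dim embeddings, up to 5 speakers (hypothetical sizes).
rng = np.random.default_rng(0)
T, D, S = 200, 64, 5
frames = rng.standard_normal((T, D))
queries = rng.standard_normal((S, D))
att = decode_attractors(frames, queries)
post = diarization_posteriors(frames, att)
```

In a trained system, the queries and the encoder producing `frames` would be learned jointly, with the attractors additionally supervised by a speaker recognition loss as described above.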