For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, we combine LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system. We evaluate our system on three standard public datasets, and the results suggest that d-vector based diarization systems offer significant advantages over traditional i-vector based systems. In particular, we achieve a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, even though our model is trained on out-of-domain data from voice search logs.