Speaker diarization is the task of labeling audio or video recordings with classes that correspond to speaker identity, or in short, the task of identifying "who spoke when". In its early years, speaker diarization algorithms were developed for speech recognition on multi-speaker audio recordings to enable speaker-adaptive processing, but over time diarization also gained value as a stand-alone application that provides speaker-specific meta-information for downstream tasks such as audio retrieval. More recently, with the rise of deep learning, which has driven revolutionary changes in research and practice across speech application domains over the past decade, speaker diarization has advanced more rapidly. In this paper, we review not only the historical development of speaker diarization technology but also recent advances in neural speaker diarization approaches. We also discuss how speaker diarization systems have been integrated with speech recognition applications and how the recent surge of deep learning is leading toward joint modeling of these two components so that they complement each other. Given these technical trends, we believe that a survey consolidating recent developments in neural methods is a valuable contribution to the community and will facilitate further progress towards more efficient speaker diarization.