Speaker diarization is the task of labeling audio or video recordings with classes that correspond to speaker identity, or in short, the task of identifying "who spoke when". In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker-adaptive processing. Over time, these algorithms also gained value as a standalone application that provides speaker-specific meta-information for downstream tasks such as audio retrieval. More recently, with the emergence of deep learning technology, which has driven revolutionary changes in research and practice across speech application domains, speaker diarization has advanced rapidly. In this paper, we review not only the historical development of speaker diarization technology but also the recent advancements in neural speaker diarization approaches. Furthermore, we discuss how speaker diarization systems have been integrated with speech recognition applications and how the recent surge of deep learning is leading toward joint modeling of these two components so that they complement each other. Considering these technical trends, we believe that this paper is a valuable contribution to the community: it provides a survey that consolidates recent developments in neural methods and thus facilitates further progress toward more efficient speaker diarization.