As multimedia content has grown over the years, audio is recorded in an increasingly wide variety of environments. An audio processing system can benefit from a front-end module that identifies the acoustic domain. In this paper, we demonstrate the idea of \emph{acoustic domain identification} (ADI) for \emph{speaker diarization}. We first present a detailed study of the domains of the third DIHARD challenge, highlighting the factors that differentiate them from one another. Our main contribution is a simple and efficient solution for ADI, for which we explore speaker embeddings. We then integrate the ADI module with the speaker diarization framework of the DIHARD III challenge. Performance improved substantially over the baseline when the thresholds for agglomerative hierarchical clustering were optimized separately for each domain. On Track 1 of the DIHARD III evaluation set, we achieved relative improvements in diarization error rate (DER) of more than $5\%$ and $8\%$ for the core and full conditions, respectively.
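The idea of domain-dependent clustering thresholds can be illustrated with a minimal sketch. It assumes that an ADI front-end has already produced a domain label for the recording and that segment-level speaker embeddings are available. The domain names, threshold values, and the function `cluster_segments` are hypothetical placeholders, and cosine distance with average linkage is used here only for simplicity; the actual DIHARD III baseline scoring and tuned thresholds are not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical per-domain stopping thresholds (cosine distance).
# These are placeholder values for illustration, not results from the paper.
DOMAIN_THRESHOLDS = {"audiobooks": 0.9, "restaurant": 0.2}
DEFAULT_THRESHOLD = 0.5  # assumed fallback when the domain is not listed


def cluster_segments(embeddings, domain):
    """Run average-linkage AHC and cut the tree at the domain's threshold.

    embeddings: array of shape (n_segments, dim), one speaker embedding per segment.
    domain: label predicted by the (assumed) ADI front-end.
    Returns one integer speaker label per segment.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    threshold = DOMAIN_THRESHOLDS.get(domain, DEFAULT_THRESHOLD)
    tree = linkage(embeddings, method="average", metric="cosine")
    # Clusters whose merge distance exceeds the threshold stay separate.
    return fcluster(tree, t=threshold, criterion="distance")
```

With a high threshold, more segments are merged and fewer speakers are hypothesized; with a low threshold, the opposite holds. This is why a single global threshold may suit a single-speaker domain poorly while working for a multi-speaker one.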