Many modern systems for speaker diarization, such as the recently developed VBx approach, rely on clustering of DNN speaker embeddings followed by resegmentation. Two problems with this approach are that the DNN is not directly optimized for this task, and that the parameters need significant retuning for different applications. We have recently presented progress on both problems with a Leave-One-Out Gaussian PLDA (LGP) clustering algorithm and an approach to training the DNN so that its embeddings directly optimize the performance of this scoring method. This paper presents a new two-pass version of this system, where the second pass uses finer time resolution to significantly improve overall performance. For the Callhome corpus, we achieve the first published error rate below 4\% without any task-dependent parameter tuning. We also show significant progress towards a robust single solution for multiple diarization tasks.