In online speaker diarization, samples arrive incrementally, and the overall distribution of the samples is not visible to the clustering algorithm. Moreover, in most existing clustering-based methods, the training objective of the embedding extractor is not designed specifically for clustering. To improve online speaker diarization performance, we propose a unified online clustering framework that enables interaction between the embedding extractor and the clustering algorithm. Specifically, the framework consists of two tightly coupled parts: clustering-guided recurrent training (CGRT) and truncated beam searching clustering (TBSC). CGRT introduces the clustering algorithm into the training process of the embedding extractor, which provides not only cluster-aware information for the embedding extractor but also crucial parameters for the subsequent clustering process. Using these parameters, which contain preliminary information about the metric space, TBSC penalizes the probability score of each cluster in order to output more accurate clustering results in an online fashion with low latency. With these innovations, our proposed online clustering system achieves 14.48\% DER with a collar of 0.25 at 2.5s latency on AISHELL-4, compared with 14.57\% DER for offline agglomerative hierarchical clustering.
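To illustrate the general idea of beam search over online cluster assignments, the following is a minimal sketch only and not the TBSC algorithm itself. It assumes cosine similarity to a running cluster sum as the assignment score, and a hand-set constant `new_cluster_score` as a hypothetical stand-in for the parameters that CGRT would supply. The function name, the parameters, and the toy embeddings are all assumptions made for illustration. The sketch prunes hypotheses to a fixed beam width but does not commit labels within a latency window.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def beam_search_cluster(embeddings, beam_width=4, new_cluster_score=0.5):
    """Assign each incoming embedding to a cluster, keeping several hypotheses alive.

    Each hypothesis is (total_score, labels, sums), where sums[k] is the running
    vector sum of cluster k. Cosine similarity to the sum equals cosine similarity
    to the centroid. new_cluster_score is a hypothetical, hand-set score for
    opening a new cluster; it is not a value taken from the paper.
    """
    beam = [(0.0, [], [])]
    for emb in embeddings:
        candidates = []
        for score, labels, sums in beam:
            # Option 1: join an existing cluster, scored by similarity to it.
            for k, s in enumerate(sums):
                new_sums = list(sums)
                new_sums[k] = [u + v for u, v in zip(s, emb)]
                candidates.append((score + cosine(emb, s), labels + [k], new_sums))
            # Option 2: open a new cluster at a fixed score.
            candidates.append(
                (score + new_cluster_score, labels + [len(sums)], sums + [list(emb)])
            )
        # Keep only the best beam_width hypotheses.
        candidates.sort(key=lambda h: h[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][1]


# Toy 2-D "embeddings": two speakers pointing along different axes.
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
labels = beam_search_cluster(stream)
```

Keeping several hypotheses means an early assignment can still be revised when later samples make another partition score higher, which a greedy one-pass assignment cannot do.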