In speaker diarisation, speaker embedding extraction models often suffer from a mismatch between their training loss functions and the speaker clustering method. In this paper, we propose spectral clustering-aware learning of embeddings (SCALE) to address this mismatch. Specifically, in addition to an angular prototypical (AP) loss, SCALE uses a novel affinity matrix loss that directly minimises the error between the affinity matrix estimated from the speaker embeddings and the reference affinity matrix. SCALE also brings p-percentile thresholding and Gaussian blur, two important hyper-parameters of spectral clustering, into training. Experiments on the AMI dataset show that speaker embeddings obtained with SCALE achieve relative speaker error rate reductions of over 50% with oracle segmentation, and relative diarisation error rate reductions of over 30% with automatic segmentation, compared to a strong baseline that uses AP-loss-based speaker embeddings.
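The idea of an affinity matrix loss can be illustrated with a minimal NumPy sketch. It is not the formulation used in the paper: the cosine affinity rescaled to [0, 1], the order of operations (blur first, then row-wise thresholding), the mean squared error, and all function names and default values are assumptions made for illustration. It only computes the forward value of such a loss; a trainable version would need a differentiable framework and a decision on how gradients pass through the thresholding step.

```python
import numpy as np


def cosine_affinity(embeddings):
    """Pairwise cosine similarity, rescaled from [-1, 1] to [0, 1] (assumed choice)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return (unit @ unit.T + 1.0) / 2.0


def gaussian_blur(matrix, sigma=1.0):
    """Separable Gaussian blur over rows and columns, with zero padding at the edges."""
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()

    def blur_1d(x):
        # Padding by `radius` on both sides keeps the output length equal to len(x).
        return np.convolve(np.pad(x, radius), kernel, mode="valid")

    blurred = np.apply_along_axis(blur_1d, 1, matrix)
    return np.apply_along_axis(blur_1d, 0, blurred)


def percentile_threshold(affinity, p=90.0):
    """Row-wise thresholding: zero every entry below the row's p-th percentile."""
    cutoffs = np.percentile(affinity, p, axis=1, keepdims=True)
    return np.where(affinity >= cutoffs, affinity, 0.0)


def affinity_matrix_loss(embeddings, speaker_ids, p=90.0, sigma=1.0):
    """Mean squared error between the processed affinity matrix and the reference.

    The reference affinity is 1 where two segments share a speaker label, else 0.
    """
    speaker_ids = np.asarray(speaker_ids)
    reference = (speaker_ids[:, None] == speaker_ids[None, :]).astype(float)
    estimated = cosine_affinity(np.asarray(embeddings, dtype=float))
    estimated = gaussian_blur(estimated, sigma)        # assumed order: blur first
    estimated = percentile_threshold(estimated, p)     # then row-wise thresholding
    return float(np.mean((estimated - reference) ** 2))


# Toy example: six segments from two well-separated speakers.
toy_embeddings = np.array([[1.0, 0.0]] * 3 + [[-1.0, 0.0]] * 3)
toy_labels = np.array([0, 0, 0, 1, 1, 1])
toy_loss = affinity_matrix_loss(toy_embeddings, toy_labels, p=50.0, sigma=0.5)
```

In this sketch the loss is small when the thresholded, blurred affinity matrix already has the block structure of the reference, which is the structure spectral clustering relies on at inference time.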