Probabilistic Linear Discriminant Analysis (PLDA) was the dominant and, in practice, necessary back-end for early speaker recognition approaches such as i-vector and x-vector. However, with the development of neural networks and margin-based loss functions, we can now obtain deep speaker embeddings (DSEs) that exhibit larger inter-class separation and smaller intra-class distances. For such discriminative embeddings, PLDA seems unnecessary or even counterproductive, and cosine similarity scoring (Cos) outperforms PLDA in some situations. Motivated by this, in this paper we systematically explore how to select the back-end (Cos or PLDA) for deep speaker embeddings to achieve better performance in different situations. By analyzing PLDA and the properties of DSEs extracted from models with different numbers of segment-level layers, we conjecture that Cos is better in same-domain situations while PLDA is better in cross-domain situations. We conduct experiments on the VoxCeleb and NIST SRE datasets in four application situations, single-/multi-domain training and same-/cross-domain testing, to validate our conjecture and to briefly explain why back-end adaptation algorithms work.
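For reference, the Cos back-end discussed above is simply the cosine of the angle between an enrollment embedding and a test embedding. Below is a minimal sketch of this scoring rule; the embedding dimension, variable names, and random vectors are illustrative assumptions, not values from the paper.

```python
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity score between two speaker embeddings.

    Both embeddings are length-normalized, so the score is the dot
    product of the unit vectors, lying in [-1, 1].
    """
    enroll_emb = enroll_emb / np.linalg.norm(enroll_emb)
    test_emb = test_emb / np.linalg.norm(test_emb)
    return float(np.dot(enroll_emb, test_emb))

# Illustrative usage with two hypothetical 256-dimensional embeddings.
rng = np.random.default_rng(0)
enroll = rng.standard_normal(256)
test = rng.standard_normal(256)
print(cosine_score(enroll, test))
```

Unlike PLDA, this back-end has no trainable parameters, which is why its behavior depends entirely on how discriminative the embeddings themselves are.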