State-of-the-art speaker verification (SV) systems use a back-end model to score the similarity of speaker embeddings extracted from a neural network model. The commonly used back-end models are cosine scoring and probabilistic linear discriminant analysis (PLDA) scoring. With recently developed neural embeddings, the theoretically more appealing PLDA approach is found to have no advantage over, or even to be inferior to, the simple cosine scoring in terms of SV system performance. This paper presents an investigation into the relation between the two scoring approaches, aiming to explain the above counter-intuitive observation. It is shown that cosine scoring is essentially a special case of PLDA scoring; in other words, by properly setting the parameters of PLDA, the two back-ends become equivalent. As a consequence, cosine scoring not only inherits the basic assumptions of PLDA but also introduces additional assumptions on the properties of the input embeddings. Experiments show that the dimensional independence assumption required by cosine scoring contributes most to the performance gap between the two methods under the domain-matched condition. When there is severe domain mismatch and the dimensional independence assumption does not hold, PLDA performs better than cosine scoring for domain adaptation.
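The special-case claim can be illustrated numerically. The sketch below is not the paper's derivation but a minimal check under stated assumptions: a two-covariance PLDA model with isotropic between-speaker and within-speaker covariances (`B = b·I`, `W = w·I`) and length-normalized embeddings. Under these assumptions the PLDA log-likelihood ratio reduces to an affine function of the cosine similarity, so the two back-ends produce identical score rankings:

```python
import numpy as np

def plda_llr(x1, x2, B, W):
    """Two-covariance PLDA log-likelihood ratio:
    log p(x1, x2 | same speaker) - log p(x1, x2 | different speakers)."""
    d = len(x1)
    z = np.concatenate([x1, x2])
    # Same-speaker joint covariance: [[B+W, B], [B, B+W]]
    S_same = np.block([[B + W, B], [B, B + W]])
    # Different-speaker joint covariance: block-diagonal [[B+W, 0], [0, B+W]]
    S_diff = np.block([[B + W, np.zeros((d, d))],
                       [np.zeros((d, d)), B + W]])

    def logpdf(v, S):
        _, logdet = np.linalg.slogdet(S)
        return -0.5 * (v @ np.linalg.solve(S, v)
                       + logdet + len(v) * np.log(2 * np.pi))

    return logpdf(z, S_same) - logpdf(z, S_diff)

rng = np.random.default_rng(0)
d = 8
B = 2.0 * np.eye(d)  # isotropic between-speaker covariance (assumption)
W = 0.5 * np.eye(d)  # isotropic within-speaker covariance (assumption)

def unit(v):
    return v / np.linalg.norm(v)

pairs = [(unit(rng.normal(size=d)), unit(rng.normal(size=d)))
         for _ in range(5)]
cos = np.array([x1 @ x2 for x1, x2 in pairs])
llr = np.array([plda_llr(x1, x2, B, W) for x1, x2 in pairs])

# With isotropic covariances and unit-norm inputs, llr = a + c * cos exactly.
c, a = np.polyfit(cos, llr, 1)  # slope, intercept
print(np.allclose(llr, a + c * cos))
```

With anisotropic `B` or `W` (i.e., when the dimensional independence assumption fails), the affine relation breaks and the two back-ends can diverge, which is the regime the abstract associates with domain mismatch.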