Integrating a joint Bayesian generative model in a discriminative learning framework for speaker verification

The task of speaker verification (SV) is to decide whether an utterance was spoken by a target speaker or an impostor. In most SV studies, a log-likelihood ratio (LLR) score is estimated from a generative probability model of speaker features and compared with a threshold to make the decision. However, a generative model usually focuses on feature distributions and lacks the ability to select discriminative features, so it is easily distracted by nuisance features. As a hypothesis test, SV can be formulated as a binary classification task to which neural network (NN) based discriminative learning can be applied. Through discriminative learning, nuisance features can be removed with the help of label supervision. However, discriminative learning pays more attention to classification boundaries, which makes it prone to overfitting the training data and generalizing poorly to test data. In this paper, we propose a hybrid learning framework that integrates a joint Bayesian (JB) generative model into a neural discriminative learning framework for SV. A Siamese NN is built with dense layers to approximate the mapping functions used in the SV pipeline with the JB model, and the LLR score estimated from the JB model is connected to the distance metric in pairwise discriminative learning. After initializing the Siamese NN with the parameters learned from the JB model, we further train the model parameters on pairwise samples as a binary discrimination task. Moreover, a direct evaluation metric in SV, the minimum empirical Bayes risk, is designed and integrated as an objective function in the discriminative learning. We carried out SV experiments on the Speakers in the Wild (SITW) and VoxCeleb corpora. Experimental results showed that the proposed model improved performance by a large margin compared with state-of-the-art models for SV.
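To make the LLR decision rule concrete, the following is a minimal sketch of scoring a trial under a generic joint Bayesian model. It is not the paper's implementation. It assumes mean-centered speaker features of the form x = mu + eps, with a between-speaker covariance `s_mu` and a within-speaker covariance `s_eps`. The function names, the toy covariances, and the threshold of 0.0 are all hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_logpdf(z, cov):
    """Log density of a zero-mean multivariate Gaussian evaluated at z."""
    d = z.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = z @ np.linalg.solve(cov, z)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)


def jb_llr(x1, x2, s_mu, s_eps):
    """LLR of 'same speaker' versus 'different speakers' for a feature pair.

    Assumed model (illustrative): x = mu + eps, mu ~ N(0, s_mu),
    eps ~ N(0, s_eps). Under the same-speaker hypothesis the two features
    share mu, so their cross-covariance is s_mu; under the
    different-speaker hypothesis they are independent.
    """
    total = s_mu + s_eps
    zeros = np.zeros_like(s_mu)
    cov_same = np.block([[total, s_mu], [s_mu, total]])
    cov_diff = np.block([[total, zeros], [zeros, total]])
    z = np.concatenate([x1, x2])
    return gaussian_logpdf(z, cov_same) - gaussian_logpdf(z, cov_diff)


# Toy example: 2-dimensional features with made-up covariances.
s_mu = np.eye(2)          # between-speaker variability (assumed)
s_eps = 0.1 * np.eye(2)   # within-speaker variability (assumed)
x = np.array([1.0, -0.5])

score_target = jb_llr(x, x, s_mu, s_eps)      # identical features
score_impostor = jb_llr(x, -x, s_mu, s_eps)   # opposite features

threshold = 0.0  # hypothetical operating point
accept_target = score_target > threshold
accept_impostor = score_impostor > threshold
```

In this toy setting the pair of identical features scores above the threshold and the pair of opposite features scores below it. The abstract's hybrid approach goes further by treating the mappings behind such a score as trainable layers of a Siamese network, which this sketch does not attempt.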
