Mismatch between enrollment and test conditions causes serious performance degradation in speaker recognition systems. This paper presents a statistics decomposition (SD) approach to solve this problem. The approach decomposes the PLDA score into three components, corresponding to enrollment, prediction, and normalization, respectively. Provided that the correct statistics are used in each component, the resulting score is theoretically optimal. A comprehensive experimental study was conducted on three datasets with different types of mismatch: (1) physical channel mismatch, (2) speaking behavior mismatch, and (3) near-far recording mismatch. The results demonstrate that the proposed SD approach is highly effective and outperforms the ad hoc multi-condition training approach, which is commonly adopted but not optimal in theory.