Speaker verification (SV) systems are currently being used to make sensitive decisions, such as granting access to bank accounts or determining whether a suspect's voice matches that of the perpetrator of a crime. Ensuring that these systems are fair and do not disfavor any particular group is crucial. In this work, we analyze the performance of several state-of-the-art SV systems across groups defined by the accent of the speakers when speaking English. To this end, we curated a new dataset based on the VoxCeleb corpus, carefully selecting samples from speakers with accents from different countries. We use this dataset to evaluate the performance of several SV systems trained with VoxCeleb data. We show that, while discrimination performance is reasonably robust across accent groups, calibration performance degrades dramatically on some accents that are not well represented in the training data. Finally, we show that a simple data balancing approach mitigates this undesirable bias, being particularly effective when applied to our recently proposed discriminative condition-aware backend.