Despite the success of deep neural networks (DNNs) in enabling on-device voice assistants, increasing evidence of bias and discrimination in machine learning is raising the urgency of investigating the fairness of these systems. Speaker verification is a form of biometric identification that grants access to voice assistants. Due to a lack of fairness metrics and evaluation frameworks appropriate for testing the fairness of speaker verification components, little is known about how model performance varies across subgroups, and what factors influence performance variation. To tackle this emerging challenge, we design and develop SVEva Fair, an accessible, actionable and model-agnostic framework for evaluating the fairness of speaker verification components. The framework provides evaluation measures and visualisations to interrogate model performance across speaker subgroups and to compare fairness between models. We demonstrate SVEva Fair in a case study with end-to-end DNNs trained on the VoxCeleb datasets to reveal potential bias in existing embedded speech recognition systems based on the demographic attributes of speakers. Our evaluation shows that publicly accessible benchmark models are not fair and consistently produce worse predictions for some nationalities, and for female speakers of most nationalities. To pave the way for fair and reliable embedded speaker verification, SVEva Fair has been implemented as an open-source Python library that can be integrated into the embedded ML development pipeline, helping developers and researchers troubleshoot unreliable speaker verification performance and select high-impact approaches for mitigating fairness challenges.
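To make the subgroup comparison concrete, the sketch below illustrates one evaluation measure of the kind the framework provides: the equal error rate (EER), a standard speaker verification metric, computed separately per demographic subgroup so that disparities between, say, nationality and gender groups become visible. This is a minimal, hypothetical illustration and does not reproduce the actual SVEva Fair API; the function names, the threshold sweep, and the toy trial data are assumptions for exposition.

```python
# Hypothetical sketch (not the SVEva Fair API): per-subgroup equal error
# rate (EER) from speaker verification trial scores.
import numpy as np

def eer(scores, labels):
    """EER for one set of trials.

    scores: similarity scores (higher = more likely same speaker).
    labels: 1 for target (same-speaker) trials, 0 for impostor trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fars, frrs = [], []
    # Sweep the decision threshold over all observed score values.
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))   # false acceptance rate
        frrs.append(np.mean(~accept[labels == 1]))  # false rejection rate
    fars, frrs = np.array(fars), np.array(frrs)
    i = np.argmin(np.abs(fars - frrs))               # point where FAR ~= FRR
    return (fars[i] + frrs[i]) / 2

def subgroup_eers(scores, labels, groups):
    """EER per subgroup (e.g. nationality x gender) for fairness comparison."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    groups = np.asarray(groups)
    return {g: eer(scores[groups == g], labels[groups == g])
            for g in sorted(set(groups))}

# Toy example: a fair model would yield similar EERs across subgroups.
scores = [0.9, 0.2, 0.8, 0.4, 0.7, 0.6, 0.3, 0.1]
labels = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["US-f", "US-f", "US-m", "US-m", "UK-f", "UK-f", "UK-m", "UK-m"]
print(subgroup_eers(scores, labels, groups))
```

Comparing such per-subgroup error rates side by side, as the framework's visualisations do, is what reveals the kind of disparity reported above, where female speakers of most nationalities receive consistently worse predictions.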