The human voice conveys unique characteristics of an individual, making voice biometrics a key technology for verifying identities in various industries. Despite the impressive progress of speaker recognition systems in terms of accuracy, a number of ethical and legal concerns have been raised, specifically relating to the fairness of such systems. In this paper, we explore the disparity in performance achieved by state-of-the-art deep speaker recognition systems when different groups of individuals characterized by a common sensitive attribute (e.g., gender) are considered. To mitigate the unfairness uncovered by our exploratory study, we investigate whether balancing the representation of the different groups of individuals in the training set can lead to a more equal treatment of these demographic groups. Experiments on two state-of-the-art neural architectures and a large-scale public dataset show that models trained with demographically balanced training sets exhibit fairer behavior across groups while remaining accurate. Our study is expected to provide a solid basis for instilling beyond-accuracy objectives (e.g., fairness) in speaker recognition.
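To make the balancing idea concrete, below is a minimal Python sketch of one plausible way to build a demographically balanced training set: downsampling every group defined by the sensitive attribute to the size of the smallest group. The metadata layout (speaker dicts with `id` and `gender` fields), the function name `balance_by_attribute`, and the downsampling strategy itself are illustrative assumptions, not the paper's actual pipeline, which could equally use oversampling or weighted sampling.

```python
import random
from collections import defaultdict

def balance_by_attribute(speakers, attribute="gender", seed=42):
    """Subsample speakers so that each value of the sensitive attribute
    (e.g., gender) is equally represented in the returned training pool.

    `speakers` is a list of dicts such as {"id": "spk0", "gender": "f"} --
    a hypothetical metadata layout assumed for this sketch.
    """
    # Bucket speakers by the value of the sensitive attribute.
    groups = defaultdict(list)
    for spk in speakers:
        groups[spk[attribute]].append(spk)

    # The smallest group caps every group, yielding a uniform
    # distribution over the attribute's values.
    per_group = min(len(members) for members in groups.values())

    rng = random.Random(seed)
    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, per_group))
    return balanced

# Example: a 40% female / 60% male pool becomes 50/50 after balancing.
pool = [{"id": f"spk{i}", "gender": "f" if i < 40 else "m"} for i in range(100)]
train = balance_by_attribute(pool)
```

Downsampling trades some training data for parity; in practice one would verify, as the paper's experiments do, that accuracy on each group is preserved after rebalancing.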