Natural Language Processing (NLP) systems learn harmful societal biases that cause them to amplify inequality as they are deployed in more and more situations. To guide efforts at debiasing these systems, the NLP community relies on a variety of metrics that quantify bias in models. Some of these metrics are intrinsic, measuring bias in word embedding spaces, and some are extrinsic, measuring bias in downstream tasks that the word embeddings enable. Do these intrinsic and extrinsic metrics correlate with each other? We compare intrinsic and extrinsic metrics across hundreds of trained models covering different tasks and experimental conditions. Our results show no reliable correlation between these metrics that holds in all scenarios across tasks and languages. We urge researchers working on debiasing to focus on extrinsic measures of bias, and to make using these measures more feasible via creation of new challenge sets and annotated test data. To aid this effort, we release code, a new intrinsic metric, and an annotated test set focused on gender bias in hate speech.
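The comparison described above reduces to checking whether, across many trained models, an intrinsic bias score predicts an extrinsic one. Below is a minimal sketch of that check, assuming hypothetical per-model scores and standard correlation tests; it is illustrative only and is not the authors' released code.

```python
# Illustrative sketch (not the released code): correlate an intrinsic bias
# score with an extrinsic bias score across a set of trained models.
# Each model contributes one intrinsic score (e.g., an embedding-association
# test value) and one extrinsic score (e.g., a downstream performance gap).
# The numbers below are hypothetical placeholders.
from scipy.stats import pearsonr, spearmanr

# One entry per trained model / experimental condition (hypothetical values).
intrinsic_scores = [0.31, 0.45, 0.12, 0.58, 0.27, 0.40]
extrinsic_scores = [0.05, 0.02, 0.09, 0.04, 0.11, 0.03]

r, r_p = pearsonr(intrinsic_scores, extrinsic_scores)
rho, rho_p = spearmanr(intrinsic_scores, extrinsic_scores)
print(f"Pearson r = {r:.2f} (p = {r_p:.2f})")
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.2f})")
```

A reliable relationship would show consistently high, significant correlations across tasks, languages, and experimental conditions; the paper's finding is that no such consistent pattern holds.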