An increasing awareness of biased patterns in natural language processing resources such as BERT has motivated many metrics to quantify `bias' and `fairness'. But comparing the results of different metrics, and of the works that evaluate with such metrics, remains difficult, if not outright impossible. We survey the existing literature on fairness metrics for pretrained language models and experimentally evaluate their compatibility, covering biases both in language models and in their downstream tasks. We do so through a combination of a traditional literature survey, correlation analysis, and empirical evaluations. We find that many metrics are not compatible and depend strongly on (i) templates, (ii) attribute and target seeds, and (iii) the choice of embeddings. These results indicate that fairness or bias evaluation remains challenging for contextualized language models, if not highly subjective. To improve future comparisons and fairness evaluations, we recommend avoiding embedding-based metrics and focusing on fairness evaluations in downstream tasks.