Numerous types of social biases have been identified in pre-trained language models (PLMs), and various intrinsic bias evaluation measures have been proposed for quantifying those social biases. Prior work has relied on human-annotated examples to compare existing intrinsic bias evaluation measures. However, this approach is neither easily adaptable to different languages nor amenable to large-scale evaluation, owing to the cost and difficulty of recruiting human annotators. To overcome this limitation, we propose a method for comparing intrinsic gender bias evaluation measures without relying on human-annotated examples. Specifically, we create multiple bias-controlled versions of PLMs using varying proportions of male vs. female gendered sentences, mined automatically from an unannotated corpus using gender-related word lists. Next, each bias-controlled PLM is evaluated with an intrinsic bias evaluation measure, and we compute the rank correlation between the resulting bias scores and the gender proportions used to fine-tune the PLMs. Experiments on multiple corpora and PLMs consistently show that the correlations reported by our proposed method, which does not require human-annotated examples, are comparable to those computed using human-annotated examples in prior work.
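To make the evaluation protocol concrete, the following is a minimal sketch of its final step: correlating an intrinsic measure's bias scores with the known gender proportions used to fine-tune the bias-controlled PLMs. The proportions and scores below are placeholder values for illustration, not results from the paper; only the use of Spearman rank correlation follows the described method.

```python
# A minimal sketch of the comparison step, assuming the bias-controlled
# PLMs have already been fine-tuned and scored by an intrinsic bias
# evaluation measure. All numbers are hypothetical placeholders.
from scipy.stats import spearmanr

# Proportion of female-gendered sentences used to fine-tune each
# bias-controlled PLM (hypothetical grid of mixing ratios).
female_proportions = [0.0, 0.25, 0.5, 0.75, 1.0]

# Bias score that a given intrinsic measure assigns to each of the
# corresponding fine-tuned PLMs (placeholder values).
bias_scores = [0.82, 0.61, 0.48, 0.37, 0.21]

# A measure is judged by how well its scores track the known gender
# proportions, via Spearman rank correlation.
rho, p_value = spearmanr(female_proportions, bias_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

A measure whose scores rise or fall monotonically with the injected gender skew yields a rank correlation close to 1 in magnitude; this correlation plays the role that agreement with human-annotated examples plays in prior comparison protocols.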