The size of pretrained models is increasing, and so is their performance on a variety of NLP tasks. However, as their memorization capacity grows, they might pick up more social biases. In this work, we examine the connection between model size and gender bias (specifically, occupational gender bias). We measure bias in three masked language model families (RoBERTa, DeBERTa, and T5) in two setups: directly, using a prompt-based method, and via a downstream task (Winogender). On the one hand, we find that larger models receive higher bias scores on the former task; on the other hand, when evaluated on the latter, they make fewer gender errors. To examine these potentially conflicting results, we carefully investigate the behavior of the different models on Winogender. We find that while larger models outperform smaller ones, the probability that their mistakes are caused by gender bias is higher. Moreover, the proportion of stereotypical errors compared to anti-stereotypical ones grows with model size. Our findings highlight the potential risks that can arise from increasing model size.
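To illustrate what a prompt-based probe of occupational gender bias can look like, below is a minimal sketch using a HuggingFace fill-mask pipeline with a RoBERTa checkpoint. The prompt template and occupation list are illustrative only and are not the exact setup used in this work.

```python
# Minimal sketch of a prompt-based occupational gender bias probe.
# Assumes the HuggingFace `transformers` library; the template and
# occupation list below are hypothetical examples, not the paper's data.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")
mask = fill.tokenizer.mask_token  # "<mask>" for RoBERTa

occupations = ["nurse", "engineer", "teacher", "carpenter"]  # illustrative sample

for occ in occupations:
    prompt = f"The {occ} said that {mask} would arrive soon."
    # Restrict predictions to the two pronouns of interest
    # (leading space matches RoBERTa's tokenization).
    results = fill(prompt, targets=[" he", " she"])
    scores = {r["token_str"].strip().lower(): r["score"] for r in results}
    # A large gap between P(he) and P(she) suggests an occupational gender skew.
    print(f"{occ:>10}: P(he)={scores.get('he', 0):.3f}  P(she)={scores.get('she', 0):.3f}")
```

Running such a probe across model sizes within one family gives a simple way to compare how the pronoun-probability gap changes as models grow.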