Texts generated by large pretrained language models have been shown to exhibit a variety of harmful, human-like biases against various demographic groups. These findings have prompted substantial efforts to understand and measure such effects, with the goal of providing benchmarks that can guide the development of techniques for mitigating these stereotypical associations. However, as recent research has pointed out, current benchmarks lack a robust experimental setup, which hinders the drawing of meaningful conclusions from their evaluation metrics. In this paper, we extend these arguments and demonstrate that existing techniques and benchmarks aiming to measure stereotypes tend to be inaccurate and contain a high degree of experimental noise, severely limiting the knowledge we can gain from benchmarking language models with them. Accordingly, we propose a new framework for robustly measuring and quantifying biases exhibited by generative language models. Finally, we use this framework to investigate GPT-3's occupational gender bias and propose prompting techniques that mitigate these biases without the need for fine-tuning.