Automatic evaluation metrics are crucial to the development of generative systems. In recent years, metrics based on pre-trained language models (PLMs), such as BERTScore, have been widely adopted across generation tasks. However, PLMs have been shown to encode a range of stereotypical societal biases, raising concerns about the fairness of PLM-based metrics. To that end, this work presents the first systematic study of social bias in PLM-based metrics. We demonstrate that popular PLM-based metrics exhibit significantly more social bias than traditional metrics across six sensitive attributes: race, gender, religion, physical appearance, age, and socioeconomic status. In-depth analysis suggests that the choice of metric paradigm (matching, regression, or generation) has a greater impact on fairness than the choice of PLM. In addition, we develop debiasing adapters that are injected into PLM layers, mitigating bias in PLM-based metrics while retaining high performance for evaluating text generation.
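To make the adapter idea concrete, the following is a minimal sketch (not the paper's released implementation) of a bottleneck adapter of the kind that can be injected into each PLM layer, so that only the small adapter weights are trained for debiasing while the original PLM parameters stay frozen. The hidden size, bottleneck width, and module name are illustrative assumptions.

```python
# Hypothetical sketch of a bottleneck adapter injected into a PLM layer.
# Only the adapter parameters would be updated during debiasing; the
# surrounding PLM weights are assumed to be frozen.
import torch
import torch.nn as nn


class DebiasAdapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen PLM representation;
        # the adapter only adds a small learned correction to it.
        return hidden_states + self.up(self.act(self.down(hidden_states)))


if __name__ == "__main__":
    adapter = DebiasAdapter()
    x = torch.randn(2, 16, 768)   # (batch, sequence length, hidden size)
    print(adapter(x).shape)       # torch.Size([2, 16, 768])
```

Because the adapter output passes through a residual connection, the metric's original evaluation behavior is largely preserved when the learned correction is small, which is consistent with the goal of reducing bias without sacrificing evaluation performance.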