A common approach for testing fairness issues in text-based classifiers is through the use of counterfactuals: does the classifier output change if a sensitive attribute in the input is changed? Existing counterfactual generation methods typically rely on wordlists or templates, producing simple counterfactuals that don't take into account grammar, context, or subtle sensitive attribute references, and could miss issues that the wordlist creators had not considered. In this paper, we introduce a task for generating counterfactuals that overcomes these shortcomings, and demonstrate how large language models (LLMs) can be leveraged to make progress on this task. We show that this LLM-based method can produce complex counterfactuals that existing methods cannot, compare the performance of various counterfactual generation methods on the Civil Comments dataset, and demonstrate their value in evaluating a toxicity classifier.
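As a point of reference, the sketch below illustrates the wordlist-based baseline that the abstract contrasts with: swap sensitive-attribute terms from a fixed list and check whether the classifier's decision flips. All names here (the term pairs, the `classify_toxicity` callable) are illustrative assumptions, not the paper's method or data.

```python
# Minimal sketch of wordlist-based counterfactual fairness testing.
# Hypothetical pairs of sensitive-attribute terms to swap.
TERM_PAIRS = [("he", "she"), ("his", "her"), ("christian", "muslim")]

def generate_counterfactuals(text: str) -> list[str]:
    """Produce simple counterfactuals by swapping wordlist terms."""
    tokens = text.lower().split()
    counterfactuals = []
    for a, b in TERM_PAIRS:
        for src, dst in ((a, b), (b, a)):
            if src in tokens:
                swapped = " ".join(dst if t == src else t for t in tokens)
                counterfactuals.append(swapped)
    return counterfactuals

def flags_fairness_issue(text: str, classify_toxicity) -> bool:
    """Return True if any counterfactual flips the classifier's decision."""
    original = classify_toxicity(text)
    return any(classify_toxicity(cf) != original
               for cf in generate_counterfactuals(text))
```

Because this baseline only matches surface tokens, it misses implicit or contextual references to sensitive attributes, which is the gap the LLM-based approach targets.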