Previous work on the fairness of toxic language classifiers compares model outputs across different identity terms used as input features, but does not consider the impact of other important concepts present in the context. Here, besides identity terms, we take into account high-level latent features learned by the classifier and investigate the interaction between these features and identity terms. For a multi-class toxic language classifier, we leverage a concept-based explanation framework to calculate the model's sensitivity to the concept of sentiment, which has previously been used as a salient feature for toxic language detection. Our results show that although the classifier has learned the sentiment information as expected for some classes, this information is outweighed by the influence of identity terms as input features. This work is a step towards evaluating procedural fairness, where unfair processes lead to unfair outcomes. The produced knowledge can guide debiasing techniques to ensure that important concepts besides identity terms are well-represented in training datasets.
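The abstract does not name the concept-based explanation framework, but the sensitivity computation it describes can be sketched with a TCAV-style recipe: learn a concept activation vector (CAV) that separates "sentiment" examples from random counterexamples in a layer's activation space, then measure how often the directional derivative of a class logit along the CAV is positive. Everything below is an illustrative toy, not the paper's actual setup: the activation dimensionality, the synthetic activations, the mean-difference CAV, and the `tanh` classifier head are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # activation dimensionality (illustrative assumption)

# Synthetic layer activations: "positive sentiment" concept examples are
# shifted along one fixed direction relative to random counterexamples.
concept_dir = np.zeros(d)
concept_dir[0] = 1.0
concept_acts = rng.normal(size=(100, d)) + 2.0 * concept_dir
random_acts = rng.normal(size=(100, d))

# Concept Activation Vector (CAV): the normal of a linear separator between
# the two sets. A mean-difference direction stands in for a trained linear
# classifier here, purely to keep the sketch self-contained.
cav = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
cav /= np.linalg.norm(cav)

# Toy classifier head: logit_k(a) = W[k] @ tanh(a), so the gradient of the
# class-k logit with respect to the activation a is W[k] * (1 - tanh(a)**2).
W = rng.normal(size=(3, d))   # 3 toxicity classes (illustrative assumption)
W[1] += 3.0 * concept_dir     # make class 1 sensitive to the concept

def tcav_score(class_idx: int, acts: np.ndarray) -> float:
    """Fraction of inputs whose class-idx logit increases along the CAV."""
    grads = W[class_idx] * (1.0 - np.tanh(acts) ** 2)
    return float(np.mean(grads @ cav > 0.0))

test_acts = rng.normal(size=(50, d))
score = tcav_score(1, test_acts)  # in [0, 1]; ~0.5 means no systematic sensitivity
```

A score far from 0.5 indicates the class logit systematically moves with the concept direction; comparing such scores against the influence of identity terms is the kind of interaction analysis the abstract describes.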