In recent years, many incidents have been reported in which machine learning models discriminated against people based on race, sex, age, and other attributes. Research has been conducted to measure and mitigate unfairness in machine learning models. In a machine learning task, it is common practice to build a pipeline consisting of an ordered set of data preprocessing stages followed by a classifier. However, most fairness research has considered prediction tasks based on a single classifier. What are the fairness impacts of the preprocessing stages in a machine learning pipeline? Furthermore, studies have shown that the root cause of unfairness is often ingrained in the data itself rather than in the model. Yet no research has measured the unfairness caused by a specific transformation made in the data preprocessing stage. In this paper, we introduce a causal method of fairness to reason about the fairness impact of data preprocessing stages in an ML pipeline. We leverage existing metrics to define fairness measures for the stages. We then conduct a detailed fairness evaluation of the preprocessing stages in 37 pipelines collected from three different sources. Our results show that certain data transformers cause the model to exhibit unfairness. We identify a number of fairness patterns in several categories of data transformers. Finally, we show how the local fairness of a preprocessing stage composes into the global fairness of the pipeline, and we use this fairness composition to choose an appropriate downstream transformer that mitigates unfairness in the machine learning pipeline.
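The idea of attributing unfairness to a single preprocessing stage can be illustrated with a small sketch. The abstract does not spell out the exact procedure, so the following is an assumption: the impact of a stage is taken to be the change in an existing fairness metric (here, statistical parity difference) when the same pipeline is run with and without that stage. All names (`impute_mean`, `threshold_classifier`, `stage_fairness_impact`) and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional, Sequence

Values = List[Optional[float]]
Stage = Callable[[Values], Values]


def statistical_parity_difference(preds: Sequence[int], protected: Sequence[int]) -> float:
    """P(pred = 1 | unprivileged) - P(pred = 1 | privileged); 0 means parity."""
    unpriv = [p for p, a in zip(preds, protected) if a == 0]
    priv = [p for p, a in zip(preds, protected) if a == 1]
    return sum(unpriv) / len(unpriv) - sum(priv) / len(priv)


def impute_mean(values: Values) -> Values:
    """Hypothetical preprocessing stage: replace missing values with the observed mean."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def threshold_classifier(value: Optional[float]) -> int:
    """Toy stand-in for a trained classifier: favorable outcome if income >= 50."""
    return 1 if value is not None and value >= 50 else 0


def run_pipeline(stages: Sequence[Stage], values: Values) -> List[int]:
    """Apply the preprocessing stages in order, then classify each record."""
    for stage in stages:
        values = stage(values)
    return [threshold_classifier(v) for v in values]


def stage_fairness_impact(stages: Sequence[Stage], index: int,
                          values: Values, protected: Sequence[int]) -> float:
    """Assumed measure: metric with the stage minus metric with the stage removed."""
    with_stage = run_pipeline(stages, values)
    without_stage = run_pipeline(list(stages[:index]) + list(stages[index + 1:]), values)
    return (statistical_parity_difference(with_stage, protected)
            - statistical_parity_difference(without_stage, protected))


# Toy data: incomes are missing more often in the unprivileged group (protected == 0).
incomes: Values = [70.0, 30.0, None, None, 80.0, 60.0, 40.0, None]
protected = [0, 0, 0, 0, 1, 1, 1, 1]

impact = stage_fairness_impact([impute_mean], 0, incomes, protected)
print(impact)  # 0.25
```

In this toy case the imputer moves the statistical parity difference from -0.25 to 0, so the stage receives an impact of 0.25, meaning it shifts predictions in favor of the unprivileged group. A real evaluation would retrain the classifier for each pipeline variant and would typically report several metrics rather than one.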