Understanding the cumulative effect of multiple fairness-enhancing interventions at different stages of the machine learning (ML) pipeline is a critical and underexplored facet of the fairness literature. Such knowledge can be valuable to data scientists and ML practitioners when designing fair ML pipelines. This paper takes the first step in exploring this area by undertaking an extensive empirical study comprising 60 combinations of interventions, 9 fairness metrics, and 2 utility metrics (Accuracy and F1 Score) across 4 benchmark datasets. We quantitatively analyze the experimental data to measure the impact of multiple interventions on fairness, utility, and population groups. We found that, on aggregate, applying multiple interventions results in better fairness and lower utility than individual interventions. However, adding more interventions does not always result in better fairness or worse utility. The likelihood of achieving high performance (F1 Score) along with high fairness increases with a larger number of interventions. On the downside, we found that fairness-enhancing interventions can negatively impact different population groups, especially the privileged group. This study highlights the need for new fairness metrics that account for the impact on different population groups, apart from just the disparity between groups. Lastly, we offer a list of combinations of interventions that perform best for different fairness and utility metrics to aid the design of fair ML pipelines.