Understanding Unfairness in Fraud Detection through Model and Data Bias Interactions

In recent years, machine learning algorithms have become ubiquitous in high-stakes decision-making applications. Their unparalleled ability to learn patterns from data also leads them to absorb the biases embedded in that data. A biased model can then make decisions that disproportionately harm certain groups in society, for example by limiting their access to financial services. Awareness of this problem has given rise to the field of Fair ML, which focuses on studying, measuring, and mitigating unfairness in algorithmic prediction with respect to a set of protected groups (e.g., race or gender). However, the underlying causes of algorithmic unfairness remain elusive, with researchers divided between blaming the ML algorithms and blaming the data they are trained on. In this work, we maintain that algorithmic unfairness stems from interactions between models and biases in the data, rather than from the isolated contribution of either. To this end, we propose a taxonomy to characterize data bias, and we study a set of hypotheses regarding the fairness-accuracy trade-offs that fairness-blind ML algorithms exhibit under different data bias settings. On our real-world account-opening fraud use case, we find that each setting entails specific trade-offs, affecting fairness both in expected value and in variance, with the latter often going unnoticed. Moreover, we show that algorithms compare differently in terms of accuracy and fairness depending on the biases affecting the data. Finally, we note that under specific data bias conditions, simple pre-processing interventions can successfully balance group-wise error rates, while the same techniques fail in more complex settings.
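
To make "group-wise error rates" and "fairness in expected value and variance" concrete, here is a minimal Python sketch. It rests on assumptions not stated in the abstract: the error rate of interest is taken to be the false positive rate (legitimate applicants flagged as fraud), unfairness is summarized as the ratio of the lowest to the highest group FPR, and variance is measured across repeated runs. All function names are hypothetical, and the numbers in the usage note are illustrative toy values, not results from the paper.

```python
from statistics import mean, pvariance


def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN), computed over the label-0 (legitimate) cases."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not negatives:
        raise ValueError("no negative examples in this group")
    return sum(negatives) / len(negatives)


def groupwise_fpr(y_true, y_pred, groups):
    """Return a dict mapping each protected group to its FPR."""
    rates = {}
    for g in set(groups):
        idx = [i for i, x in enumerate(groups) if x == g]
        rates[g] = false_positive_rate(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return rates


def fpr_ratio(rates):
    """Assumed fairness summary: min group FPR / max group FPR (1.0 = balanced)."""
    lo, hi = min(rates.values()), max(rates.values())
    return 1.0 if hi == 0 else lo / hi


def summarize(ratios):
    """Mean and population variance of a fairness metric across runs."""
    return mean(ratios), pvariance(ratios)
```

As a usage example, with labels `[0, 0, 0, 0, 1]` for each of two groups `A` and `B`, and predictions `[1, 0, 0, 0, 1]` for `A` and `[1, 1, 0, 0, 1]` for `B`, `groupwise_fpr` gives 0.25 for `A` and 0.5 for `B`, so `fpr_ratio` is 0.5. Calling `summarize` on the ratios from several training runs (different seeds or samples) exposes the variance component, which a single-run evaluation would not show.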
