Testing machine learning software for ethical bias has become a pressing concern. In response, recent research has proposed a plethora of new fairness metrics, for example, the dozens of fairness metrics in the IBM AIF360 toolkit. This raises the question: How can any fairness tool satisfy such a diverse range of goals? While we cannot completely simplify the task of fairness testing, we can certainly reduce the problem. This paper shows that many of those fairness metrics effectively measure the same thing. Based on experiments using seven real-world datasets, we find that (a) 26 classification metrics can be clustered into seven groups, and (b) four dataset metrics can be clustered into three groups. Further, each reduced set may actually predict different things. Hence, it is no longer necessary (or even possible) to satisfy all fairness metrics. In summary, to simplify the fairness testing problem, we recommend the following steps: (1) determine what type of fairness is desirable (and we offer a handful of such types); then (2) look up those types in our clusters; then (3) test just one item per cluster. To support that process, all our scripts (and example datasets) are available at https://github.com/Repoanonymous/Fairness\_Metrics.
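To illustrate the clustering idea behind step (3), the following is a minimal sketch (not the paper's actual pipeline) of how correlated fairness metrics can be grouped: score each metric across many model runs, then hierarchically cluster the metrics by correlation distance. The metric scores here are synthetic placeholders driven by two latent factors, so two clusters are expected.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Synthetic scores: rows = 50 model runs, columns = 6 fairness metrics.
# Metrics 0-2 track one latent factor, metrics 3-5 another.
latent_a = rng.normal(size=50)
latent_b = rng.normal(size=50)
scores = np.column_stack([
    latent_a + 0.1 * rng.normal(size=50),
    latent_a + 0.1 * rng.normal(size=50),
    latent_a + 0.1 * rng.normal(size=50),
    latent_b + 0.1 * rng.normal(size=50),
    latent_b + 0.1 * rng.normal(size=50),
    latent_b + 0.1 * rng.normal(size=50),
])

# Correlation distance between metric columns: d = 1 - |r|.
corr = np.corrcoef(scores, rowvar=False)
dist = 1.0 - np.abs(corr)

# Condense the symmetric distance matrix, cluster with average linkage,
# and cut the dendrogram into two groups.
condensed = squareform(dist, checks=False)
labels = fcluster(linkage(condensed, method="average"),
                  t=2, criterion="maxclust")
print(labels)  # metrics 0-2 share one label, metrics 3-5 the other
```

Once such groups are identified, testing a single representative metric per group covers the cluster, which is the simplification the abstract recommends.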