In recent years, several metrics have been developed for evaluating the group fairness of rankings. Because these metrics were developed with different application contexts and ranking algorithms in mind, it is not straightforward to choose a metric for a given scenario. In this paper, we perform a comprehensive comparative analysis of existing group fairness metrics developed in the context of fair ranking. We argue that the diversity of these application contexts makes a direct comparison difficult. Hence, we take an axiomatic approach whereby we design a set of thirteen properties for group fairness metrics that consider different ranking settings. A metric can then be selected depending on whether it satisfies all or a subset of these properties. We apply these properties to eleven existing group fairness metrics, and through both empirical and theoretical results we demonstrate that most of these metrics satisfy only a small subset of the proposed properties. These findings highlight limitations of existing metrics and provide insights into how to evaluate and interpret different fairness metrics in practical deployments. The proposed properties can also assist practitioners in selecting appropriate metrics for evaluating fairness in a specific application.