There is increasing attention to evaluating the fairness of search system ranking decisions. Fair ranking metrics often consider the membership of items in particular groups, typically identified using protected attributes such as gender or ethnicity. To date, these metrics have generally assumed that protected attribute labels are available and complete for all items. However, the protected attributes of individuals are rarely available, which limits the application of fair ranking metrics in large-scale systems. To address this problem, we propose a sampling strategy and estimation technique for four fair ranking metrics. We formulate a robust and unbiased estimator that can operate even with a very limited number of labeled items. We evaluate our approach using both simulated and real-world data. Our experimental results demonstrate that our method can estimate this family of fair ranking metrics and provides a robust, reliable alternative to exhaustive or random data annotation.