Ranked lists are frequently used by information retrieval (IR) systems to present results believed to be relevant to the user's information need. Fairness is a relatively new but important aspect of these rankings to measure, joining a rich set of metrics that go beyond traditional accuracy or utility constructs to provide a more holistic understanding of IR system behavior. In the last few years, several metrics have been proposed to quantify the (un)fairness of rankings, particularly with respect to specific group(s) of content providers, but comparative analyses of these metrics -- particularly for IR -- are lacking. There is therefore limited guidance for deciding which fairness metrics are applicable to a specific scenario, or for assessing the extent to which the metrics agree or disagree when applied to real data. In this paper, we describe several fair ranking metrics from the existing literature in a common notation, enabling direct comparison of their assumptions, goals, and design choices; we then empirically compare them on multiple data sets covering both search and recommendation tasks.