Calibrating Explore-Exploit Trade-off for Fair Online Learning to Rank

Online learning to rank (OL2R) has attracted great research interest in recent years, thanks to its advantage of avoiding the expensive relevance labeling required in offline supervised ranking model learning. Such a solution explores the unknowns (e.g., by intentionally presenting selected results at top positions) to improve its relevance estimation. This, however, raises concerns about its ranking fairness: different groups of items might receive differential treatment during the course of OL2R. Existing fair ranking solutions usually require knowledge of result relevance or a well-performing ranker beforehand, which contradicts the setting of OL2R, and thus cannot be directly applied to guarantee fairness. In this work, we propose a general framework to achieve fairness defined by group exposure in OL2R. The key idea is to calibrate exploration and exploitation for fairness control, relevance learning, and online ranking quality. In particular, when the model is exploring a set of results for relevance feedback, we confine the exploration within a subset of random permutations, where fairness across groups is maintained while the feedback is still unbiased. Theoretically, we prove that such a strategy introduces minimum distortion in OL2R's regret to obtain fairness. Extensive empirical analysis is performed on two public learning to rank benchmark datasets to demonstrate the effectiveness of the proposed solution compared to existing fair OL2R solutions.
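
As a rough illustration of confining exploration to a fair subset of random permutations, the sketch below rejection-samples shuffles of a candidate set and keeps only those that pass a group exposure check. It is not the construction used in this work: the abstract does not specify the exposure model or how the subset is built, so the position weight `1 / log2(1 + rank)`, the `tolerance` threshold, and the function names `group_exposure`, `is_fair`, and `fair_random_permutation` are all assumptions made for this example.

```python
import math
import random


def group_exposure(ranking, groups):
    """Total exposure per group, using an ASSUMED weight 1 / log2(1 + rank)."""
    exposure = {}
    for rank, item in enumerate(ranking, start=1):
        g = groups[item]
        exposure[g] = exposure.get(g, 0.0) + 1.0 / math.log2(1 + rank)
    return exposure


def is_fair(ranking, groups, tolerance):
    """Hypothetical fairness test: the per-item average exposure of all
    groups present in the ranking differs by at most `tolerance`."""
    exposure = group_exposure(ranking, groups)
    sizes = {}
    for item in ranking:
        sizes[groups[item]] = sizes.get(groups[item], 0) + 1
    averages = [exposure[g] / sizes[g] for g in exposure]
    return max(averages) - min(averages) <= tolerance


def fair_random_permutation(items, groups, tolerance, rng, max_tries=1000):
    """Rejection sampling: draw uniform shuffles until one passes is_fair.

    Accepted permutations are uniform over the fair subset. Returns None
    if no fair permutation is found within `max_tries` draws.
    """
    candidate = list(items)
    for _ in range(max_tries):
        rng.shuffle(candidate)
        if is_fair(candidate, groups, tolerance):
            return list(candidate)
    return None


# Toy usage: four candidate results in two groups of two.
items = ["a", "b", "c", "d"]
groups = {"a": 0, "b": 0, "c": 1, "d": 1}
ranking = fair_random_permutation(items, groups, tolerance=0.2, rng=random.Random(0))
```

With this toy tolerance, only rankings in which one group occupies positions 1 and 4 and the other occupies positions 2 and 3 are accepted. Rejection sampling is used here for simplicity; it becomes inefficient when the fair subset is a small fraction of all permutations.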
