Contextual bandit algorithms have become widely used for recommendation in online systems (e.g. marketplaces, music streaming, news), where they now wield substantial influence on which items get exposed to the users. This raises questions of fairness to the items -- and to the sellers, artists, and writers that benefit from this exposure. We argue that the conventional bandit formulation can lead to an undesirable and unfair winner-takes-all allocation of exposure. To remedy this problem, we propose a new bandit objective that guarantees merit-based fairness of exposure to the items while optimizing utility to the users. We formulate fairness regret and reward regret in this setting, and present algorithms for both stochastic multi-armed bandits and stochastic linear bandits. We prove that the algorithms achieve sub-linear fairness regret and reward regret. Beyond the theoretical analysis, we also provide empirical evidence that these algorithms effectively allocate exposure fairly across the arms.
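To make the contrast concrete, the sketch below simulates a two-armed stochastic bandit where, instead of always exploiting the empirically best arm (winner-takes-all), each arm is pulled with probability proportional to its estimated merit. This is only an illustrative toy, not the paper's algorithm: the identity merit function, the exploration floor of 0.05 (standing in for a proper optimism/exploration bonus), and the Bernoulli reward model are all assumptions made for the example.

```python
import random

def merit(mu, floor=0.05):
    # Assumed merit function: identity with a small positive floor so that
    # no arm's exposure collapses to zero (a crude stand-in for the
    # exploration bonus a real algorithm would use).
    return max(mu, floor)

def fair_exposure_policy(mu_hat):
    # Merit-based fairness of exposure: arm k is exposed with probability
    # proportional to its merit, rather than winner-takes-all.
    total = sum(merit(m) for m in mu_hat)
    return [merit(m) / total for m in mu_hat]

def simulate(true_means, horizon, seed=0):
    rng = random.Random(seed)
    k = len(true_means)
    counts, sums, pulls = [0] * k, [0.0] * k, [0] * k
    # Warm start: pull each arm once so empirical means are defined.
    for a in range(k):
        sums[a] += rng.random() < true_means[a]
        counts[a] += 1
        pulls[a] += 1
    for _ in range(horizon - k):
        mu_hat = [s / c for s, c in zip(sums, counts)]
        p = fair_exposure_policy(mu_hat)
        # Sample the arm from the merit-proportional distribution.
        a = rng.choices(range(k), weights=p)[0]
        sums[a] += rng.random() < true_means[a]  # Bernoulli reward
        counts[a] += 1
        pulls[a] += 1
    return pulls
```

With true means of 0.8 and 0.4, a greedy policy would concentrate nearly all exposure on the first arm, whereas the merit-proportional policy splits exposure roughly 2:1 once the estimates converge, giving the weaker arm a share of exposure commensurate with its merit.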