In this paper, we study the problem of fair sequential decision making with biased linear bandit feedback. At each round, a player selects an action described by a covariate and by a sensitive attribute. The perceived reward is a linear combination of the covariates of the chosen action, but the player only observes a biased evaluation of this reward, depending on the sensitive attribute. To characterize the difficulty of this problem, we design a phased elimination algorithm that corrects the unfair evaluations, and establish upper bounds on its regret. We show that the worst-case regret is smaller than $\mathcal{O}(\kappa_*^{1/3}\log(T)^{1/3}T^{2/3})$, where $\kappa_*$ is an explicit geometrical constant characterizing the difficulty of bias estimation. We prove lower bounds on the worst-case regret for some sets of actions, showing that this rate is tight up to a possible sub-logarithmic factor. We also derive gap-dependent upper bounds on the regret, and matching lower bounds for some problem instances. Interestingly, these results reveal a transition between a regime where the problem is as difficult as its unbiased counterpart, and a regime where it can be much harder.
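As an illustrative reading of this feedback model (the notation below is a sketch and not necessarily the paper's), at round $t$ the player selects an action with covariate $x_t \in \mathbb{R}^d$ and sensitive attribute $z_t \in \{1,2\}$, and observes
\[
y_t \;=\; \langle x_t, \theta^* \rangle \;+\; \omega\,\mathbb{1}\{z_t = 2\} \;+\; \xi_t ,
\]
where $\theta^*$ is the unknown reward parameter, $\omega$ is the unknown evaluation bias attached to one sensitive group, and $\xi_t$ is a zero-mean noise term, while the reward actually collected is the unbiased quantity $\langle x_t, \theta^* \rangle$.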