Exploration is often necessary in online learning to maximize long-term reward, but it comes at the cost of short-term 'regret'. We study how this cost of exploration is shared across multiple groups. For example, in a clinical trial setting, patients who are assigned a sub-optimal treatment effectively incur the cost of exploration. When patients are associated with natural groups on the basis of, say, race or age, it is natural to ask whether the cost of exploration borne by any single group is 'fair'. So motivated, we introduce the 'grouped' bandit model. We leverage the theory of axiomatic bargaining, and the Nash bargaining solution in particular, to formalize what might constitute a fair division of the cost of exploration across groups. On the one hand, we show that any regret-optimal policy strikingly results in the least fair outcome: such policies will perversely leverage the most 'disadvantaged' groups when they can. More constructively, we derive policies that are optimally fair and simultaneously enjoy a small 'price of fairness'. We illustrate the relative merits of our algorithmic framework with a case study on contextual bandits for warfarin dosing where we are concerned with the cost of exploration across multiple races and age groups.
翻译:在网上学习以最大限度地获得长期奖励方面,探索往往是必要的,但这种探索往往以短期的“regret”为代价。我们研究这一勘探成本如何在多个群体之间分担。例如,在临床试验环境中,被分配到次优治疗的病人实际上承担了勘探成本。当病人根据种族或年龄与自然群体有联系时,自然会问任何一个群体承担的勘探成本是否为“公平”。如此积极性,我们引入了“分组”的强盗模式。我们利用不言而喻的讨价还价理论,特别是纳什讨价还价的解决方案,以正式确定什么可以构成对跨群体勘探成本的公平划分。一方面,我们表明任何遗憾的优化政策都会在最不公平的结果中产生惊人的结果:这种政策会反常地利用最“劣势”的群体,当他们能够这样做的时候。更具有建设性地说,我们所推论的政策是最佳的公平,同时享有小的“公平价格 ” 。我们用我们的算法框架的相对优点来说明我们所关注的关于跨战争的多种族和跨年龄的案例研究。