Classic online prediction algorithms, such as Hedge, are inherently unfair by design: to achieve sublinear regret, they play the most rewarding arm as often as possible while ignoring the sub-optimal arms. In this paper, we consider a fair online prediction problem in the adversarial setting with hard lower bounds on the rate at which each arm must accrue rewards. By combining elementary queueing theory with online learning, we propose a new online prediction policy, called BanditQ, that meets the target rate constraints while attaining a regret of $O(T^{3/4})$ in the full-information setting. The design and analysis of BanditQ involve a novel use of the potential function method and are of independent interest.
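To make the queueing connection concrete, one natural formalization (an illustrative sketch; the paper's exact definitions may differ) tracks, for each arm $i$ with target rate $\lambda_i$, a deficit queue obeying the Lindley recursion
$$Q_i(t+1) = \Big( Q_i(t) + \lambda_i - r_i(t)\,\mathbb{1}\{I_t = i\} \Big)^+, \qquad Q_i(1) = 0,$$
where $r_i(t)$ is the reward of arm $i$ at round $t$ and $I_t$ is the arm played. Under this sketch, keeping every backlog $Q_i(T) = o(T)$ certifies that arm $i$ accrues reward at an average rate of at least $\lambda_i - o(1)$, so a policy can enforce the fairness constraints by biasing its Hedge-style sampling distribution toward arms with large backlogs.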