Classic no-regret online prediction algorithms, including variants of the Upper Confidence Bound ($\texttt{UCB}$) algorithm, $\texttt{Hedge}$, and $\texttt{EXP3}$, are inherently unfair by design. The unfairness stems from their very objective: playing the most rewarding arm as many times as possible while ignoring the less rewarding ones among the $N$ arms. In this paper, we consider a fair prediction problem in the stochastic setting with hard lower bounds on the rate of reward accrual for a set of arms. We study the problem in both the full-information and bandit feedback settings. Using queueing-theoretic techniques in conjunction with adversarial learning, we propose a new online prediction policy called $\texttt{BanditQ}$ that achieves the target reward rates while incurring a regret and a target-rate violation penalty of at most $O(T^{\frac{3}{4}}).$ In the full-information setting, the regret bound can be further improved to $O(\sqrt{T})$ when considering the average regret over the entire horizon of length $T$. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard multi-armed bandit (MAB) problem with a carefully defined sequence of rewards. The design and analysis of the $\texttt{BanditQ}$ policy involve a novel use of the potential function method in conjunction with scale-free second-order regret bounds and a new self-bounding inequality for the reward gradients, which are of independent interest.
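As a rough illustration of the queue-plus-bandit reduction described above, the following Python sketch pairs virtual deficit queues (tracking each arm's shortfall from its target reward rate) with a standard adversarial bandit learner run on queue-weighted surrogate rewards. EXP3 is used here only as an illustrative stand-in for the underlying no-regret learner, and the specific surrogate weighting, normalization, and parameter choices ($V$, $\eta$, the queue update) are assumptions made for exposition, not the exact $\texttt{BanditQ}$ update.

```python
import numpy as np

# Illustrative sketch (not the paper's exact update): virtual deficit queues
# track how far each arm is from its target reward rate, and a standard
# adversarial bandit learner (EXP3 here) is run on queue-weighted surrogate
# rewards, mirroring the black-box reduction described in the abstract.

rng = np.random.default_rng(0)

N, T = 3, 10_000
mu = np.array([0.9, 0.6, 0.4])    # unknown Bernoulli mean rewards (for simulation only)
lam = np.array([0.0, 0.2, 0.1])   # target reward rates (assumed feasible)

eta = np.sqrt(np.log(N) / T)      # EXP3 learning rate (illustrative choice)
V = np.sqrt(T)                    # reward-vs-queue trade-off parameter (assumed)

Q = np.zeros(N)                   # virtual deficit queues
w = np.zeros(N)                   # EXP3 log-weights
earned = np.zeros(N)              # cumulative reward accrued per arm

for t in range(T):
    # Sampling distribution of the bandit learner
    p = np.exp(w - w.max())
    p /= p.sum()

    arm = rng.choice(N, p=p)
    r = float(rng.random() < mu[arm])        # bandit feedback: chosen arm only
    earned[arm] += r

    # Queue-weighted surrogate reward, normalized to [0, 1] so the
    # importance-weighted EXP3 update stays well behaved (assumed form).
    surrogate = (V + Q[arm]) * r / (V + Q.max())
    w[arm] += eta * surrogate / p[arm]

    # Deficit queue update: targets arrive each round, earned reward drains.
    Q = np.maximum(Q + lam - np.eye(N)[arm] * r, 0.0)

print("empirical reward rates:", np.round(earned / T, 3))
print("target reward rates:   ", lam)
```

The design point mirrored in this sketch is that the queue lengths act as Lagrangian-like weights: an arm that falls behind its target accrues a large deficit, which inflates its surrogate reward and hence the rate at which the learner plays it.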