We study the trade-off between expectation and tail risk of the regret distribution in the stochastic multi-armed bandit problem. We fully characterize the interplay among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We show how the order of the expected regret exactly affects the decay rate of the regret tail probability in both the worst-case and instance-dependent scenarios. A novel policy is proposed that characterizes the optimal regret tail probability for any regret threshold. Concretely, for any given $\alpha\in[1/2, 1)$ and $\beta\in[0, \alpha]$, our policy achieves a worst-case expected regret of $\tilde O(T^\alpha)$ (we call it $\alpha$-optimal) and an instance-dependent expected regret of $\tilde O(T^\beta)$ (we call it $\beta$-consistent), while enjoying a probability of incurring an $\tilde O(T^\delta)$ regret ($\delta\geq\alpha$ in the worst-case scenario and $\delta\geq\beta$ in the instance-dependent scenario) that decays exponentially in a polynomial of $T$. This decay rate is proved to be the best achievable. Moreover, we discover an intrinsic gap in the optimal tail rate under the instance-dependent scenario between the cases where the time horizon $T$ is known a priori and where it is not. Interestingly, in the worst-case scenario this gap disappears. Finally, we extend our proposed policy design to (1) a stochastic multi-armed bandit setting with non-stationary baseline rewards, and (2) a stochastic linear bandit setting. Our results reveal insights into the trade-off between regret expectation and regret tail risk in both worst-case and instance-dependent scenarios, indicating that more sub-optimality and inconsistency leave space for lighter-tailed risk of incurring a large regret, and that knowing the planning horizon in advance can make a difference in alleviating tail risks.
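To make the expectation/tail trade-off concrete, the sketch below simulates a simple explore-then-commit policy (not the policy proposed in the paper, merely a standard baseline) on a two-armed Bernoulli bandit. With an exploration budget of order $T^{2-2\alpha}$ pulls per arm, explore-then-commit attains worst-case expected regret of order $T^\alpha$ (here $\alpha = 2/3$), and the empirical tail of the pseudo-regret distribution can be inspected directly. All numerical choices (arm means, horizon, threshold exponent) are illustrative assumptions.

```python
import numpy as np


def explore_then_commit(T, means, n_explore, rng):
    """Explore-then-commit on a Bernoulli bandit; returns the pseudo-regret.

    Pulls each arm `n_explore` times, then commits to the empirically
    best arm for the remaining rounds.
    """
    k = len(means)
    # Exploration phase: uniform pulls of every arm.
    emp_means = np.array([
        rng.binomial(1, means[a], n_explore).mean() for a in range(k)
    ])
    best_emp = int(np.argmax(emp_means))
    # Commit phase: allocate all remaining pulls to the empirical winner.
    pulls = np.full(k, n_explore)
    pulls[best_emp] += T - k * n_explore
    # Pseudo-regret: shortfall of expected reward versus always playing
    # the truly best arm.
    return max(means) * T - float(pulls @ np.array(means))


rng = np.random.default_rng(0)
T, means = 10_000, [0.5, 0.45]          # illustrative instance
n_explore = int(T ** (2 / 3))           # budget tuned for alpha = 2/3
regrets = np.array([
    explore_then_commit(T, means, n_explore, rng) for _ in range(200)
])
# Empirical tail probability of exceeding a T^delta regret threshold.
tail = float(np.mean(regrets > T ** 0.8))
print(f"mean regret {regrets.mean():.1f}, P(regret > T^0.8) = {tail:.2f}")
```

The simulation only illustrates the shape of the trade-off: enlarging the exploration budget raises the typical (expected) regret but makes a catastrophic commitment to the wrong arm, and hence a heavy regret tail, less likely.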