We design simple and optimal policies that ensure safety against heavy-tailed risk in the classical multi-armed bandit problem. We first show that some widely used policies, such as the standard Upper Confidence Bound policy and the Thompson Sampling policy, incur heavy-tailed risk; that is, the worst-case probability of incurring a linear regret decays only at a polynomial rate of $1/T$, where $T$ is the time horizon. We further show that this heavy-tailed risk exists for all "instance-dependent consistent" policies. To ensure safety against such heavy-tailed risk, for the two-armed bandit setting we provide a simple policy design that (i) is worst-case optimal for the expected regret, at order $\tilde O(\sqrt{T})$, and (ii) has a worst-case tail probability of incurring a linear regret that decays at an exponential rate $\exp(-\Omega(\sqrt{T}))$. We further prove that this exponential decay rate of the tail probability is optimal across all policies that are worst-case optimal for the expected regret. Finally, we extend the policy design and analysis to the general $K$-armed bandit setting. We provide a detailed characterization of the tail probability bound for any regret threshold under our policy design: the worst-case probability of incurring a regret larger than $x$ is upper bounded by $\exp(-\Omega(x/\sqrt{KT}))$. Numerical experiments are conducted to illustrate the theoretical findings. Our results reveal the incompatibility between consistency and light-tailed risk, while indicating that worst-case optimality of the expected regret and light-tailed risk are compatible.
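To make the heavy-tailed-risk claim concrete, below is a minimal simulation sketch (not the paper's experimental setup): it runs the standard UCB1 policy on a small-gap two-armed Bernoulli instance and reports the empirical frequency of large regret. The arm means, horizon, number of replications, and regret threshold are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch: empirical tail of UCB1's regret on a small-gap two-armed
# Bernoulli bandit. All parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def ucb1_regret(means, T):
    """Run UCB1 for T rounds on Bernoulli arms; return the pseudo-regret."""
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    # Pull each arm once to initialize.
    for a in range(K):
        sums[a] += rng.random() < means[a]
        counts[a] += 1
    for t in range(K, T):
        ucb = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
        a = int(np.argmax(ucb))
        sums[a] += rng.random() < means[a]
        counts[a] += 1
    gaps = max(means) - np.asarray(means)
    return float(np.dot(gaps, counts))  # pseudo-regret = sum of gap * pulls

means = [0.5, 0.45]   # small gap, so the policy can commit to the wrong arm
T = 2000
runs = 1000
regrets = np.array([ucb1_regret(means, T) for _ in range(runs)])

# Empirical tail: fraction of runs whose regret exceeds a near-linear threshold.
x = 0.25 * T * (max(means) - min(means))
print("mean regret:", regrets.mean())
print(f"empirical P(regret > {x:.0f}) = {np.mean(regrets > x):.4f}")
```

Under this sketch, averaging over replications recovers the familiar expected-regret behavior, while the printed tail frequency gives a rough Monte Carlo view of how often a single run suffers regret on the order of the horizon, which is the quantity the abstract's tail bounds concern.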