A Simple and Optimal Policy Design with Safety against Heavy-tailed Risk for Multi-armed Bandits

We design new policies that ensure both worst-case optimality for expected regret and light-tailed risk for the regret distribution in the stochastic multi-armed bandit problem. Recently, arXiv:2109.13595 showed that information-theoretically optimized bandit algorithms suffer from serious heavy-tailed risk; that is, the worst-case probability of incurring a linear regret decays only slowly, at a polynomial rate of $1/T$, as the time horizon $T$ increases. Inspired by their results, we further show that widely used policies (e.g., Upper Confidence Bound, Thompson Sampling) also incur heavy-tailed risk, and that this heavy-tailed risk in fact exists for all "instance-dependent consistent" policies. To ensure safety against such heavy-tailed risk, starting from the two-armed bandit setting, we provide a simple policy design that (i) is worst-case optimal for the expected regret at order $\tilde O(\sqrt{T})$ and (ii) ensures that the worst-case tail probability of incurring a linear regret decays at an optimal exponential rate $\exp(-\Omega(\sqrt{T}))$. Next, we extend the policy design and analysis to the general $K$-armed bandit setting. We provide an explicit tail probability bound for any regret threshold under our policy design. Specifically, the worst-case probability of incurring a regret larger than $x$ is upper bounded by $\exp(-\Omega(x/\sqrt{KT}))$. We also enhance the policy design to accommodate the "any-time" setting where $T$ is not known a priori, and prove performance guarantees equivalent to those in the "fixed-time" setting with known $T$. A brief set of numerical experiments is conducted to illustrate the theoretical findings. Our results reveal insights into the incompatibility between consistency and light-tailed risk, while indicating that worst-case optimality on expected regret and light-tailed risk on the regret distribution are compatible.
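To make the notion of a regret tail probability concrete, the following is a minimal sketch of how one might estimate it empirically. It runs the standard UCB1 index policy, not the policy proposed in this paper, on a two-armed Bernoulli bandit. The arm means, horizon, number of runs, and thresholds are illustrative assumptions, and the helper names `ucb1_regret` and `tail_probability` are hypothetical.

```python
import math
import random


def ucb1_regret(means, horizon, rng):
    """Run standard UCB1 once on Bernoulli arms and return the pseudo-regret."""
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull each arm once to initialise the estimates
        else:
            # UCB1 index: empirical mean plus exploration bonus
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]  # expected loss of this pull
    return regret


def tail_probability(regrets, threshold):
    """Empirical fraction of runs whose regret exceeds the threshold."""
    return sum(r > threshold for r in regrets) / len(regrets)


# Illustrative (assumed) experiment: arm means 0.5 and 0.6, T = 500, 200 runs.
rng = random.Random(0)
horizon = 500
regrets = [ucb1_regret([0.5, 0.6], horizon, rng) for _ in range(200)]

print("mean regret:", sum(regrets) / len(regrets))
for x in (10, 20, 30):
    print("P(regret >", x, ") ~", tail_probability(regrets, x))
```

A simulation this small only shows how the quantity is measured. Observing the polynomially decaying tail described above would likely require far more runs and carefully chosen instances.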
