Much of the literature on optimal design of bandit algorithms is based on minimization of expected regret. It is well known that designs that are optimal over certain exponential families can achieve expected regret that grows logarithmically in the number of arm plays, at a rate governed by the Lai-Robbins lower bound. In this paper, we show that when one uses such optimized designs, the associated algorithms necessarily have the undesirable feature that the tail of the regret distribution behaves like that of a truncated Cauchy distribution. Furthermore, for $p>1$, the $p$'th moment of the regret distribution grows much faster than poly-logarithmically, in particular as a power of the number of sub-optimal arm plays. We show that optimized Thompson sampling and UCB bandit designs are also fragile, in the sense that when the problem is even slightly mis-specified, the regret can grow much faster than the conventional theory suggests. Our arguments are based on standard change-of-measure ideas, and indicate that the most likely way that regret becomes larger than expected is when the optimal arm returns below-average rewards in its first few plays, making that arm appear sub-optimal and thereby causing the algorithm to sample a truly sub-optimal arm far more often than would be optimal.
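The mechanism described above can be made concrete with a small Monte Carlo experiment. The sketch below is illustrative only and is not the optimized design analyzed in the paper: it runs a standard UCB1 index policy on a two-armed Gaussian bandit, with all parameter choices (arm means, horizon, number of replications) hypothetical and picked solely to make the heavy upper tail of the regret distribution visible; in the rare replications where the optimal arm is unlucky early, UCB1 commits to the inferior arm for a long stretch and the realized regret is far above its median.

```python
# Illustrative sketch (assumed parameters, not the paper's construction):
# UCB1 on a two-armed Gaussian bandit, showing the gap between typical
# and extreme realized regret across independent replications.
import numpy as np

rng = np.random.default_rng(0)

def run_ucb1(means, horizon, rng):
    """Play UCB1 for `horizon` steps; return the realized pseudo-regret."""
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    regret = 0.0
    best = max(means)
    for t in range(1, horizon + 1):
        if t <= k:                       # play each arm once to initialize
            arm = t - 1
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret

means = [1.0, 0.0]                       # arm 0 is optimal, gap = 1
horizon = 5000
regrets = np.array([run_ucb1(means, horizon, rng) for _ in range(2000)])

# A heavy upper tail shows up as a large gap between the median and the
# extreme quantiles of the realized regret.
print("median regret:", np.median(regrets))
print("99.9th percentile regret:", np.quantile(regrets, 0.999))
```

In such a simulation, the largest regrets are typically produced by replications in which the first few draws from the optimal arm fall well below its mean, which is exactly the change-of-measure event identified in the analysis.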