The multi-armed bandit (MAB) problem is a widely studied model in reinforcement learning. This paper considers two cases of the classical MAB model: light-tailed reward distributions and heavy-tailed reward distributions. For the light-tailed (i.e., sub-Gaussian) case, we propose the UCB1-LT policy, which achieves the optimal $O(\log T)$ order of regret growth. For the heavy-tailed case, we introduce the extended robust UCB policy, an extension of the UCB policies proposed by Bubeck et al. (2013) and Lattimore (2017). Those earlier UCB policies require knowledge of an upper bound on specific moments of the reward distributions, which can be hard to acquire in some practical situations. Our extended robust UCB policy eliminates this requirement while still achieving the optimal $O(\log T)$ regret growth order, thereby broadening the applicability of UCB policies to heavy-tailed reward distributions.
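For context, the sketch below shows the classical UCB1 index policy for rewards in $[0,1]$, the baseline that UCB1-LT and the robust UCB policies build on; it is not the UCB1-LT or extended robust UCB policy from this paper, and the environment interface `pull` and all parameter names are illustrative assumptions.

```python
import numpy as np

def ucb1(pull, n_arms, horizon):
    """Classical UCB1 index policy (Auer et al., 2002).

    Assumes `pull(arm)` returns a reward in [0, 1]. This is the standard
    UCB1 baseline, not the UCB1-LT or extended robust UCB policies.
    """
    counts = np.zeros(n_arms)   # number of pulls of each arm
    means = np.zeros(n_arms)    # empirical mean reward of each arm
    total_reward = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1         # pull each arm once to initialize
        else:
            # index = empirical mean + exploration bonus sqrt(2 ln t / n_i)
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(means + bonus))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
        total_reward += r
    return total_reward

# Usage example: two Bernoulli arms with success probabilities 0.4 and 0.6.
rng = np.random.default_rng(0)
reward = ucb1(lambda a: float(rng.random() < [0.4, 0.6][a]), n_arms=2, horizon=10_000)
```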