加紧探索高信任程度学习 (Tightening Exploration in Upper Confidence Reinforcement Learning)

from arxiv, Appeared in Proceedings of the 27th International Conference on Machine Learning (ICML 2020). This is an improved post-proceeding version correcting minor errors

The upper confidence reinforcement learning (UCRL2) algorithm introduced in (Jaksch et al., 2010) is a popular method to perform regret minimization in unknown discrete Markov Decision Processes under the average-reward criterion. Despite its nice and generic theoretical regret guarantees, this algorithm and its variants have remained until now mostly theoretical as numerical experiments in simple environments exhibit long burn-in phases before the learning takes place. In pursuit of practical efficiency, we present UCRL3, following the lines of UCRL2, but with two key modifications: First, it uses state-of-the-art time-uniform concentration inequalities to compute confidence sets on the reward and (component-wise) transition distributions for each state-action pair. Furthermore, to tighten exploration, it uses an adaptive computation of the support of each transition distribution, which in turn enables us to revisit the extended value iteration procedure of UCRL2 to optimize over distributions with reduced support by disregarding low probability transitions, while still ensuring near-optimism. We demonstrate, through numerical experiments in standard environments, that reducing exploration this way yields a substantial numerical improvement compared to UCRL2 and its variants. On the theoretical side, these key modifications enable us to derive a regret bound for UCRL3 improving on UCRL2, that for the first time makes appear notions of local diameter and local effective support, thanks to variance-aware concentration bounds.

翻译：在(Jaksch等人,2010年)中引入的上层增强信任学习(UCRL2)算法是一种流行的方法,用于根据平均回报标准,在未知的离散马尔科夫决策流程中进行最遗憾的最小化。尽管这种算法及其变式提供了良好和通用的理论遗憾保证,但迄今为止基本上还是理论性的,因为在学习之前,在简单环境中的数值实验显示长时间的燃烧阶段,在学习之前,这种算法及其变式在理论上一直停留在理论上。为了追求实际效率,我们按照UCRL2的路线展示了UCRL3, 但它有两种关键的修改:第一,它使用最先进的时间-统一集中不平等来计算每对州行动配对的奖赏和(从构成部分到)过渡分配。此外,为了收紧探索,它采用了对每项过渡分配支持的适应性计算方法,这反过来使我们能够重新审视UCRL2的扩展值程序,以便通过不考虑低概率的转变,在确保接近乐观的环境下进行优化分配。我们通过在标准环境中的量化试验,通过这种方式减少勘探,使我们比UCR3L2和约束的直径更能更有利于。