Regret Bounds for Information-Directed Reinforcement Learning

Information-directed sampling (IDS) has revealed its potential as a data-efficient algorithm for reinforcement learning (RL). However, theoretical understanding of IDS for Markov Decision Processes (MDPs) is still limited. We develop novel information-theoretic tools to bound the information ratio and the cumulative information gain about the learning target. Our theoretical results shed light on the importance of choosing the learning target so that practitioners can balance computation against regret. As a consequence, we derive prior-free Bayesian regret bounds for vanilla-IDS, which learns the whole environment, under tabular finite-horizon MDPs. In addition, we propose a computationally efficient regularized-IDS that maximizes an additive form rather than the ratio form and show that it enjoys the same regret bound as vanilla-IDS. With the aid of rate-distortion theory, we improve the regret bound by learning a surrogate, less informative environment. Furthermore, we extend our analysis to linear MDPs and prove similar regret bounds for Thompson sampling as a by-product.

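To make the contrast between the ratio form and the additive form concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a finite set of candidate policies, each with an estimated expected regret (or value) and an estimated information gain about the learning target. The function names, the restriction to a finite candidate set, and the regularization weight `lam` are illustrative assumptions; IDS in general optimizes over randomized policies, which this sketch does not do.

```python
def vanilla_ids_choice(regrets, info_gains, eps=1e-12):
    """Ratio form (hypothetical helper): return the index of the candidate
    with the smallest information ratio regret^2 / information gain."""
    # eps guards against division by zero when a candidate yields no information.
    ratios = [r * r / max(g, eps) for r, g in zip(regrets, info_gains)]
    return min(range(len(ratios)), key=ratios.__getitem__)


def regularized_ids_choice(values, info_gains, lam):
    """Additive form (hypothetical helper): return the index of the candidate
    with the largest value + lam * information gain."""
    scores = [v + lam * g for v, g in zip(values, info_gains)]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up numbers for three candidate policies, for illustration only.
regrets = [0.5, 0.2, 0.4]
values = [0.5, 0.8, 0.6]      # here value = 1.0 - regret
info_gains = [1.0, 0.01, 0.8]

ratio_pick = vanilla_ids_choice(regrets, info_gains)
greedy_pick = regularized_ids_choice(values, info_gains, lam=0.0)
additive_pick = regularized_ids_choice(values, info_gains, lam=1.0)
```

With `lam=0.0` the additive form reduces to a greedy choice on estimated value, and larger `lam` shifts the choice toward more informative candidates. The additive objective avoids the division in the ratio, which is one way to see why it can be easier to optimize.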