Directed exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, addresses this challenge by augmenting regret with information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions that prohibit its use in many practical settings. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal one, which, under suitable conditions, can be computed in closed form via the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm, STEERING: \textbf{STE}in information dir\textbf{E}cted exploration for model-based \textbf{R}einforcement Learn\textbf{ING}. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING achieves sublinear Bayesian regret, improving upon the prior learning rates of information-augmented MBRL, IDS included. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.
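For context, the closed-form computability alluded to above mirrors the standard continuous-domain kernelized Stein discrepancy, which depends on the target distribution only through its score function and can therefore be estimated from samples without normalizing constants. The sketch below records this textbook form as a point of reference; it is not the discrete conditional variant developed in this work. For a positive-definite kernel $k$ and score $s_p(x) = \nabla_x \log p(x)$,
\[
\mathrm{KSD}^2(q \,\|\, p) \;=\; \mathbb{E}_{x, x' \sim q}\!\left[\, u_p(x, x') \,\right],
\]
where
\[
u_p(x, x') \;=\; s_p(x)^{\top} k(x, x')\, s_p(x')
\;+\; s_p(x)^{\top} \nabla_{x'} k(x, x')
\;+\; \nabla_{x} k(x, x')^{\top} s_p(x')
\;+\; \operatorname{tr}\!\big( \nabla_{x} \nabla_{x'} k(x, x') \big).
\]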