While off-policy temporal difference (TD) methods have been widely used in reinforcement learning due to their efficiency and simplicity of implementation, their Bayesian counterparts have not been utilized as frequently. One reason is that the non-linear max operation in the Bellman optimality equation makes it difficult to define conjugate distributions over the value functions. In this paper, we introduce a novel Bayesian approach to off-policy TD methods using Assumed Density Filtering (ADFQ), which updates beliefs on state-action values (Q) through an online Bayesian inference method. Uncertainty measures in the beliefs provide a natural regularization for learning, and we show how ADFQ reduces to the traditional Q-learning algorithm in a limiting case. Our empirical results demonstrate that the proposed ADFQ algorithms outperform comparable algorithms on several task domains. Moreover, our algorithms are computationally more efficient than other existing approaches to Bayesian reinforcement learning.
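To make the idea of maintaining beliefs over Q-values concrete, the following is a minimal illustrative sketch, not the paper's exact ADFQ update: it keeps an independent Gaussian belief over each tabular Q-value and refines it with a conjugate Gaussian update toward a TD-style target. The class name, the `obs_var` noise parameter, and the array sizes are assumptions introduced here for illustration; in particular, the actual ADFQ derivation handles the max over uncertain next-action beliefs via assumed density filtering (moment matching) rather than the simplified target used below.

```python
import numpy as np

class GaussianQBeliefs:
    """Hypothetical sketch: Gaussian beliefs N(mu[s, a], var[s, a]) over Q-values."""

    def __init__(self, n_states, n_actions, prior_var=100.0, obs_var=1.0, gamma=0.99):
        self.mu = np.zeros((n_states, n_actions))             # belief means over Q(s, a)
        self.var = np.full((n_states, n_actions), prior_var)  # belief variances
        self.obs_var = obs_var   # assumed noise on the TD target (a modeling choice, not from the paper)
        self.gamma = gamma

    def update(self, s, a, r, s_next, done):
        # TD-style target built from current belief means; ADFQ itself treats the
        # max over uncertain next-action beliefs more carefully via moment matching.
        target = r if done else r + self.gamma * self.mu[s_next].max()

        # Conjugate Gaussian update: precisions add, means are precision-weighted.
        prior_prec = 1.0 / self.var[s, a]
        obs_prec = 1.0 / self.obs_var
        post_prec = prior_prec + obs_prec
        self.mu[s, a] = (prior_prec * self.mu[s, a] + obs_prec * target) / post_prec
        self.var[s, a] = 1.0 / post_prec
```

Note that the effective step size `obs_prec / post_prec` shrinks as a belief tightens, so the variance acts as a built-in, state-action-specific learning-rate schedule, which is the kind of uncertainty-driven regularization the abstract refers to; with a fixed step size in place of this ratio, the mean update collapses to the familiar Q-learning rule.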