Despite remarkable successes, deep reinforcement learning algorithms remain sample inefficient: they require an enormous amount of trial and error to find good policies. Model-based algorithms promise sample efficiency by building an environment model that can be used for planning. Posterior Sampling for Reinforcement Learning (PSRL) is one such model-based algorithm; it has attracted significant interest due to its performance in the tabular setting. This paper introduces Posterior Sampling for Deep Reinforcement Learning (PSDRL), the first truly scalable approximation of Posterior Sampling for Reinforcement Learning that retains its model-based essence. PSDRL combines efficient uncertainty quantification over latent state space models with a specially tailored continual planning algorithm based on value-function approximation. Extensive experiments on the Atari benchmark show that PSDRL significantly outperforms previous state-of-the-art attempts at scaling up posterior sampling while remaining competitive with a state-of-the-art model-based reinforcement learning method, in both sample efficiency and computational efficiency.
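To make the posterior sampling principle that PSDRL scales up concrete, the sketch below shows tabular PSRL with a Dirichlet posterior over transition probabilities and a Gaussian posterior over mean rewards. The class name, the simplified conjugate posteriors, and all hyperparameters are illustrative assumptions for this sketch, not the paper's implementation.

```python
import numpy as np

# Illustrative tabular PSRL sketch (assumed Dirichlet transition prior and
# Gaussian reward prior with unit pseudo-counts; NOT the PSDRL architecture).
class TabularPSRL:
    def __init__(self, n_states, n_actions, gamma=0.95):
        self.nS, self.nA, self.gamma = n_states, n_actions, gamma
        # Dirichlet(1, ..., 1) prior over next-state distributions.
        self.trans_counts = np.ones((n_states, n_actions, n_states))
        # Running sums for the posterior mean of rewards.
        self.r_sum = np.zeros((n_states, n_actions))
        self.r_count = np.ones((n_states, n_actions))  # pseudo-count prior

    def sample_mdp(self):
        """Draw one plausible MDP from the current posterior."""
        P = np.array([[np.random.dirichlet(self.trans_counts[s, a])
                       for a in range(self.nA)] for s in range(self.nS)])
        R = np.random.normal(self.r_sum / self.r_count,
                             1.0 / np.sqrt(self.r_count))
        return P, R

    def plan(self, P, R, iters=200):
        """Value iteration against the sampled MDP; returns a greedy policy."""
        Q = np.zeros((self.nS, self.nA))
        for _ in range(iters):
            V = Q.max(axis=1)
            Q = R + self.gamma * np.einsum('san,n->sa', P, V)
        return Q.argmax(axis=1)

    def update(self, s, a, r, s_next):
        """Fold one observed transition into the posterior."""
        self.trans_counts[s, a, s_next] += 1
        self.r_sum[s, a] += r
        self.r_count[s, a] += 1
```

In this tabular loop an agent would resample an MDP periodically, plan against the sample, and act greedily under the resulting policy. PSDRL, as described in the abstract, replaces the tabular posterior with uncertainty quantification over a learned latent state space model and replaces exact value iteration with a continual planning algorithm based on value-function approximation.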