Approximating optimal policies, known as policy optimization, is necessary in many real-world reinforcement learning (RL) scenarios. When RL is viewed from the perspective of variational inference (VI), the policy network is trained to approximate the posterior over actions given the optimality criteria. In practice, however, policy optimization may yield suboptimal policy estimates because of the amortization gap and insufficient exploration. In this work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in VI, we propose to integrate policy optimization with HMC. Specifically, we evolve actions sampled from the base policy according to HMC, which has two benefits: i) HMC can improve the policy distribution so that it better approximates the posterior, thereby reducing the amortization gap; ii) HMC can also steer exploration toward regions with higher action values, improving exploration efficiency. Rather than applying HMC to RL directly, we propose a new leapfrog operator to simulate the Hamiltonian dynamics. With comprehensive experiments on continuous control benchmarks, including MuJoCo and PyBullet Roboschool, we show that the proposed approach is a data-efficient and easy-to-implement improvement over previous policy optimization methods. The proposed approach also outperforms previous methods on the DeepMind Control Suite, which has a high-dimensional, image-based observation space.
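To make the idea of evolving base-policy actions with HMC concrete, the following is a minimal sketch using the standard textbook leapfrog integrator with a Metropolis correction. It is not the new leapfrog operator proposed in this work, whose details the abstract does not give. Everything here is an illustrative assumption: the toy quadratic action value `q_value`, the target density proportional to exp(Q / temperature) (which implicitly assumes a uniform action prior), the standard-normal stand-in for the base policy, and the names `leapfrog` and `hmc_refine`.

```python
import numpy as np


def q_value(a, a_star):
    """Toy action value (assumption): concave quadratic peaked at a_star."""
    return -0.5 * np.sum((a - a_star) ** 2, axis=-1)


def grad_q(a, a_star):
    """Analytic gradient of the toy Q with respect to the action."""
    return -(a - a_star)


def leapfrog(a, p, grad_log_target, step_size, n_steps):
    """Standard leapfrog for H(a, p) = -log_target(a) + 0.5 * |p|^2."""
    p = p + 0.5 * step_size * grad_log_target(a)    # initial momentum half step
    for i in range(n_steps):
        a = a + step_size * p                       # full position step
        if i < n_steps - 1:
            p = p + step_size * grad_log_target(a)  # full momentum step
    p = p + 0.5 * step_size * grad_log_target(a)    # final momentum half step
    return a, p


def hmc_refine(a0, a_star, temperature=1.0, step_size=0.1, n_steps=10, rng=None):
    """Run one HMC transition on a batch of actions drawn from a base policy.

    The target is assumed proportional to exp(Q(a) / temperature).
    """
    rng = np.random.default_rng() if rng is None else rng

    def log_target(a):
        return q_value(a, a_star) / temperature

    def grad_log_target(a):
        return grad_q(a, a_star) / temperature

    p0 = rng.standard_normal(a0.shape)              # resample momentum
    a1, p1 = leapfrog(a0, p0, grad_log_target, step_size, n_steps)

    # Metropolis accept/reject, applied independently to each action.
    h0 = -log_target(a0) + 0.5 * np.sum(p0 ** 2, axis=-1)
    h1 = -log_target(a1) + 0.5 * np.sum(p1 ** 2, axis=-1)
    accept = np.log(rng.uniform(size=np.shape(h0))) < (h0 - h1)
    return np.where(accept[..., None], a1, a0)


# Usage: actions from a stand-in base policy N(0, I) are moved toward
# the high-value region around a_star.
rng = np.random.default_rng(0)
a_star = np.array([2.0, -1.0])
base_actions = rng.standard_normal((1000, 2))
refined = hmc_refine(base_actions, a_star, rng=rng)
print(q_value(base_actions, a_star).mean(), q_value(refined, a_star).mean())
```

In this toy setting the refined actions sit closer to the mode of the assumed posterior than the base-policy samples do, which illustrates both claimed benefits in miniature: a better posterior approximation and exploration biased toward higher action values. In an actual agent the gradient of Q would come from a learned critic via automatic differentiation rather than a closed form.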