We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging deep RL continuous control task. We also provide an analysis of GHM training methods, proving a novel convergence result regarding previously proposed methods and showing how to train these models stably in deep RL settings.
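As a concrete illustration of the generalised policy improvement (GPI) step the abstract refers to, the sketch below applies GPI at a single state: act greedily with respect to the pointwise maximum of the base policies' Q-values. The array contents and function name are illustrative assumptions, not the paper's implementation, which additionally evaluates non-Markov switching policies via composed GHMs.

```python
import numpy as np

# Hypothetical Q-values q[i, a] = Q^{pi_i}(s, a) for base policy i and
# action a, all evaluated at one fixed state s (made-up numbers).
q = np.array([
    [0.2, 0.9, 0.1],  # Q-values under base policy pi_0
    [0.7, 0.3, 0.4],  # Q-values under base policy pi_1
])

def gpi_action(q_values: np.ndarray) -> int:
    """GPI at a state: argmax over actions of the max over base policies."""
    return int(np.argmax(q_values.max(axis=0)))

a = gpi_action(q)  # action 1 attains the overall maximum value (0.9)
```

The resulting Markov policy is guaranteed to perform at least as well as every base policy it improves over, which is the property the paper extends to collections of GHM-evaluated non-Markov switching policies.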