We introduce a new constrained optimization method for policy gradient reinforcement learning that uses two trust regions to regulate each policy update. In addition to using proximity to a single old policy as the first trust region, as in prior work, we propose to form a second trust region by constructing a virtual policy that represents a wide range of past policies. We then constrain the new policy to stay close to this virtual policy, which is beneficial when the old policy performs poorly. More importantly, we propose a mechanism that automatically builds the virtual policy from a memory buffer of past policies, providing a new capability for dynamically selecting appropriate trust regions during optimization. Our method, dubbed Memory-Constrained Policy Optimization (MCPO), is evaluated on a diverse suite of environments including robotic locomotion control, navigation with sparse rewards, and Atari games, and consistently demonstrates competitive performance against recent on-policy constrained policy gradient methods.
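To make the two-trust-region idea concrete, one schematic way to write such an update is sketched below; the symbols $\delta_1$, $\delta_2$, and the mixture weights $w_k$ are illustrative assumptions for this sketch, not MCPO's actual objective, which is defined later in the paper.
$$
\max_{\theta}\;\mathbb{E}_{(s,a)\sim \pi_{\text{old}}}\!\left[\frac{\pi_\theta(a\mid s)}{\pi_{\text{old}}(a\mid s)}\,\hat{A}(s,a)\right]
\quad \text{s.t.}\quad
D_{\mathrm{KL}}\!\left(\pi_{\text{old}}\,\|\,\pi_\theta\right)\le\delta_1,
\quad
D_{\mathrm{KL}}\!\left(\tilde{\pi}\,\|\,\pi_\theta\right)\le\delta_2,
\quad
\tilde{\pi}=\sum_{k} w_k\,\pi_k,
$$
where $\pi_{\text{old}}$ is the most recent policy, the $\pi_k$ are past policies stored in the memory buffer, and $\tilde{\pi}$ is the virtual policy they induce.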