Mirror descent (MD), a well-known first-order method in constrained convex optimization, has recently been shown to be an important tool for analyzing trust-region algorithms in reinforcement learning (RL). Inspired by such theoretical analyses, we propose an efficient RL algorithm, called {\em mirror descent policy optimization} (MDPO). MDPO iteratively updates the policy by approximately solving a trust-region problem, whose objective function consists of two terms: a linearization of the standard RL objective and a proximity term that restricts two consecutive policies to be close to each other. Each update performs this approximation by taking multiple gradient steps on this objective function. We derive {\em on-policy} and {\em off-policy} variants of MDPO, while emphasizing important design choices motivated by the existing theory of MD in RL. We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms, TRPO and PPO, and show that explicitly enforcing the trust-region constraint is in fact {\em not} a necessity for the high performance gains of TRPO. We then show how the popular soft actor-critic (SAC) algorithm can be derived by slight modifications of off-policy MDPO. Overall, MDPO is derived from the MD principles, offers a unified approach to viewing a number of popular RL algorithms, and performs better than or on par with TRPO, PPO, and SAC in a number of continuous control tasks.
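To make the trust-region problem concrete, one plausible form of the on-policy update sketched above is the following (a minimal sketch, assuming a KL proximity term, a stepsize $t_k$, and the advantage function $A^{\pi_k}$ of the current policy; notation here is illustrative rather than the paper's exact statement):
\[
\pi_{k+1} \;\leftarrow\; \arg\max_{\pi \in \Pi}\; \mathbb{E}_{s \sim \rho_{\pi_k}}\Big[\, \mathbb{E}_{a \sim \pi(\cdot|s)}\big[A^{\pi_k}(s,a)\big] \;-\; \tfrac{1}{t_k}\,\mathrm{KL}\big(\pi(\cdot|s)\,\|\,\pi_k(\cdot|s)\big) \,\Big],
\]
where the $\arg\max$ is not solved exactly but approximated by taking several stochastic gradient steps on this objective at each iteration.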