Mirror descent (MD), a well-known first-order method in constrained convex optimization, has recently been shown to be an important tool for analyzing trust-region algorithms in reinforcement learning (RL). However, there remains a considerable gap between such theoretically analyzed algorithms and the ones used in practice. Inspired by this, we propose an efficient RL algorithm, called {\em mirror descent policy optimization} (MDPO). MDPO iteratively updates the policy by {\em approximately} solving a trust-region problem whose objective function consists of two terms: a linearization of the standard RL objective and a proximity term that restricts two consecutive policies to be close to each other. Each update performs this approximation by taking multiple gradient steps on the objective function. We derive {\em on-policy} and {\em off-policy} variants of MDPO, while emphasizing important design choices motivated by the existing theory of MD in RL. We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms, TRPO and PPO, and show that explicitly enforcing the trust-region constraint is in fact {\em not} a necessity for the high performance gains of TRPO. We then show how the popular soft actor-critic (SAC) algorithm can be derived by slight modifications of off-policy MDPO. Overall, MDPO is derived from the MD principles, offers a unified approach to viewing a number of popular RL algorithms, and performs better than or on par with TRPO, PPO, and SAC in a number of continuous control tasks. Code is available at \url{https://github.com/manantomar/Mirror-Descent-Policy-Optimization}.
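To make the update described above concrete, the following is a minimal sketch of the per-iteration trust-region objective in the standard mirror-descent form, assuming a KL proximity term, a step-size sequence $t_k$, and an advantage estimator $A^{\theta_k}$; these symbols are notational assumptions for illustration rather than a verbatim statement of the paper's formulation:
\[
\theta_{k+1} \;\approx\; \arg\max_{\theta}\;
\mathbb{E}_{s \sim \rho_{\theta_k}}\!\Big[
\mathbb{E}_{a \sim \pi_{\theta}(\cdot\mid s)}\big[A^{\theta_k}(s,a)\big]
\;-\; \tfrac{1}{t_k}\,\mathrm{KL}\big(\pi_{\theta}(\cdot\mid s)\,\|\,\pi_{\theta_k}(\cdot\mid s)\big)
\Big],
\]
where the maximization at each iteration $k$ is carried out only {\em approximately}, by taking multiple stochastic gradient steps on the bracketed objective rather than solving it exactly or enforcing a hard trust-region constraint.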