Approaches to policy optimization have been motivated from diverse principles, based on how the parametric model is interpreted (e.g., value versus policy representation) or how the learning objective is formulated, yet they share a common goal of maximizing expected return. To better capture the commonalities and identify key differences between policy optimization methods, we develop a unified perspective that re-expresses the underlying updates in terms of a limited choice of gradient form and scaling function. In particular, we identify a parameterized space of approximate gradient updates for policy optimization that is highly structured, yet covers both classical and recent examples, including PPO. As a result, we obtain novel yet well-motivated updates that generalize existing algorithms in a way that can improve both convergence speed and final result quality. An experimental investigation demonstrates that the additional degrees of freedom provided by the parameterized family of updates can be leveraged to obtain non-trivial improvements both in synthetic domains and on popular deep RL benchmarks.
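To make the "gradient form and scaling function" view concrete, the following is a minimal illustrative sketch, not the paper's exact parameterization: many policy optimization updates can be written as an average of a per-sample scaling weight times the score function ∇ log π. The names `surrogate_update`, `vanilla_pg`, and `ppo_like`, and the specific functional forms of the scaling weights, are assumptions introduced here for illustration only.

```python
import numpy as np

def surrogate_update(grad_log_pi, advantages, ratios, scaling_fn):
    """Generic update direction: mean over samples of f(ratio, advantage) * grad log pi(a|s).

    grad_log_pi: (N, D) score-function gradients, one per sample.
    advantages:  (N,) estimated advantages.
    ratios:      (N,) importance ratios pi_new / pi_behavior.
    scaling_fn:  maps (ratios, advantages) -> (N,) per-sample scaling weights.
    """
    weights = scaling_fn(ratios, advantages)               # (N,)
    return (weights[:, None] * grad_log_pi).mean(axis=0)   # (D,)

# Two familiar members of such a family (illustrative scaling choices, hypothetical names):
def vanilla_pg(rho, adv):
    # REINFORCE-style scaling: weight each score by the advantage alone.
    return adv

def ppo_like(rho, adv, eps=0.2):
    # Scaling implied by the gradient of a PPO-style clipped surrogate:
    # rho * adv where the unclipped term is active, zero where clipping
    # would stop the gradient.
    unclipped = rho * adv
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * adv
    return np.where(unclipped <= clipped, unclipped, 0.0)

# Usage with random placeholder data.
rng = np.random.default_rng(0)
g = rng.normal(size=(128, 8))                    # per-sample grad log pi
adv = rng.normal(size=128)                       # advantage estimates
rho = np.exp(rng.normal(scale=0.1, size=128))    # importance ratios

print(surrogate_update(g, adv, rho, vanilla_pg))
print(surrogate_update(g, adv, rho, ppo_like))
```

Swapping `scaling_fn` is the only change between the two updates above, which is the sense in which a single parameterized family can cover both classical policy-gradient and PPO-style updates.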