A general class of surrogate functions for stable and efficient reinforcement learning

From arXiv: v2, with revisions to the writing and a new title. The previous version was titled "A functional mirror ascent view of policy gradient methods with function approximation".

Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
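For intuition about what a mirror-ascent surrogate looks like, here is a minimal sketch on a 3-armed bandit. It is not the paper's FMA-PG implementation: it assumes a tabular policy stored directly as a probability vector, known arm rewards, and a KL regularizer, and the names `kl`, `mirror_ascent_step`, `rewards` and `eta` are hypothetical. Under those assumptions the surrogate `<rewards, p> - (1/eta) * KL(p || pi)` has a closed-form maximizer, and each step cannot decrease the expected reward.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for categorical distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def mirror_ascent_step(pi, rewards, eta):
    """Maximize <rewards, p> - (1/eta) * KL(p || pi) over the simplex.

    The maximizer is proportional to pi * exp(eta * rewards).
    """
    logits = np.log(pi) + eta * rewards
    logits = logits - logits.max()  # shift for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()


# Hypothetical bandit: three arms with known mean rewards.
rewards = np.array([1.0, 0.5, 0.0])
pi = np.full(3, 1.0 / 3.0)  # start from the uniform policy

values = [float(pi @ rewards)]  # expected reward after each step
for _ in range(20):
    pi = mirror_ascent_step(pi, rewards, eta=0.5)
    values.append(float(pi @ rewards))

# KL is asymmetric, so the order of its arguments matters.
p = np.array([0.8, 0.1, 0.1])
q = np.full(3, 1.0 / 3.0)
kl_pq, kl_qp = kl(p, q), kl(q, p)
```

Here `values` is non-decreasing and `pi` concentrates on the best arm. The two numbers `kl_pq` and `kl_qp` differ, which is why swapping the arguments of the KL term gives a different surrogate.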
