We use functional mirror ascent to propose a general framework (referred to as FMA-PG) for designing policy gradient methods. The functional perspective distinguishes between a policy's functional representation (what its sufficient statistics are) and its parameterization (how these statistics are represented), and naturally results in computationally efficient off-policy updates. For simple policy parameterizations, the FMA-PG framework ensures that the optimal policy is a fixed point of the updates. It also allows us to handle complex policy parameterizations (e.g., neural networks) while guaranteeing policy improvement. Our framework unifies several PG methods and paves the way for designing sample-efficient variants of existing methods. Moreover, it recovers important implementation heuristics (e.g., using forward vs. reverse KL divergence) in a principled way. With a softmax functional representation, FMA-PG results in a variant of TRPO with additional desirable properties. It also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on MuJoCo. Via experiments on simple reinforcement learning problems, we evaluate the algorithms instantiated by FMA-PG.
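As a rough illustrative sketch (a generic mirror ascent step in policy space, not necessarily the paper's exact formulation), the update with step size $\eta$ and a Bregman divergence $D_\Phi$ induced by a mirror map $\Phi$ can be written as
$$\pi_{t+1} = \arg\max_{\pi \in \Pi} \; \langle \nabla_{\pi} J(\pi_t), \pi \rangle - \tfrac{1}{\eta}\, D_{\Phi}(\pi, \pi_t),$$
where $J$ denotes the expected return. Choosing $\Phi$ to be the negative entropy makes $D_\Phi$ a KL divergence, which is how such updates connect to KL-regularized methods in the TRPO/PPO family.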