Since its introduction a decade ago, \emph{relative entropy policy search} (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains, in addition to providing algorithmic components used by many recently proposed reinforcement learning (RL) algorithms. While REPS is commonly known in the community, no guarantees exist on its performance when it is optimized with stochastic and gradient-based solvers. In this paper we aim to fill this gap by providing guarantees and convergence rates for the sub-optimality of a policy learned using first-order optimization methods applied to the REPS objective. We first consider the setting in which we are given access to exact gradients and demonstrate how near-optimality of the objective translates to near-optimality of the policy. We then consider the practical setting of stochastic gradients, and introduce a technique that uses \emph{generative} access to the underlying Markov decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.