Policy Gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address a specific safety formulation, where both goals and dangers are encoded in a scalar reward signal and the learning agent is constrained to never worsen its performance, measured as the expected sum of rewards. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimates. Through a joint, adaptive selection of these meta-parameters, we obtain a policy gradient algorithm with monotonic improvement guarantees.
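To make the idea of jointly adapting the two meta-parameters concrete, below is a minimal, illustrative sketch in Python of a REINFORCE-style loop in which the step size and batch size are chosen from empirical estimates of the gradient and its variance. The toy environment, the Gaussian policy, and the specific adaptation rules (the "step" and "batch_size" updates) are hypothetical stand-ins for exposition only, not the schedules or bounds derived in the paper.

# Minimal sketch: actor-only policy gradient with jointly adaptive
# step size and batch size (illustrative heuristics, not the paper's rules).
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, horizon=20, sigma=0.5):
    """One episode in a toy 1-D linear system with Gaussian policy a ~ N(theta*s, sigma^2)."""
    s, ret, grad_log = 1.0, 0.0, 0.0
    for _ in range(horizon):
        a = theta * s + sigma * rng.standard_normal()
        ret += -(s ** 2 + 0.1 * a ** 2)                # quadratic cost as (negative) reward
        grad_log += (a - theta * s) * s / sigma ** 2   # d/dtheta log pi(a|s)
        s = 0.9 * s + a                                # linear dynamics
    return ret, grad_log

theta, batch_size = 0.0, 50
for it in range(30):
    # Batch of single-sample REINFORCE gradient estimates.
    samples = []
    for _ in range(batch_size):
        ret, grad_log = rollout(theta)
        samples.append(grad_log * ret)
    samples = np.array(samples)
    grad_hat, var_hat = samples.mean(), samples.var(ddof=1)

    # Placeholder adaptive schedules: shrink the step when the estimate is
    # noisy, grow the batch when the gradient's signal-to-noise ratio is poor.
    step = 1e-3 * grad_hat ** 2 / (grad_hat ** 2 + var_hat / batch_size + 1e-8)
    batch_size = int(np.clip(4 * var_hat / (grad_hat ** 2 + 1e-8), 20, 500))

    theta += step * grad_hat                           # gradient ascent on expected return
    print(f"iter {it:2d}  theta={theta:+.3f}  batch={batch_size}")

In this sketch the update is damped whenever the per-sample variance is large relative to the squared gradient estimate, mirroring (only qualitatively) the role the paper's variance upper bounds play in selecting a step size and batch size that preserve improvement with high probability.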