This paper presents a constrained policy gradient algorithm. We introduce constraints for safe learning through the following steps. First, learning is slowed down (lazy learning) so that the episodic policy change can be computed with the help of the policy gradient theorem and the neural tangent kernel. This, in turn, makes it possible to evaluate the policy at arbitrary states as well. In the same spirit, learning can be guided, and safety ensured, by augmenting episode batches with states at which the desired action probabilities are prescribed. Finally, exogenous discounted sums of future rewards (returns) can be computed at these specific state-action pairs such that the policy network satisfies the constraints. Computing these returns amounts to solving a system of linear equations (equality constraints) or a constrained quadratic program (inequality and regional constraints). Simulation results suggest that adding constraints (external information) to learning can reasonably improve its speed and transparency if the constraints are appropriately selected. The efficiency of the constrained learning is demonstrated with a shallow and wide ReLU network in the Cartpole and Lunar Lander OpenAI Gym environments. The main novelty of the paper is a practical application of the neural tangent kernel in reinforcement learning.
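To make the return-computation step concrete, the following is a minimal sketch (not the paper's implementation): assuming an NTK-derived sensitivity matrix K that maps exogenous returns assigned to augmented state-action pairs to the resulting change in action preferences at the constrained pairs, the equality-constrained case reduces to a linear solve and the inequality (regional) case to a small quadratic program. The matrix K, the prescribed change delta, and the bound b below are hypothetical placeholders.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical NTK-based sensitivity matrix: K[i, j] approximates the change in the
# action preference at constrained pair i per unit return assigned to augmented pair j
# (assumed given; not the paper's exact construction).
K = np.array([[1.0, 0.2],
              [0.3, 1.1]])

# Equality constraints: find returns r with K @ r = delta, where delta is the
# prescribed change at the constrained state-action pairs.
delta = np.array([0.5, -0.2])
r_eq, *_ = np.linalg.lstsq(K, delta, rcond=None)  # exact solve when K is square and full rank

# Inequality (regional) constraints: require K @ r >= b. Choose the smallest-norm
# returns satisfying the constraints by solving a small quadratic program (SLSQP).
b = np.array([0.1, 0.0])
qp = minimize(
    fun=lambda r: 0.5 * r @ r,          # keep the exogenous returns small
    x0=np.zeros(K.shape[1]),
    jac=lambda r: r,
    constraints=[{"type": "ineq",       # SLSQP convention: fun(r) >= 0
                  "fun": lambda r: K @ r - b,
                  "jac": lambda r: K}],
    method="SLSQP",
)
r_ineq = qp.x

print("equality-constrained returns:", r_eq)
print("inequality-constrained returns:", r_ineq)
```

In this sketch, the computed returns would be attached to the augmented batch entries so that the subsequent policy gradient update nudges the network toward the prescribed action probabilities.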