Despite its popularity in the reinforcement learning community, a provably convergent policy gradient method for general continuous space-time stochastic control problems has been elusive. This paper closes the gap by proposing a proximal gradient algorithm for feedback controls of finite-time horizon stochastic control problems. The state dynamics are continuous-time nonlinear diffusions with controlled drift and possibly degenerate noise, and the objectives are nonconvex in the state and nonsmooth in the control. We prove under suitable conditions that the algorithm converges linearly to a stationary point of the control problem, and is stable with respect to policy updates by approximate gradient steps. The convergence result justifies the recent reinforcement learning heuristics that adding entropy regularization or a fictitious discount factor to the optimization objective accelerates the convergence of policy gradient methods. The proof exploits careful regularity estimates of backward stochastic differential equations.
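For orientation, the following is a minimal schematic of the kind of problem and update the abstract describes: a finite-horizon control problem over feedback controls of a diffusion with controlled drift, a running cost that is nonsmooth in the control, and a proximal gradient step on the feedback policy. The notation below ($b$, $\sigma$, $f$, $g$, $h$, $\phi$, $\tau$, etc.) is not fixed by the abstract and is chosen purely for illustration; it is a sketch under standard assumptions, not the paper's exact formulation.

```latex
% Schematic only; all symbols are illustrative placeholders (assumes amsmath).
\begin{align*}
  &\text{State dynamics (controlled drift, possibly degenerate noise):}\\
  &\qquad \mathrm{d}X_t = b\bigl(t, X_t, \phi(t, X_t)\bigr)\,\mathrm{d}t
      + \sigma(t, X_t)\,\mathrm{d}W_t, \qquad X_0 = x_0,\\[4pt]
  &\text{Cost over feedback controls } \phi,\ \text{split into a smooth part and a}\\
  &\text{nonsmooth-in-the-control part } h:\\
  &\qquad J(\phi) = \underbrace{\mathbb{E}\!\left[\int_0^T
        f\bigl(t, X_t, \phi(t, X_t)\bigr)\,\mathrm{d}t + g(X_T)\right]}_{J_{\mathrm{smooth}}(\phi)}
      \;+\; \mathbb{E}\!\left[\int_0^T h\bigl(\phi(t, X_t)\bigr)\,\mathrm{d}t\right],\\[4pt]
  &\text{Proximal gradient update of the feedback control, stepsize } \tau > 0:\\
  &\qquad \phi^{k+1} = \operatorname{prox}_{\tau h}\!\bigl(\phi^{k}
      - \tau\,\nabla_\phi J_{\mathrm{smooth}}(\phi^{k})\bigr).
\end{align*}
```

In this reading, the gradient of the smooth part would be represented through adjoint processes, which is where the regularity estimates for backward stochastic differential equations mentioned in the abstract would enter; the precise construction is given in the paper itself.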