We consider the problem of control in an off-policy reinforcement learning (RL) setting. We propose a policy gradient algorithm that incorporates a smoothed functional-based gradient estimation scheme. We provide an asymptotic convergence guarantee for the proposed algorithm using the ordinary differential equation (ODE) approach. Further, we derive a non-asymptotic bound that quantifies its rate of convergence.
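To make the gradient estimation idea concrete, the following is a minimal sketch of a generic two-sided smoothed functional (SF) gradient estimator with Gaussian perturbations. It is not the paper's exact algorithm: the callable `estimate_return` is a placeholder for an off-policy estimate of the policy's return, and the names, step sizes, and perturbation counts are illustrative assumptions.

```python
import numpy as np

def sf_gradient_estimate(theta, estimate_return, beta=0.1,
                         num_perturbations=8, rng=None):
    """Two-sided smoothed functional estimate of the gradient of J(theta).

    theta             : np.ndarray, current policy parameters.
    estimate_return   : callable theta -> scalar estimate of J(theta)
                        (e.g., an off-policy return estimate; placeholder here).
    beta              : smoothing (perturbation) parameter, beta > 0.
    num_perturbations : number of independent Gaussian perturbations averaged.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(num_perturbations):
        delta = rng.standard_normal(theta.shape)         # Gaussian perturbation
        j_plus = estimate_return(theta + beta * delta)   # perturbed return (+)
        j_minus = estimate_return(theta - beta * delta)  # perturbed return (-)
        # SF estimator: delta * (J(theta + beta*delta) - J(theta - beta*delta)) / (2*beta)
        grad += delta * (j_plus - j_minus) / (2.0 * beta)
    return grad / num_perturbations

# Illustrative usage with a toy quadratic objective standing in for the return:
if __name__ == "__main__":
    J = lambda th: -np.sum((th - 1.0) ** 2)   # toy objective, maximized at th = 1
    theta = np.zeros(3)
    for _ in range(200):
        theta += 0.05 * sf_gradient_estimate(theta, J)   # gradient-ascent step
    print(theta)   # approaches [1, 1, 1]
```

In a policy gradient scheme of this kind, the estimate above would replace the likelihood-ratio gradient in the parameter update, with the perturbation parameter and step sizes chosen to satisfy the conditions used in the convergence analysis.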