Actor-critic algorithms are widely used in reinforcement learning, but they are challenging to analyze mathematically due to the online arrival of non-i.i.d. data samples. The distribution of the data samples dynamically changes as the model is updated, introducing a complex feedback loop between the data distribution and the reinforcement learning algorithm. We prove that, under a time rescaling, the online actor-critic algorithm with tabular parametrization converges to a system of ordinary differential equations (ODEs) as the number of updates becomes large. The proof first establishes the geometric ergodicity of the data samples under a fixed actor policy. Then, using a Poisson equation, we prove that the fluctuations of the data samples around a dynamic probability measure, which is a function of the evolving actor model, vanish as the number of updates becomes large. Once the ODE limit has been derived, we study its convergence properties using a two-time-scale analysis that asymptotically decouples the critic ODE from the actor ODE. We prove convergence of the critic to the solution of the Bellman equation and of the actor to the optimal policy. In addition, a convergence rate to this global optimum is established. Our convergence analysis holds under specific choices of the learning rates and exploration rates in the actor-critic algorithm, which may provide guidance for the implementation of actor-critic algorithms in practice.
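To make the algorithm class concrete, the sketch below illustrates an online tabular actor-critic update with a softmax actor, a TD(0)-style tabular critic, and two-time-scale learning rates together with a decaying exploration rate. This is only a rough illustration of the update structure discussed above, not the paper's exact algorithm: the environment interface (`env.reset()`, `env.step()`), the specific schedules `alpha_critic`, `alpha_actor`, and `eps`, and the epsilon-mixed exploration are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def tabular_actor_critic(env, n_states, n_actions, n_steps=100_000, gamma=0.99):
    """Illustrative online actor-critic with tabular parametrization.

    `env` is a hypothetical environment with `reset() -> state` and
    `step(action) -> (next_state, reward, done)`.
    """
    theta = np.zeros((n_states, n_actions))  # actor parameters (tabular softmax)
    V = np.zeros(n_states)                   # critic: tabular value estimates

    s = env.reset()
    for k in range(1, n_steps + 1):
        # Two-time-scale step sizes: the critic is updated on a faster time
        # scale than the actor; the exploration rate decays slowly.
        # These particular exponents are assumptions for the sketch.
        alpha_critic = 1.0 / k**0.6
        alpha_actor = 1.0 / k**0.9
        eps = 1.0 / k**0.25

        # Sample an action from an epsilon-mixture of the softmax policy
        # and the uniform policy (a simple form of exploration).
        pi = (1.0 - eps) * softmax(theta[s]) + eps / n_actions
        a = np.random.choice(n_actions, p=pi)

        s_next, r, done = env.step(a)

        # TD(0) error drives both the critic and the actor updates.
        delta = r + gamma * V[s_next] * (not done) - V[s]
        V[s] += alpha_critic * delta

        # Policy-gradient step for the softmax component of the actor.
        grad_log_pi = -softmax(theta[s])
        grad_log_pi[a] += 1.0
        theta[s] += alpha_actor * delta * grad_log_pi

        s = env.reset() if done else s_next

    return theta, V
```

Under the time rescaling studied in the paper, the trajectory of such iterates (critic values and actor parameters) is the object shown to converge to the limiting system of ODEs, with the separation between `alpha_critic` and `alpha_actor` reflecting the two-time-scale structure used in the analysis.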