Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Because these updates exhibit correlated noise and biased gradients, only the asymptotic behavior of actor-critic has been characterized, by connecting the algorithm to a limiting dynamical system. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which yields a controllable bias that depends on the number of critic evaluations. As a result, we provide, for the first time, the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to that of stochastic gradient methods for non-convex problems, or slower due to the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the cases where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference learning. These rates are then corroborated on a navigation problem involving an obstacle, providing insight into the interplay between optimization and generalization in reinforcement learning.
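To make the alternating scheme concrete, the following is a minimal sketch (not the paper's exact algorithm) of an actor-critic loop in which a linear critic is fit by TD(0) and the policy-gradient step uses a truncated Monte Carlo rollout to estimate the action value. The toy one-dimensional "drive the state to the origin" environment, the Gaussian policy, the feature map, and all step sizes and horizons are illustrative assumptions.

```python
# Sketch of actor-critic with Monte Carlo rollouts in the actor step.
# Environment, features, and hyperparameters are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
GAMMA, HORIZON = 0.95, 30

def step(s, a):
    """Toy dynamics: state drifts with the action; reward penalizes distance."""
    s_next = s + 0.1 * a + 0.01 * rng.normal()
    return s_next, -s_next**2

def features(s):
    """Linear critic features phi(s), so V(s) is approximated by w^T phi(s)."""
    return np.array([1.0, s, s**2])

theta = 0.0        # policy mean parameter: a ~ N(theta * s, sigma^2)
sigma = 0.5
w = np.zeros(3)    # critic weights

for k in range(200):
    # Critic step: TD(0) along one trajectory under the current policy.
    s = rng.normal()
    for _ in range(HORIZON):
        a = theta * s + sigma * rng.normal()
        s_next, r = step(s, a)
        td_err = r + GAMMA * w @ features(s_next) - w @ features(s)
        w += 0.05 * td_err * features(s)
        s = s_next

    # Actor step: policy gradient with a truncated Monte Carlo rollout
    # estimating Q(s0, a0); the truncation length controls the bias.
    s0 = rng.normal()
    a0 = theta * s0 + sigma * rng.normal()
    q_hat, s, a, discount = 0.0, s0, a0, 1.0
    for _ in range(HORIZON):
        s, r = step(s, a)
        q_hat += discount * r
        discount *= GAMMA
        a = theta * s + sigma * rng.normal()
    advantage = q_hat - w @ features(s0)        # critic as a baseline
    score = (a0 - theta * s0) * s0 / sigma**2   # grad_theta log pi(a0 | s0)
    theta += 0.01 * advantage * score

print("learned policy gain:", theta)
```

In this sketch the rollout horizon plays the role of the number of critic evaluations in the abstract: lengthening it reduces the bias of the gradient estimate at the cost of more samples per actor update.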