Actor-critic methods have achieved significant success in many challenging applications. However, its finite-time convergence is still poorly understood in its most practical form. Existing works on analyzing single-timescale actor-critic only focus on the i.i.d. sampling or tabular setting for simplicity. We consider the more practical online single-timescale actor-critic algorithm on continuous state space, where the critic is updated with a single Markovian sample per actor step. Existing analysis cannot conclude the convergence for such a challenging case. We prove that the online single-timescale actor-critic method is guaranteed to find an $\epsilon$-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(\epsilon^{-2})$ under the i.i.d. sampling. We develop a novel framework that evaluates and controls the error propagation between actor and critic systematically. To our knowledge, this is the first finite-time analysis for the online single-timescale actor-critic method. Our results compare favorably to the existing literature in terms of considering the most practical yet challenging settings and requiring weaker assumptions.
翻译:在很多具有挑战性的应用中, 行为者- 批评方法取得了显著的成功。 但是, 它的有限时间趋同仍然在最实际的形式上没有得到很好的理解。 分析单一时间级行为者- 批评的现有工作仅侧重于i. d. 抽样或表格设置以简单化。 我们认为, 在连续的国家空间上, 比较实用的在线单一时间级行为者- 批评算法, 批评者在持续的国家空间上以每个行为者一步的单一马尔科维亚样本进行更新。 现有的分析无法得出这样一个具有挑战性案例的趋同。 我们证明, 在线单一时间级行为者- 批评方法可以保证找到一个美元/ eplon$- 接近的固定点, 在标准假设下, 仅仅侧重于 i. d. (\\ epslon% 2} ) 或 表格设置 。 我们开发了一个新的框架, 来系统地评估和控制行为者- 批评者- 批评者之间的错误传播。 对于我们的知识来说, 这是第一次对最具有挑战性的单一时间级假设的假设的精确时间性分析, 来比较我们目前最具有挑战性的假设中最强的比较。