Despite the great empirical success of actor-critic methods, their finite-time convergence is still poorly understood in the most practical form of the algorithm. In particular, analyzing single-timescale actor-critic is challenging because the critic estimate is highly inaccurate and errors propagate between the actor and the critic in a complex way across iterations. Existing analyses of single-timescale actor-critic consider only i.i.d. sampling or the tabular setting for simplicity, which is rarely the case in practical applications. We consider the more practical online single-timescale actor-critic algorithm on a continuous state space, where the critic is updated with a single Markovian sample per actor step. We prove that the online single-timescale actor-critic method is guaranteed to find an $\epsilon$-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(\epsilon^{-2})$ under i.i.d. sampling. Our analysis develops a novel framework that evaluates and controls the error propagation between the actor and the critic in a systematic way. To our knowledge, this is the first finite-time analysis of the online single-timescale actor-critic method. Overall, our results compare favorably to the existing literature on actor-critic analysis in terms of covering the most practical settings and requiring weaker assumptions.
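To make the algorithmic setting concrete, the following is a minimal sketch of online single-timescale actor-critic with one Markovian sample per step. It assumes a toy one-dimensional continuous-state environment, linear value-function approximation, and a Gaussian policy; the environment, features, and step sizes are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy continuous-state MDP (illustrative only) ---
def step(s, a):
    """One Markovian transition: next state and reward."""
    s_next = 0.8 * s + a + 0.1 * rng.standard_normal()
    reward = -(s_next ** 2)          # drive the state toward 0
    return float(np.clip(s_next, -5.0, 5.0)), reward

def features(s):
    """Simple polynomial features for linear value approximation."""
    return np.array([1.0, s, s ** 2])

gamma = 0.95
alpha = 3e-3   # critic step size
beta  = 3e-3   # actor step size: same order as alpha (single timescale)

w = np.zeros(3)        # critic parameters: V_w(s) = w^T phi(s)
theta = np.zeros(2)    # actor parameters: policy mean = theta^T [1, s]
sigma = 0.5            # fixed exploration noise of the Gaussian policy

s = 0.0
for t in range(20000):
    # Sample an action from the Gaussian policy a ~ N(mu(s), sigma^2)
    mu = theta @ np.array([1.0, s])
    a = mu + sigma * rng.standard_normal()

    # Single Markovian sample shared by actor and critic at this step
    s_next, r = step(s, a)

    # Critic: TD(0) update with the current (inaccurate) estimate
    td_error = r + gamma * (w @ features(s_next)) - w @ features(s)
    w += alpha * td_error * features(s)

    # Actor: policy-gradient update using the TD error as advantage estimate
    grad_log_pi = (a - mu) / sigma ** 2 * np.array([1.0, s])
    theta += beta * td_error * grad_log_pi

    s = s_next   # continue along the same trajectory (Markovian sampling)

print("learned policy mean weights:", theta)
```

The key point illustrated is that both parameter vectors are updated once per sample with step sizes of the same order, so neither the actor nor the critic is allowed to converge before the other, which is exactly what makes the error propagation between them nontrivial to control.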