Actor-critic-style two-time-scale algorithms are very popular in reinforcement learning and have seen great empirical success. However, their theoretical performance is not fully understood. In this paper, we characterize the global convergence of an online natural actor-critic algorithm in the tabular setting using a single trajectory. Our analysis applies to very general settings, as we only assume that the underlying Markov chain is ergodic under all policies (the so-called Recurrence assumption). We employ $\epsilon$-greedy sampling to ensure sufficient exploration. For a fixed exploration parameter $\epsilon$, we show that the natural actor-critic algorithm is $\mathcal{O}(\frac{1}{\epsilon T^{1/4}}+\epsilon)$ close to the global optimum after $T$ iterations of the algorithm. By carefully diminishing the exploration parameter $\epsilon$ as the iterations proceed, we also show convergence to the global optimum at a rate of $\mathcal{O}(1/T^{1/6})$.
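To make the setting concrete, the following is a minimal sketch of a tabular online natural actor-critic loop with $\epsilon$-greedy sampling along a single trajectory. It is an illustration only, not the paper's exact algorithm: the environment interface (`env.reset()` and `env.step(a)` returning the next state and reward), the step sizes `alpha` (critic) and `beta` (actor), and the use of a discount factor `gamma` are all assumptions made here for the sketch.

```python
import numpy as np

def natural_actor_critic(env, num_states, num_actions, T,
                         alpha=0.01, beta=0.001, epsilon=0.1, gamma=0.99):
    """Illustrative sketch: online tabular natural actor-critic with
    epsilon-greedy sampling on a single trajectory.  Step sizes, the
    discount factor, and the env interface are assumptions, not the
    paper's specification."""
    theta = np.zeros((num_states, num_actions))   # softmax policy parameters
    Q = np.zeros((num_states, num_actions))       # critic estimate of Q^pi

    def behavior_policy(s):
        # epsilon-greedy mixture: follow the softmax policy with probability
        # 1 - epsilon, explore uniformly with probability epsilon
        logits = theta[s] - theta[s].max()
        pi = np.exp(logits) / np.exp(logits).sum()
        return (1.0 - epsilon) * pi + epsilon / num_actions

    s = env.reset()
    a = np.random.choice(num_actions, p=behavior_policy(s))
    for _ in range(T):
        s_next, r = env.step(a)                   # assumed to return (state, reward)
        a_next = np.random.choice(num_actions, p=behavior_policy(s_next))

        # Critic (faster time scale): SARSA-style TD(0) update at the visited pair
        td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
        Q[s, a] += alpha * td_error

        # Actor (slower time scale): natural policy gradient step; for the
        # tabular softmax parameterization the natural gradient direction is
        # the current Q estimate at the visited state
        theta[s] += beta * Q[s]

        s, a = s_next, a_next
    return theta, Q
```

The two step sizes reflect the two-time-scale structure: the critic is updated more aggressively than the actor, so the Q-estimate tracks the slowly changing policy along the single trajectory.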