Decentralized Actor-Critic (AC) algorithms have been widely utilized for multi-agent reinforcement learning (MARL) and have achieved remarkable success. Apart from this empirical success, however, the theoretical convergence properties of decentralized AC algorithms remain largely unexplored. Existing finite-time convergence results are derived under either a double-loop update or a two-timescale step-size rule, neither of which is commonly adopted in real implementations. In this work, we introduce a fully decentralized AC algorithm in which the actor, critic, and global reward estimator are updated in an alternating manner with step sizes of the same order; that is, we adopt a \emph{single-timescale} update. Theoretically, using linear approximation for value and reward estimation, we show that our algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ under Markovian sampling, matching the optimal complexity of double-loop implementations (here, $\tilde{\mathcal{O}}$ hides a logarithmic term). The sample complexity improves to ${\mathcal{O}}(\epsilon^{-2})$ under the i.i.d. sampling scheme. Central to establishing our complexity results is \emph{the hidden smoothness of the optimal critic variable} that we reveal. We also provide a local-action-privacy-preserving version of our algorithm and its analysis. Finally, we conduct experiments to show the superiority of our algorithm over existing decentralized AC algorithms.
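To make the single-timescale idea concrete, the sketch below alternates one reward-estimator step, one critic step, and one actor step per sample, with all three step sizes of the same order. The environment dynamics, feature map, and score function are placeholders for illustration only, not the setup analyzed in the paper.

```python
import numpy as np

# Placeholder dimensions and same-order step sizes (single-timescale update).
d_feat, d_theta, num_steps = 8, 8, 1000
alpha, beta, eta = 0.01, 0.02, 0.02   # actor, critic, reward-estimator step sizes

rng = np.random.default_rng(0)
theta = np.zeros(d_theta)   # actor parameters
w = np.zeros(d_feat)        # linear critic weights: V(s) ~ phi(s) @ w
lam = np.zeros(d_feat)      # linear reward-estimator weights: r(s) ~ phi(s) @ lam
gamma = 0.95                # discount factor

def phi(s):
    """Feature map for states (placeholder)."""
    return np.tanh(s)

s = rng.normal(size=d_feat)
for t in range(num_steps):
    # Environment interaction (placeholder dynamics and reward).
    score = rng.normal(size=d_theta)             # stand-in for the policy's score function
    s_next = 0.9 * s + 0.1 * rng.normal(size=d_feat)
    r = float(np.sin(s.sum()))

    # Reward-estimator step: track the (global) reward via linear regression.
    lam += eta * (r - phi(s) @ lam) * phi(s)

    # Critic step: TD(0) update using the estimated reward.
    td_err = (phi(s) @ lam) + gamma * (phi(s_next) @ w) - (phi(s) @ w)
    w += beta * td_err * phi(s)

    # Actor step: policy-gradient update with the TD error as advantage proxy.
    theta += alpha * td_err * score

    s = s_next
```

All three variables are advanced once per sample in a single loop, which is what distinguishes the single-timescale scheme from double-loop or two-timescale implementations.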