We propose a novel policy gradient method for multi-agent reinforcement learning, which leverages two different variance-reduction techniques and does not require large batches over iterations. Specifically, we propose a momentum-based decentralized policy gradient tracking (MDPGT) method, where a new momentum-based variance-reduction technique is used to approximate the local policy gradient surrogate with importance sampling, and an intermediate parameter is adopted to track two consecutive policy gradient surrogates. Moreover, MDPGT provably achieves the best available sample complexity of $\mathcal{O}(N^{-1}\epsilon^{-3})$ for converging to an $\epsilon$-stationary point of the global average of $N$ local performance functions (possibly nonconcave). This outperforms the state-of-the-art sample complexity in decentralized model-free reinforcement learning, and when initialized with a single trajectory, the sample complexity matches that of existing decentralized policy gradient methods. We further validate the theoretical claims for the Gaussian policy function. When the required error tolerance $\epsilon$ is small enough, MDPGT leads to a linear speedup, a property previously established in decentralized stochastic optimization but not yet for reinforcement learning. Lastly, we provide empirical results on a multi-agent reinforcement learning benchmark environment to support our theoretical findings.
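To make the two ingredients above concrete, the following display sketches the generic per-agent recursions referred to in the abstract: a momentum-based, importance-sampling-corrected estimate of the local policy gradient surrogate, followed by a gradient-tracking and consensus step. This is a schematic sketch under assumed notation (momentum parameter $\beta$, importance weight $\omega$, doubly stochastic mixing matrix $W = [w_{ij}]$, step size $\eta$, trajectory $\tau_i^t$ of agent $i$), not the exact MDPGT recursion.
% Schematic per-agent updates (assumed notation, for illustration only):
% u_i^t : momentum-based variance-reduced estimate of the local policy gradient surrogate
% v_i^t : gradient-tracking variable aggregating two consecutive surrogates
\begin{align*}
  u_i^{t} &= \beta\,\widehat{\nabla} J_i(\theta_i^{t};\tau_i^{t})
            + (1-\beta)\Big[u_i^{t-1} + \widehat{\nabla} J_i(\theta_i^{t};\tau_i^{t})
            - \omega\big(\tau_i^{t}\,|\,\theta_i^{t-1},\theta_i^{t}\big)\,
              \widehat{\nabla} J_i(\theta_i^{t-1};\tau_i^{t})\Big],\\
  v_i^{t} &= \sum_{j=1}^{N} w_{ij}\, v_j^{t-1} + u_i^{t} - u_i^{t-1},
  \qquad
  \theta_i^{t+1} = \sum_{j=1}^{N} w_{ij}\, \theta_j^{t} + \eta\, v_i^{t}.
\end{align*}
Here the importance weight $\omega$ accounts for the fact that $\tau_i^{t}$ is sampled under the current policy parameter $\theta_i^{t}$, while the second gradient estimate is evaluated at the previous parameter $\theta_i^{t-1}$.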