This paper studies distributed policy gradient methods for collaborative multi-agent reinforcement learning (MARL), in which agents connected over a communication network cooperate to find a policy that maximizes the average of all agents' local returns. Because the performance function in policy gradient is non-concave, existing distributed stochastic optimization methods for convex problems cannot be applied directly to policy gradient in MARL. This paper proposes a distributed policy gradient algorithm with variance reduction and gradient tracking to address the high variance of policy gradient estimates, and uses importance weights to correct for the distribution shift that arises in the sampling process. We then provide an upper bound on the mean-squared stationary gap that depends on the number of iterations, the mini-batch size, the epoch size, the problem parameters, and the network topology. We further establish the sample and communication complexity required to obtain an $\epsilon$-approximate stationary point. Numerical experiments validate the effectiveness of the proposed algorithm.
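The core update combines an importance-weighted, SVRG-style variance-reduced policy gradient estimator with gradient tracking over the communication graph. Below is a minimal sketch of that combination on a toy problem, assuming each agent holds a one-dimensional Gaussian policy with a quadratic local reward and the agents communicate over a ring graph; the setup, constants, and helper names (`pg_estimate`, `is_weight`) are illustrative assumptions, not the paper's environment or algorithm constants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumption): N agents, each with a 1-D Gaussian policy N(theta, sigma^2).
# Agent i's local return is J_i(theta) = E[-(a - c_i)^2]; the team objective,
# the average of the J_i, is maximized at theta = mean(c).
N, sigma, eta = 5, 1.0, 0.05
c = rng.uniform(-2.0, 2.0, size=N)            # hidden local reward targets

def reward(i, a):
    return -(a - c[i]) ** 2

def score(theta, a):                          # grad_theta log N(a; theta, sigma^2)
    return (a - theta) / sigma ** 2

def pg_estimate(i, theta, batch):             # REINFORCE estimate of grad J_i(theta)
    a = theta + sigma * rng.standard_normal(batch)
    return np.mean(reward(i, a) * score(theta, a))

def is_weight(a, theta_cur, theta_snap):      # p(a | theta_snap) / p(a | theta_cur)
    return np.exp(((a - theta_cur) ** 2 - (a - theta_snap) ** 2) / (2 * sigma ** 2))

# Doubly stochastic gossip matrix of a ring graph.
W = 0.5 * np.eye(N) + 0.25 * (np.roll(np.eye(N), 1, axis=0) + np.roll(np.eye(N), -1, axis=0))

theta = np.zeros(N)       # local policy parameters
y = np.zeros(N)           # gradient trackers
v_prev = np.zeros(N)      # previous local gradient estimates

epochs, epoch_len, B_ref, B = 50, 10, 200, 20
for _ in range(epochs):
    theta_snap = theta.copy()                                    # epoch snapshot
    mu = np.array([pg_estimate(i, theta_snap[i], B_ref) for i in range(N)])
    for _ in range(epoch_len):
        v = np.empty(N)
        for i in range(N):
            a = theta[i] + sigma * rng.standard_normal(B)        # sample from current policy
            g_cur = np.mean(reward(i, a) * score(theta[i], a))
            w = is_weight(a, theta[i], theta_snap[i])            # correct the distribution shift
            g_ref = np.mean(w * reward(i, a) * score(theta_snap[i], a))
            v[i] = g_cur - g_ref + mu[i]                         # variance-reduced estimator
        y = W @ y + v - v_prev                                   # gradient tracking
        theta = W @ theta + eta * y                              # consensus + ascent step
        v_prev = v

print("consensus policy mean:", theta.mean(), " optimum:", c.mean())
```

In the full algorithm, trajectories replace single actions and the importance weight becomes a product of per-step likelihood ratios; this sketch only illustrates how the variance-reduced estimator, the importance weights, and the gradient-tracking consensus step fit together.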