Cooperative multi-agent reinforcement learning is a decentralized paradigm in sequential decision making where agents distributed over a network iteratively collaborate with neighbors to maximize global (network-wide) notions of rewards. Exact computations typically involve a complexity that scales exponentially with the number of agents. To address this curse of dimensionality, we design a scalable algorithm based on the Natural Policy Gradient framework that uses local information and only requires agents to communicate with neighbors within a certain range. Under standard assumptions on the spatial decay of correlations for the transition dynamics of the underlying Markov process and the localized learning policy, we show that our algorithm converges to the globally optimal policy with a dimension-free statistical and computational complexity, incurring a localization error that does not depend on the number of agents and converges to zero exponentially fast as a function of the range of communication.
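As a rough illustration (a sketch under assumed notation, not the exact algorithm analyzed here), a localized Natural Policy Gradient update for agent $i$ with communication range $\kappa$ could take the form
\[
\theta_i^{(t+1)} = \theta_i^{(t)} + \eta\, F_i\big(\theta^{(t)}\big)^{\dagger}\, \widehat{\nabla}_{\theta_i} V\big(\theta^{(t)}\big),
\qquad
\widehat{\nabla}_{\theta_i} V(\theta) = \mathbb{E}\Big[ \nabla_{\theta_i} \log \pi^{i}_{\theta_i}\big(a_i \mid s_{N_i^{\kappa}}\big)\, \widehat{A}_i^{\kappa}\big(s_{N_i^{\kappa}}, a_{N_i^{\kappa}}\big) \Big],
\]
where $N_i^{\kappa}$ denotes the set of agents within graph distance $\kappa$ of agent $i$, $\eta$ is a step size, $F_i$ is a local Fisher information matrix, and $\widehat{A}_i^{\kappa}$ is an advantage estimate built only from the states, actions, and rewards of agents in $N_i^{\kappa}$; the assumed spatial decay of correlations is what controls the error incurred by this $\kappa$-hop truncation. The symbols $N_i^{\kappa}$, $\eta$, $F_i$, and $\widehat{A}_i^{\kappa}$ are illustrative notation rather than notation taken from the abstract.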