This paper studies a class of multi-agent reinforcement learning (MARL) problems where the reward that an agent receives depends on the states of other agents, but the next state depends only on the agent's own current state and action. We name it REC-MARL, standing for REward-Coupled Multi-Agent Reinforcement Learning. REC-MARL has a range of important applications such as real-time access control and distributed power control in wireless networks. This paper presents a distributed and optimal policy gradient algorithm for REC-MARL. The proposed algorithm is distributed in two aspects: (i) the learned policy is a distributed policy that maps an agent's local state to its local action, and (ii) the learning/training is distributed, during which each agent updates its policy based on its own and its neighbors' information. The learned policy is provably optimal among all local policies, and its regret bounds depend on the dimensions of the local states and actions. This distinguishes our result from most existing results on MARL, which often obtain stationary-point policies. The experimental results of our algorithm for real-time access control and power control in wireless networks show that our policy significantly outperforms state-of-the-art algorithms and well-known benchmarks.
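As a minimal illustrative sketch of the setting described above (the symbols $s_i$, $a_i$, $r_i$, $P_i$, $\pi_{\theta_i}$, and the neighborhood $\mathcal{N}(i)$ are notation introduced here for illustration and are not taken from the abstract), the REC-MARL structure can be summarized as: each agent $i$ has local transitions, a reward coupled with its neighbors' states, and a local policy,
\begin{align*}
s_i(t+1) &\sim P_i\big(\cdot \mid s_i(t),\, a_i(t)\big) && \text{(next state depends only on own state and action)}\\
r_i(t) &= r_i\big(s_i(t),\, \{s_j(t)\}_{j\in\mathcal{N}(i)}\big) && \text{(reward coupled with neighbors' states)}\\
a_i(t) &\sim \pi_{\theta_i}\big(\cdot \mid s_i(t)\big) && \text{(distributed policy: local state to local action)}
\end{align*}
whether the reward additionally depends on the local action is not specified in the abstract and is left open here.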