Partially Observable Mean Field Reinforcement Learning

From arXiv. Paper to be published at the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2021. The new version corrects some typos.

Traditional multi-agent reinforcement learning algorithms are not scalable to environments with more than a few agents, since these algorithms are exponential in the number of agents. Recent research has introduced successful methods to scale multi-agent reinforcement learning algorithms to many-agent scenarios using mean field theory. Previous work in this field assumes that an agent has access to exact cumulative metrics regarding the mean field behaviour of the system, which it can then use to take its actions. In this paper, we relax this assumption and maintain a distribution to model the uncertainty regarding the mean field of the system. We consider two different settings for this problem. In the first setting, only agents in a fixed neighbourhood are visible, while in the second setting, the visibility of agents is determined at random based on distances. For each of these settings, we introduce a Q-learning based algorithm that can learn effectively. We prove that this Q-learning estimate stays very close to the Nash Q-value (under a common set of assumptions) for the first setting. We also empirically show that our algorithms outperform multiple baselines in three different games in the MAgents framework, which supports large environments with many agents learning simultaneously to achieve possibly distinct goals.
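To make the idea of "maintaining a distribution over the mean field" concrete, here is a minimal Python sketch. It is not the paper's implementation. The Dirichlet posterior over neighbour actions, the exponential visibility rule, the tabular Q-update, and every function and parameter name below are assumptions chosen purely for illustration.

```python
import math
import random

def visible_fixed(me, others, radius):
    """First setting: actions of agents inside a fixed neighbourhood."""
    return [a for (pos, a) in others if math.dist(me, pos) <= radius]

def visible_random(me, others, scale, rng):
    """Second setting: each agent is seen with a probability that decays
    with distance. The exponential decay is an illustrative assumption."""
    return [a for (pos, a) in others
            if rng.random() < math.exp(-math.dist(me, pos) / scale)]

def dirichlet_posterior(observed_actions, n_actions, prior=1.0):
    """Concentration parameters after counting the observed actions
    (assumed Dirichlet model with a symmetric prior)."""
    alpha = [prior] * n_actions
    for a in observed_actions:
        alpha[a] += 1.0
    return alpha

def sample_mean_field(alpha, rng):
    """Draw one mean-action distribution from Dirichlet(alpha),
    using normalised gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def q_update(q, state, action, mean_field, reward, next_value,
             lr=0.1, gamma=0.95):
    """One tabular Q-learning step on Q(state, action, mean field).
    The sampled mean field is rounded so it can serve as a dict key."""
    key = (state, action, tuple(round(p, 1) for p in mean_field))
    old = q.get(key, 0.0)
    q[key] = old + lr * (reward + gamma * next_value - old)
    return q[key]

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical neighbours as (position, last action) pairs.
    others = [((0, 0), 1), ((1, 0), 1), ((5, 5), 0)]
    alpha = dirichlet_posterior(visible_fixed((0, 0), others, 2.0), n_actions=3)
    mean_field = sample_mean_field(alpha, rng)
    q_update({}, "s0", 0, mean_field, reward=1.0, next_value=0.0)
```

Sampling from the posterior, rather than using the raw empirical mean of the visible actions, lets the uncertainty caused by unseen agents influence the value estimate. With few visible neighbours the samples vary widely, and they concentrate as more agents are observed.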
