We study multi-agent reinforcement learning (MARL) in infinite-horizon discounted zero-sum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, based only on their own payoffs and the local actions they execute. The agents need not observe the opponent's actions or payoffs, and may even be oblivious to the presence of the opponent or unaware of the zero-sum structure of the underlying game, a setting also referred to as radically uncoupled in the literature on learning in games. In this paper, we develop a radically uncoupled Q-learning dynamics that is both rational and convergent: the learning dynamics converges to the best response to the opponent's strategy when the opponent follows an asymptotically stationary strategy; when both agents adopt the learning dynamics, they converge to the Nash equilibrium of the game. The key challenge in this decentralized setting is the non-stationarity of the environment from an agent's perspective, since both her own payoffs and the system evolution depend on the actions of other agents, and each agent adapts her policy simultaneously and independently. To address this issue, we develop a two-timescale learning dynamics in which each agent updates her local Q-function and value function estimates concurrently, with the latter happening at a slower timescale.
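To make the two-timescale structure concrete, the following is a minimal schematic of such coupled updates, not the paper's exact algorithm: the estimates $\hat{q}^i_t$, $\hat{v}^i_t$, the step sizes $\alpha_t$, $\beta_t$, and the local policy $\pi^i_t$ are illustrative notation, and agent $i$ is assumed to observe only the current state $s_t$, her own action $a^i_t$, and her own payoff $r^i_t$.
\[
\hat{q}^i_{t+1}(s_t, a^i_t) \;=\; \hat{q}^i_t(s_t, a^i_t) \;+\; \alpha_t \Big( r^i_t + \gamma\, \hat{v}^i_t(s_{t+1}) - \hat{q}^i_t(s_t, a^i_t) \Big),
\]
\[
\hat{v}^i_{t+1}(s_t) \;=\; \hat{v}^i_t(s_t) \;+\; \beta_t \Big( \textstyle\sum_{a^i} \pi^i_t(a^i \mid s_t)\, \hat{q}^i_{t+1}(s_t, a^i) \;-\; \hat{v}^i_t(s_t) \Big).
\]
Choosing the step sizes so that $\beta_t / \alpha_t \to 0$ places the value estimate on the slower timescale: from the perspective of the faster Q-update, $\hat{v}^i_t$ is quasi-static, which is the mechanism that mitigates the non-stationarity each agent faces. How the local policy $\pi^i_t$ itself is adapted is left unspecified in this sketch.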