We study multi-agent reinforcement learning (MARL) in infinite-horizon discounted zero-sum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, relying only on their own payoffs and the local actions they execute. The agents need not observe the opponent's actions or payoffs, may even be oblivious to the presence of the opponent, and need not be aware of the zero-sum structure of the underlying game, a setting also referred to as radically uncoupled in the literature on learning in games. In this paper, we develop for the first time a radically uncoupled Q-learning dynamics that is both rational and convergent: the learning dynamics converges to the best response to the opponent's strategy when the opponent follows an asymptotically stationary strategy, and the value-function estimates converge to the payoffs at a Nash equilibrium when both agents adopt the dynamics. The key challenge in this decentralized setting is the non-stationarity of the learning environment from each agent's perspective, since both her own payoffs and the system evolution depend on the actions of the other agent, and each agent adapts her policy simultaneously and independently. To address this issue, we develop a two-timescale learning dynamics where each agent updates her local Q-function and value-function estimates concurrently, with the latter happening at a slower timescale.
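To make the two-timescale idea concrete, the following is a minimal single-agent sketch of the kind of update the abstract describes: a fast local Q-learning update driven only by the agent's own payoff, paired with a slower update of the value-function estimate toward the value of a smoothed best response. All names and the step-size schedules (n_states, n_actions, alpha_k, beta_k, tau) are illustrative assumptions, not the paper's exact notation or conditions.

```python
import numpy as np

def make_agent(n_states, n_actions, gamma=0.99, tau=0.05):
    """Single agent's view of a radically uncoupled two-timescale update (sketch)."""
    q = np.zeros((n_states, n_actions))   # local Q-function estimate (fast timescale)
    v = np.zeros(n_states)                # value-function estimate (slow timescale)
    counts = np.zeros(n_states, dtype=int)

    def policy(s):
        # Smoothed best response (softmax) to the agent's own Q-estimate;
        # no knowledge of the opponent's actions, payoffs, or existence is used.
        logits = q[s] / tau
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def update(s, a, r, s_next):
        counts[s] += 1
        k = counts[s]
        alpha_k = 1.0 / k ** 0.6   # faster step size for the Q-update
        beta_k = 1.0 / k           # slower step size: beta_k / alpha_k -> 0
        # Fast timescale: local Q-learning update using the agent's own payoff r.
        q[s, a] += alpha_k * (r + gamma * v[s_next] - q[s, a])
        # Slow timescale: track the value of the smoothed best response at state s.
        v[s] += beta_k * (policy(s) @ q[s] - v[s])

    return policy, update
```

The key design choice reflected here is the separation of timescales: because the value estimate moves slowly relative to the Q-update, each agent's Q-learning effectively faces a quasi-stationary target even though both agents adapt simultaneously, which is how the non-stationarity issue raised above is handled.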