Fully decentralized learning, where the global information, i.e., the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary as all agents update their policies simultaneously. To tackle this challenge, we propose the best possible operator, a novel decentralized operator, and prove that the policies of the agents converge to the optimal joint policy if each agent independently updates its individual state-action value with this operator. Further, to make the update more efficient and practical, we simplify the operator and prove that convergence and optimality still hold with the simplified one. By instantiating the simplified operator, the derived fully decentralized algorithm, best possible Q-learning (BQL), does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.
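To make the setting concrete, the sketch below shows a fully decentralized tabular Q-learner in the spirit described above: each agent sees only its own action and updates its individual state-action value. The specific update rule here, keeping the best (largest) one-step target observed so far rather than averaging over the non-stationary transitions induced by the other agents' learning, is an illustrative assumption, not the paper's exact simplified operator, which the abstract describes only at a high level.

```python
# Hypothetical sketch (assumption, not the paper's exact BQL algorithm):
# a fully decentralized tabular agent that updates its individual Q_i(s, a_i)
# using only its own action, with a monotone "best target seen so far" rule
# as a stand-in for the simplified best possible operator.
import numpy as np


class DecentralizedAgentSketch:
    def __init__(self, n_states, n_actions, gamma=0.99, lr=0.1):
        self.Q = np.zeros((n_states, n_actions))  # individual Q_i(s, a_i)
        self.gamma = gamma
        self.lr = lr

    def act(self, state, eps=0.1):
        # Epsilon-greedy over the agent's own action space only;
        # other agents' actions are never observed.
        if np.random.rand() < eps:
            return np.random.randint(self.Q.shape[1])
        return int(np.argmax(self.Q[state]))

    def update(self, s, a, r, s_next):
        # One-step target computed from the agent's own Q-table.
        target = r + self.gamma * self.Q[s_next].max()
        # Illustrative rule: move toward the target only when it improves
        # on the current estimate, so the value tracks the best transition
        # dynamics experienced so far instead of the non-stationary mixture
        # produced by other agents updating their policies simultaneously.
        if target > self.Q[s, a]:
            self.Q[s, a] += self.lr * (target - self.Q[s, a])
```

In use, each of the n agents would hold its own `DecentralizedAgentSketch` instance and call `update` on its local transitions, with no exchange of actions or values between agents.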