This paper proposes novel, end-to-end deep reinforcement learning algorithms for two-player zero-sum Markov games. In contrast to prior efforts that train agents to beat a fixed set of opponents, our objective is to find Nash equilibrium policies that cannot be exploited even by adversarial opponents. We propose (1) the Nash DQN algorithm, which integrates DQN with a Nash-finding subroutine applied to the joint value functions; and (2) the Nash DQN Exploiter algorithm, which additionally employs an exploiter to guide the agent's exploration. Our algorithms are practical variants of theoretical algorithms that are guaranteed to converge to Nash equilibria in the basic tabular setting. Experimental evaluation on both tabular examples and two-player Atari games demonstrates the robustness of the proposed algorithms against adversarial opponents, as well as their superior performance over existing methods.
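The paper specifies the exact Nash-finding subroutine; as an illustrative sketch only, assuming the joint Q-values at a state are available as a payoff matrix for the row (maximizing) player, the zero-sum Nash value can be obtained by linear programming and plugged into the one-step Bellman target. The function names `zero_sum_nash` and `nash_bellman_target` are hypothetical and not taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog


def zero_sum_nash(Q):
    """Nash value and row strategy of a two-player zero-sum matrix game.

    Q[i, j] is the payoff to the row (maximizing) player when the row
    player plays action i and the column (minimizing) player plays j.
    (Illustrative sketch; not the paper's exact subroutine.)
    """
    m, n = Q.shape
    # Decision variables: row mixed strategy x (m entries), then the value v.
    # Maximize v  <=>  minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every opponent column j: v <= x^T Q[:, j]  ->  -Q[:, j]^T x + v <= 0.
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Mixed-strategy probabilities sum to one.
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * m + [(None, None)]  # x >= 0, v unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1], res.x[:m]


def nash_bellman_target(reward, gamma, next_q_matrix, done):
    """One-step target using the Nash value of the next-state joint Q-matrix."""
    nash_value, _ = zero_sum_nash(next_q_matrix)
    return reward + gamma * (0.0 if done else nash_value)


if __name__ == "__main__":
    # Matching pennies: Nash value 0 with uniform strategies for both players.
    Q = np.array([[1.0, -1.0], [-1.0, 1.0]])
    value, strategy = zero_sum_nash(Q)
    print(value, strategy)                      # ~0.0, [0.5, 0.5]
    print(nash_bellman_target(0.5, 0.99, Q, done=False))
```

Using the Nash value of the next state, rather than a max over the agent's own actions, is what distinguishes this target from the standard DQN update in the zero-sum setting.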