This paper proposes new, end-to-end deep reinforcement learning algorithms for two-player zero-sum Markov games. Unlike prior efforts that train agents to beat a fixed set of opponents, our objective is to find Nash equilibrium policies that cannot be exploited even by adversarial opponents. We propose (a) the Nash-DQN algorithm, which integrates the deep learning techniques of single-agent DQN into the classic Nash Q-learning algorithm for solving tabular Markov games; and (b) the Nash-DQN-Exploiter algorithm, which additionally employs an exploiter to guide the exploration of the main agent. We conduct experimental evaluations on tabular examples as well as various two-player Atari games. Our empirical results demonstrate that (i) the policies found by many existing methods, including Neural Fictitious Self-Play and Policy Space Response Oracles, are prone to exploitation by adversarial opponents; and (ii) the policies output by our algorithms are robust to exploitation and thus outperform existing methods.
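To make the Nash Q-learning backup concrete, the sketch below (not the authors' code; the function name `zero_sum_nash_value` and the usage snippet are illustrative assumptions) shows the core computational step that a Nash-DQN-style method would repeat at every target computation: solve the zero-sum matrix game induced by the Q-network's outputs at the next state and back up its Nash value in place of the single-agent max.

```python
# A minimal sketch, assuming the stage game at each next state is solved by
# linear programming; this is an illustration of the Nash-value backup idea,
# not the paper's implementation.
import numpy as np
from scipy.optimize import linprog


def zero_sum_nash_value(Q):
    """Value max_x min_y x^T Q y of a zero-sum matrix game (row player maximizes).

    Q: (m, n) payoff matrix for the row player. Returns (value, row_strategy).
    """
    m, n = Q.shape
    # Decision variables z = [x_1, ..., x_m, v]; maximize v  <=>  minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Guarantee constraints: for every column j, v - x^T Q[:, j] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probability simplex: sum_i x_i = 1.
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1], res.x[:m]


if __name__ == "__main__":
    # Hypothetical usage inside a DQN-style target: q_next is the (actions_p1 x
    # actions_p2) payoff matrix predicted by the Q-network at the next state;
    # the TD target would be reward + gamma * Nash value of that matrix game.
    q_next = np.array([[1.0, -1.0], [-1.0, 1.0]])  # matching pennies
    value, strategy = zero_sum_nash_value(q_next)
    print(value, strategy)  # ~0.0 and ~[0.5, 0.5]
```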