Two-player zero-sum simultaneous-action games are common in video games, financial markets, war, business competition, and many other settings. We first introduce the fundamental concepts of reinforcement learning in two-player zero-sum simultaneous-action games and discuss the unique challenges this type of game poses. We then introduce two novel agents that attempt to handle these challenges using joint-action Deep Q-Networks (DQN). The first agent, the Best Response AgenT (BRAT), builds an explicit model of its opponent's policy using imitation learning and then uses this model to find the best response that exploits the opponent's strategy. The second agent, Meta-Nash DQN, builds an implicit model of its opponent's policy in order to produce a context variable that is used as part of the Q-value calculation. An explicit minimax over Q-values is used to find actions close to Nash equilibrium. We find empirically that both agents converge to Nash equilibrium in a self-play setting on simple matrix games, while also performing well in games with larger state and action spaces. These novel algorithms are evaluated against vanilla RL algorithms as well as recent state-of-the-art multi-agent and two-agent algorithms. This work combines ideas from traditional reinforcement learning, game theory, and meta-learning.
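To make the phrase "explicit minimax over Q-values" concrete, the sketch below shows the simplest pure-strategy form of that operation for a single state of a matrix game: given a table of joint-action Q-values, the agent picks the action whose worst case over opponent actions is best. This is only an illustration under our own assumptions; the function name `maximin_action` and the toy payoff matrix are ours, and the paper's actual computation may differ, for example by conditioning on the learned context variable or by solving for a mixed strategy.

```python
import numpy as np

def maximin_action(q_values: np.ndarray) -> int:
    """Pick the row player's action by an explicit minimax over joint-action Q-values.

    q_values[a, b] is the estimated return for the agent when it plays action `a`
    and the opponent simultaneously plays action `b`. The agent chooses the action
    whose worst-case value (over all opponent actions) is largest.
    """
    worst_case = q_values.min(axis=1)   # opponent minimizes over its actions b
    return int(worst_case.argmax())     # agent maximizes over its actions a

# Toy joint-action Q-table for a matching-pennies-like game (hypothetical values).
q = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])
print(maximin_action(q))  # both rows share the same worst case here, so index 0 is returned
```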