Trust region methods are widely applied in single-agent reinforcement learning because they guarantee a monotonic performance improvement at every iteration. Nonetheless, when applied to multi-agent settings, this guarantee no longer holds, because an agent's payoff is also affected by other agents' adaptive behaviors. To tackle this problem, we conduct a game-theoretic analysis in the policy space and propose a multi-agent trust region learning method (MATRL) that enables trust region optimization for multi-agent learning. Specifically, MATRL finds a stable improvement direction guided by the solution concept of Nash equilibrium at the meta-game level. We derive a monotonic improvement guarantee in multi-agent settings and empirically show the local convergence of MATRL to stable fixed points in a two-player rotational differential game. To test our method, we evaluate MATRL in both discrete and continuous multi-player general-sum games, including the checker and switch grid worlds, multi-agent MuJoCo, and Atari games. Results show that MATRL significantly outperforms strong multi-agent reinforcement learning baselines.
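To illustrate the meta-game idea described above, the following is a minimal sketch, assuming each agent's meta-game is a 2x2 bimatrix game between its current policy and its trust-region-updated candidate, with payoffs estimated from rollouts. The function `nash_2x2`, the placeholder payoff matrices, and the interpretation of the equilibrium weight as a step size toward the candidate are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: solve a per-iteration 2x2 meta-game
# (rows/cols: {0: keep old policy, 1: adopt trust-region candidate}).
import numpy as np


def nash_2x2(A, B):
    """Return (p, q): a Nash equilibrium of a 2x2 bimatrix game.

    p = row player's probability of action 0, q = column player's
    probability of action 0. Pure equilibria are checked first, then the
    standard indifference-condition mixed equilibrium.
    """
    for i in range(2):
        for j in range(2):
            row_best = A[i, j] >= A[1 - i, j]   # row i best response to col j
            col_best = B[i, j] >= B[i, 1 - j]   # col j best response to row i
            if row_best and col_best:
                return float(i == 0), float(j == 0)
    # Mixed equilibrium: each player mixes to make the opponent indifferent.
    p = (B[1, 1] - B[1, 0]) / (B[0, 0] - B[0, 1] - B[1, 0] + B[1, 1])
    q = (A[1, 1] - A[0, 1]) / (A[0, 0] - A[0, 1] - A[1, 0] + A[1, 1])
    return float(np.clip(p, 0.0, 1.0)), float(np.clip(q, 0.0, 1.0))


# Placeholder meta-game payoffs (would come from rollout return estimates).
A = np.array([[1.2, 0.6], [0.5, 1.0]])   # agent 1's estimated payoffs
B = np.array([[0.4, 1.1], [1.3, 0.7]])   # agent 2's estimated payoffs
p, q = nash_2x2(A, B)

# One hypothetical reading of the "stable improvement direction": each agent
# steps toward its candidate with the weight the meta-game Nash equilibrium
# places on that candidate.
step_1, step_2 = 1.0 - p, 1.0 - q
print(f"agent 1 step toward candidate: {step_1:.2f}, agent 2: {step_2:.2f}")
```

In this sketch, a pure equilibrium on the candidate policies corresponds to taking the full trust-region step, while a mixed equilibrium yields a damped step for each agent, which is one way the Nash guidance at the meta-game level could stabilize simultaneous updates.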