In single-agent Markov decision processes, an agent can optimize its policy based on its interaction with the environment. In multi-player Markov games (MGs), however, the interaction is non-stationary due to the behaviors of the other players, so the agent has no fixed optimization objective. In this paper, we treat the evolution of player policies as a dynamical process and propose a novel learning scheme for Nash equilibrium. The core idea is to evolve a player's policy according to not just its current in-game performance, but an aggregation of its performance over history. We show that for a variety of MGs, players following our learning scheme provably converge to a point that approximates a Nash equilibrium. Combined with neural networks, we develop the \emph{empirical policy optimization} algorithm, which is implemented in a reinforcement-learning framework and runs in a distributed way, with each player optimizing its policy based only on its own observations. We use two numerical examples to validate the convergence property on small-scale MGs with $n\ge 2$ players, and a Pong example to demonstrate the potential of our algorithm on large games.
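As a rough illustration of the learning scheme described above (a sketch inferred from the abstract only; the symbols $\pi_i^{(t)}$, $V_i$, and the uniform averaging are illustrative assumptions, not the paper's exact construction), one may picture each player $i$ updating its policy against an aggregation of its historical performance rather than its instantaneous payoff:
\begin{align}
\bar V_i^{(t)}(\pi_i) &= \frac{1}{t}\sum_{s=1}^{t} V_i\!\left(\pi_i,\ \pi_{-i}^{(s)}\right), \\
\pi_i^{(t+1)} &\approx \arg\max_{\pi_i}\ \bar V_i^{(t)}(\pi_i),
\end{align}
where $\pi_{-i}^{(s)}$ denotes the other players' policies at iteration $s$. In this reading, the historical aggregation is what damps the non-stationarity introduced by the evolving opponents and makes convergence to an approximate Nash equilibrium plausible.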