Markov games model interactions among multiple players in a stochastic, dynamic environment. Each player in a Markov game maximizes its expected total discounted reward, which depends upon the policies of the other players. We formulate a class of Markov games, termed affine Markov games, where an affine reward function couples the players' actions. We introduce a novel solution concept, the soft-Bellman equilibrium, where each player is boundedly rational and chooses a soft-Bellman policy rather than a purely rational policy as in the well-known Nash equilibrium concept. We provide conditions for the existence and uniqueness of the soft-Bellman equilibrium and propose a nonlinear least squares algorithm to compute such an equilibrium in the forward problem. We then solve the inverse game problem of inferring the players' reward parameters from observed state-action trajectories via a projected gradient algorithm. Experiments in a predator-prey OpenAI Gym environment show that the reward parameters inferred by the proposed algorithm outperform those inferred by a baseline algorithm: they reduce the Kullback-Leibler divergence between the equilibrium policies and observed policies by at least two orders of magnitude.
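To make the solution concept concrete, the following is a minimal sketch of the soft-Bellman policy structure for a single player i, holding the other players' policies fixed. The notation (Q_i, V_i, r_i, the discount factor \gamma, and the expectation over the other players' actions a_{-i}) is assumed here for illustration and is not taken verbatim from the paper's formulation:

\[
\pi_i(a_i \mid s) \;=\; \frac{\exp\!\big(Q_i(s, a_i)\big)}{\sum_{a_i'} \exp\!\big(Q_i(s, a_i')\big)},
\qquad
Q_i(s, a_i) \;=\; \mathbb{E}_{a_{-i} \sim \pi_{-i}}\!\Big[ r_i(s, a_i, a_{-i}) \;+\; \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\, V_i(s') \Big],
\qquad
V_i(s) \;=\; \log \sum_{a_i} \exp Q_i(s, a_i).
\]

Under this sketch, a soft-Bellman equilibrium is a profile of policies in which every player's policy simultaneously satisfies its own soft-Bellman equations given the others' policies, in contrast to a Nash equilibrium, where each player best-responds exactly.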
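The evaluation metric reported above compares equilibrium policies against observed policies via Kullback-Leibler divergence. The sketch below shows one plausible way to compute such a metric for tabular policies; the function name, array shapes, KL direction, and averaging over states are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def mean_policy_kl(pi_est, pi_obs, eps=1e-12):
    """Average per-state KL divergence D_KL(pi_obs || pi_est).

    pi_est, pi_obs: arrays of shape (num_states, num_actions) whose rows
    are probability distributions over actions.
    """
    # Clip to avoid log(0); renormalization is omitted for simplicity.
    pi_est = np.clip(pi_est, eps, 1.0)
    pi_obs = np.clip(pi_obs, eps, 1.0)
    kl_per_state = np.sum(pi_obs * (np.log(pi_obs) - np.log(pi_est)), axis=1)
    return kl_per_state.mean()

# Example usage with two states and two actions (made-up numbers):
pi_obs = np.array([[0.7, 0.3], [0.2, 0.8]])
pi_est = np.array([[0.6, 0.4], [0.25, 0.75]])
print(mean_policy_kl(pi_est, pi_obs))
```

A reduction of this quantity by two orders of magnitude, as reported in the abstract, would mean the inferred reward parameters induce equilibrium policies roughly 100 times closer (in this divergence) to the observed behavior than those of the baseline.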