We study reinforcement learning for two-player zero-sum Markov games with simultaneous moves in the finite-horizon setting, where the transition kernel of the underlying Markov game can be parameterized as a linear function of the current state, both players' actions, and the next state. In particular, we assume that we can control both players and aim to find the Nash equilibrium by minimizing the duality gap. We propose an algorithm, Nash-UCRL-VTR, based on the principle of "Optimism-in-the-Face-of-Uncertainty". Our algorithm only needs to find a Coarse Correlated Equilibrium (CCE), which can be computed very efficiently. Specifically, we show that Nash-UCRL-VTR provably achieves an $\tilde{O}(dH\sqrt{T})$ regret, where $d$ is the dimension of the linear function class, $H$ is the length of the game, and $T$ is the total number of steps in the game. To assess the optimality of our algorithm, we also prove an $\tilde{\Omega}(dH\sqrt{T})$ lower bound on the regret. Our upper bound matches the lower bound up to logarithmic factors, which suggests the optimality of our algorithm.
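As a point of reference, the regret in this setting is commonly formalized as the cumulative duality gap of the policy pairs produced over the episodes; the abstract does not fix the notation, so the symbols below ($\mu^k$, $\nu^k$ for the two players' policies in episode $k$, $s_1^k$ for the initial state, and $K$ for the number of episodes with $T = KH$) are an assumed, illustrative convention rather than the paper's own:
\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Bigl( V_1^{*,\nu^k}(s_1^k) \;-\; V_1^{\mu^k,*}(s_1^k) \Bigr),
\qquad
V_1^{*,\nu}(s) = \max_{\mu} V_1^{\mu,\nu}(s), \quad
V_1^{\mu,*}(s) = \min_{\nu} V_1^{\mu,\nu}(s),
\]
so that each summand is nonnegative and vanishes exactly when $(\mu^k,\nu^k)$ forms a Nash equilibrium of the game starting from $s_1^k$.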