We study reinforcement learning for two-player zero-sum Markov games with simultaneous moves in the finite-horizon setting, where the transition kernel of the underlying Markov game can be parameterized by a linear function over the current state, both players' actions, and the next state. In particular, we assume that we control both players and aim to find the Nash equilibrium by minimizing the duality gap. We propose an algorithm, Nash-UCRL, based on the principle of "Optimism-in-the-Face-of-Uncertainty". Our algorithm only needs to find a Coarse Correlated Equilibrium (CCE), which can be computed efficiently. Specifically, we show that Nash-UCRL provably achieves an $\tilde{O}(dH\sqrt{T})$ regret, where $d$ is the dimension of the linear function class, $H$ is the horizon length of the game, and $T$ is the total number of steps in the game. To assess the optimality of our algorithm, we also prove an $\tilde{\Omega}(dH\sqrt{T})$ lower bound on the regret. Our upper bound matches the lower bound up to logarithmic factors, which suggests the optimality of our algorithm.
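For concreteness, a common way to formalize the regret minimized above is as the cumulative duality gap of the policy pairs executed over the episodes; this is a sketch using standard notation from the finite-horizon Markov game literature, and the symbols $K$, $s_1$, $\mu^k$, $\nu^k$, $V_1^{*,\nu^k}$, and $V_1^{\mu^k,*}$ are our assumptions rather than definitions given in this abstract:
\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Big( V_1^{*,\nu^k}(s_1) \;-\; V_1^{\mu^k,*}(s_1) \Big), \qquad K = T/H,
\]
where $\mu^k$ and $\nu^k$ denote the max-player and min-player policies executed in episode $k$ starting from the initial state $s_1$, $V_1^{*,\nu^k}$ is the value of the best response against $\nu^k$, and $V_1^{\mu^k,*}$ is the value of the best response against $\mu^k$. Each summand (the duality gap) is nonnegative and equals zero exactly when $(\mu^k,\nu^k)$ forms a Nash equilibrium, so sublinear regret implies that the executed policy pairs approach the Nash equilibrium on average.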