This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves. The study focuses on the challenging settings where the value function or the model is parameterized by general function classes. Provably efficient algorithms for both decoupled and {coordinated} settings are developed. In the {decoupled} setting where the agent controls a single player and plays against an arbitrary opponent, we propose a new model-free algorithm. The sample complexity is governed by the Minimax Eluder dimension -- a new dimension of the function class in Markov games. As a special case, this method improves the state-of-the-art algorithm by a $\sqrt{d}$ factor in the regret when the reward function and transition kernel are parameterized with $d$-dimensional linear features. In the {coordinated} setting where both players are controlled by the agent, we propose a model-based algorithm and a model-free algorithm. In the model-based algorithm, we prove that sample complexity can be bounded by a generalization of Witness rank to Markov games. The model-free algorithm enjoys a $\sqrt{K}$-regret upper bound where $K$ is the number of episodes. Our algorithms are based on new techniques of alternate optimism.
翻译:本文用同步动作来考虑两个玩家的零和限值- orizon Markov 游戏。 研究的重点是以普通功能等级参数参数来参数化价值函数或模型的具有挑战性的设置。 开发了分解和 {协调} 设置的极高效算法。 在代理控制单个玩家和玩耍对抗任意对手的 {dcoupled} 设置的 代理控制单一玩家和玩耍的无型算法中, 我们提议一个新的无型算法。 样本复杂性受Minimax Eluder 维度 -- -- Markov 游戏功能类的新维度 -- -- 作为特例, 这个方法用一个 $\ sqrt{ d} 来改进最先进的算法。 当奖励功能和过渡内核的参数以美元- 维度线性特性来进行参数化时, 令人后悔。 在 { { dcoord} 设置两个玩家都受代理人控制的 以模型为基础的算法和无型算法。 在基于模型的算法中, 我们证明, 样本复杂性可以被一般证人级别到Markov $ 的替代游戏。 美元 的高级算法以美元为基础。