This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves. The study focuses on the challenging setting where the value function or the model is parameterized by a general function class. We develop provably efficient algorithms for both the decoupled and the coordinated settings. In the decoupled setting, where the agent controls a single player and plays against an arbitrary opponent, we propose a new model-free algorithm. Its sample complexity is governed by the Minimax Eluder dimension, a new dimension of the function class in Markov games. As a special case, this method improves on the state-of-the-art algorithm by a $\sqrt{d}$ factor in the regret when the reward function and transition kernel are parameterized with $d$-dimensional linear features. In the coordinated setting, where the agent controls both players, we propose a model-based algorithm and a model-free algorithm. For the model-based algorithm, we prove that the sample complexity is bounded by a generalization of Witness rank to Markov games. The model-free algorithm enjoys a $\sqrt{K}$-regret upper bound, where $K$ is the number of episodes.