Modern reinforcement learning (RL) commonly engages practical problems with large state spaces, where function approximation must be deployed to approximate either the value function or the policy. While recent progress in RL theory addresses a rich set of RL problems with general function approximation, such successes are mostly restricted to the single-agent setting. It remains elusive how to extend these results to multi-agent RL, especially due to the new challenges arising from its game-theoretical nature. This paper considers two-player zero-sum Markov Games (MGs). We propose a new algorithm that can provably find the Nash equilibrium policy using a polynomial number of samples, for any MG with low multi-agent Bellman-Eluder dimension -- a new complexity measure adapted from its single-agent version (Jin et al., 2021). A key component of our new algorithm is the exploiter, which facilitates the learning of the main player by deliberately exploiting her weakness. Our theoretical framework is generic: it applies to a wide range of models including, but not limited to, tabular MGs, MGs with linear or kernel function approximation, and MGs with rich observations.