A central problem in the theory of multi-agent reinforcement learning (MARL) is to understand what structural conditions and algorithmic principles lead to sample-efficient learning guarantees, and how these considerations change as we move from few to many agents. We study this question in a general framework for interactive decision making with multiple agents, encompassing Markov games with function approximation and normal-form games with bandit feedback. We focus on equilibrium computation, in which a centralized learning algorithm aims to compute an equilibrium by controlling multiple agents that interact with an unknown environment. Our main contributions are:

- We provide upper and lower bounds on the optimal sample complexity for multi-agent decision making based on a multi-agent generalization of the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. (2021) in the single-agent counterpart to our setting (recalled below). Compared to the best results for the single-agent setting, our upper and lower bounds have additional gaps. We show that no "reasonable" complexity measure can close these gaps, highlighting a striking separation between single and multiple agents.
- We show that characterizing the statistical complexity of multi-agent decision making is equivalent to characterizing the statistical complexity of single-agent decision making with hidden (unobserved) rewards, a framework that subsumes variants of the partial monitoring problem. As a consequence, we characterize the statistical complexity of hidden-reward interactive decision making to the best extent possible.

Building on this development, we provide several new structural results, including 1) conditions under which the statistical complexity of multi-agent decision making can be reduced to that of the single-agent setting, and 2) conditions under which the so-called curse of multiple agents can be avoided.
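For context, the single-agent Decision-Estimation Coefficient of Foster et al. (2021) can be stated roughly as follows; the notation (decision space $\Pi$, model class $\mathcal{M}$, reference model $\bar{M}$, mean-reward function $f^M$) follows that line of work and may differ in details from the present paper:

$$
\mathrm{dec}_{\gamma}(\mathcal{M}, \bar{M}) \;=\; \inf_{p \in \Delta(\Pi)} \; \sup_{M \in \mathcal{M}} \; \mathbb{E}_{\pi \sim p}\Big[ f^{M}(\pi_{M}) - f^{M}(\pi) \;-\; \gamma \cdot D_{\mathrm{H}}^{2}\big(M(\pi), \bar{M}(\pi)\big) \Big],
$$

where $\pi_M$ denotes the optimal decision under model $M$ and $D_{\mathrm{H}}^{2}$ is the squared Hellinger distance between the observation distributions induced by playing $\pi$. Roughly speaking, the multi-agent generalization studied here replaces the suboptimality term $f^{M}(\pi_{M}) - f^{M}(\pi)$ with an equilibrium-gap analogue; see the paper for the precise definition.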