This paper considers the challenging tasks of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent only sees her own individual observations and actions, which reveal incomplete information about the underlying state of the system. This paper studies these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs -- weakly revealing POMGs -- in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to find approximate Nash equilibria, correlated equilibria, as well as coarse correlated equilibria of weakly revealing POMGs, in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm is capable of achieving sublinear regret when compared against the optimal maximin policies. To the best of our knowledge, this work provides the first line of sample-efficient results for learning POMGs.
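For concreteness, the weak revealing condition and the optimism-plus-MLE construction referred to above can be sketched as follows. This is an illustrative formalization only, stated for the single-step case with generic notation (emission matrices $\mathbb{O}_h$, model class $\Theta$, likelihood threshold $\beta$); the precise definitions, multi-step variants, and constants are those given in the body of the paper.

% Weakly revealing (single-step sketch): every joint emission matrix
% \mathbb{O}_h \in \mathbb{R}^{O \times S} has a non-negligible minimum singular value,
% so observations carry enough information about the latent state.
\[
  \sigma_{\min}(\mathbb{O}_h) \;\ge\; \alpha \;>\; 0, \qquad \forall\, h \in [H].
\]

% Optimistic MLE (sketch): keep all models whose log-likelihood on the data
% collected so far, \mathcal{D}_k, is within \beta of the maximum, then pick the
% most favorable model-policy pair from this confidence set.
\[
  \mathcal{B}_k \;=\; \Big\{ \hat\theta \in \Theta \;:\;
    \sum_{(\pi,\tau) \in \mathcal{D}_k} \log \mathbb{P}^{\pi}_{\hat\theta}(\tau)
    \;\ge\; \max_{\theta' \in \Theta} \sum_{(\pi,\tau) \in \mathcal{D}_k}
    \log \mathbb{P}^{\pi}_{\theta'}(\tau) \;-\; \beta \Big\},
  \qquad
  (\theta_k, \pi_k) \;\in\; \arg\max_{\hat\theta \in \mathcal{B}_k,\; \pi}\; V^{\pi}_{\hat\theta}.
\]

The selected policy $\pi_k$ is then executed to collect new trajectories, which are added to $\mathcal{D}_k$ before the confidence set is rebuilt in the next round.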