We study multi-agent general-sum Markov games with nonlinear function approximation. We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown nonlinear representation. The goal is to design an algorithm that (1) finds an $\varepsilon$-equilibrium policy sample-efficiently without prior knowledge of the environment or the representation, and (2) permits a deep-learning-friendly implementation. We leverage representation learning and present a model-based and a model-free approach to construct an effective representation from the collected data. For both approaches, the algorithm achieves a sample complexity of $\mathrm{poly}(H,d,A,1/\varepsilon)$, where $H$ is the game horizon, $d$ is the dimension of the feature vector, $A$ is the size of the joint action space, and $\varepsilon$ is the optimality gap. When the number of players is large, this sample complexity can scale exponentially with the number of players in the worst case. To address this challenge, we consider Markov games with a factorized transition structure and present an algorithm that escapes such exponential scaling. To the best of our knowledge, this is the first sample-efficient algorithm for multi-agent general-sum Markov games that incorporates (nonlinear) function approximation. We accompany our theoretical result with a neural-network-based implementation of our algorithm and evaluate it against the widely used deep RL baseline, DQN with fictitious play.
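For concreteness, a minimal sketch of the low-rank transition structure referred to above, written in the standard low-rank MDP form adapted to the multi-agent setting; the symbols $\phi^{\star}_h$, $\mu^{\star}_h$, and the joint action $\boldsymbol{a}$ are illustrative notation (assumptions for this sketch), not necessarily the notation used in the paper:
\[
  P_h(s' \mid s, \boldsymbol{a})
    = \big\langle \phi^{\star}_h(s, \boldsymbol{a}),\, \mu^{\star}_h(s') \big\rangle,
  \qquad
  \phi^{\star}_h : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{d},
  \quad
  \mu^{\star}_h : \mathcal{S} \to \mathbb{R}^{d},
\]
where $\boldsymbol{a} = (a_1, \dots, a_m)$ is the joint action of the $m$ players, $\phi^{\star}_h$ is the unknown nonlinear representation that the algorithm must learn from data, and $d$ is the feature dimension appearing in the $\mathrm{poly}(H,d,A,1/\varepsilon)$ sample complexity above.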