Multi-agent reinforcement learning (MARL) has recently attracted significant research attention. However, unlike in the single-agent setting, many theoretical and algorithmic aspects of MARL are not yet well understood. In this paper, we study the emergence of coordinated behavior by autonomous agents using an actor-critic (AC) algorithm. Specifically, we propose and analyze a class of coordinated actor-critic (CAC) algorithms in which individually parametrized policies have a {\it shared} part (which is jointly optimized among all agents) and a {\it personalized} part (which is only locally optimized). Such a {\it partially personalized} policy allows agents to learn to coordinate by leveraging peers' past experience while adapting to their individual tasks. The flexibility in our design allows the proposed MARL-CAC algorithm to be used in a {\it fully decentralized} setting, where the agents can only communicate with their neighbors, as well as in a {\it federated} setting, where the agents occasionally communicate with a server while optimizing their (partially personalized) local models. Theoretically, we show that under some standard regularity assumptions, the proposed MARL-CAC algorithm requires $\mathcal{O}(\epsilon^{-\frac{5}{2}})$ samples to achieve an $\epsilon$-stationary solution (defined as a solution at which the squared norm of the gradient of the objective function is less than $\epsilon$). To the best of our knowledge, this work provides the first finite-sample guarantee for a decentralized AC algorithm with partially personalized policies.
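As a concrete illustration (not part of the paper), the following PyTorch sketch shows one possible way to instantiate a partially personalized policy with a shared part and a personalized part, together with a simple averaging step for the shared parameters only; all names (PartiallyPersonalizedPolicy, hidden_dim, etc.) and the specific layer sizes are illustrative assumptions, not the authors' implementation.

\begin{verbatim}
import torch
import torch.nn as nn

class PartiallyPersonalizedPolicy(nn.Module):
    """Policy with a shared part (jointly optimized across agents)
    and a personalized part (optimized only locally)."""
    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        # Shared part: intended to be synchronized among all agents.
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.Tanh())
        # Personalized part: kept and updated locally by each agent.
        self.personalized = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs):
        # Action logits for a categorical policy over act_dim actions.
        return self.personalized(self.shared(obs))

# One policy per agent.
agents = [PartiallyPersonalizedPolicy(obs_dim=8, act_dim=4) for _ in range(3)]

# Illustrative synchronization of the shared part only: replace each agent's
# shared parameters with the network-wide average (a simple consensus step).
# In the decentralized setting this averaging would use neighbors only; in
# the federated setting it would be performed by a server.
with torch.no_grad():
    shared_params = [list(a.shared.parameters()) for a in agents]
    for i in range(len(shared_params[0])):
        avg = torch.mean(torch.stack([p[i].data for p in shared_params]), dim=0)
        for p in shared_params:
            p[i].data.copy_(avg)
\end{verbatim}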