We consider a multi-agent reinforcement learning problem in which each agent seeks to maximize a shared reward while interacting with other agents, and the agents may or may not be able to communicate. Typically, agents do not have access to the policies of other agents, so each agent is situated in a non-stationary and partially observable environment. In order to obtain agents that act in a decentralized manner, we introduce a novel algorithm under the popular framework of centralized training with decentralized execution. This training framework first obtains solutions to a multi-agent problem with a single centralized joint-space learner, which is then used to guide imitation learning for independent decentralized agents. The framework has the flexibility to use any reinforcement learning algorithm to obtain the expert and any imitation learning algorithm to obtain the decentralized agents. This is in contrast to other multi-agent learning algorithms that, for example, can require more specialized structures. We present theoretical bounds for our method, and we show that decentralized solutions to a multi-agent problem can be obtained through imitation learning.
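To make the two-stage pipeline concrete, below is a minimal, self-contained sketch on an assumed toy cooperative task (two agents, binary local states and actions); the task, the tabular expert, and names such as `decentralized_act` are illustrative assumptions, not the paper's actual algorithm or environments. Stage 1 trains a centralized expert over the joint observation-action space; stage 2 distills it into independent per-agent policies via behavioral cloning, which then execute from local observations only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cooperative task (assumed for illustration): two agents, each observing a
# local state in {0, 1}; the joint action that matches the joint state earns reward 1.
N_AGENTS, N_LOCAL_STATES, N_LOCAL_ACTIONS = 2, 2, 2
joint_states = [(s0, s1) for s0 in range(N_LOCAL_STATES) for s1 in range(N_LOCAL_STATES)]
joint_actions = [(a0, a1) for a0 in range(N_LOCAL_ACTIONS) for a1 in range(N_LOCAL_ACTIONS)]

def reward(joint_state, joint_action):
    return float(joint_state == joint_action)

# Stage 1: centralized expert via tabular Q-learning over the joint space
# (any RL algorithm could play this role in the framework).
Q = np.zeros((len(joint_states), len(joint_actions)))
for episode in range(2000):
    s_idx = rng.integers(len(joint_states))
    # epsilon-greedy action selection in the joint action space
    a_idx = rng.integers(len(joint_actions)) if rng.random() < 0.2 else int(Q[s_idx].argmax())
    r = reward(joint_states[s_idx], joint_actions[a_idx])
    Q[s_idx, a_idx] += 0.1 * (r - Q[s_idx, a_idx])  # one-step update on this bandit-like task

# Stage 2: imitation learning -- behavioral cloning of the expert's joint policy
# into independent per-agent policies defined over local observations only.
local_policies = [np.zeros((N_LOCAL_STATES, N_LOCAL_ACTIONS)) for _ in range(N_AGENTS)]
for s_idx, joint_state in enumerate(joint_states):
    expert_action = joint_actions[int(Q[s_idx].argmax())]
    for i in range(N_AGENTS):
        # each agent sees only its own local state and copies its own action component
        local_policies[i][joint_state[i], expert_action[i]] += 1

# Decentralized execution: each agent acts from its own observation, with no
# access to the other agents' observations or policies.
def decentralized_act(joint_state):
    return tuple(int(local_policies[i][joint_state[i]].argmax()) for i in range(N_AGENTS))

print(np.mean([reward(s, decentralized_act(s)) for s in joint_states]))  # ~1.0 on this toy task
```

In this sketch the expert and the imitation step are both tabular only to keep the example short; the framework described above is agnostic to both choices, so the joint-space learner could be any RL method and the cloning step any imitation learning method.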