We study the problem of decentralized multi-agent reinforcement learning (MARL). In our setting, the global state, action, and reward are assumed to be fully observable, while each agent keeps its local policy private and therefore does not share it with others. The agents are connected by a communication graph over which they can exchange information with their neighbors. They make individual decisions and cooperate to reach a higher accumulated reward. Toward this end, we first propose a decentralized actor-critic (AC) framework. We then design policy evaluation and policy improvement algorithms for Markov decision processes (MDPs) with discrete and continuous state-action spaces, respectively. Furthermore, we provide a convergence analysis for the discrete-space case, which guarantees that the policy is improved by alternating between policy evaluation and policy improvement. To validate the effectiveness of the algorithms, we design experiments and compare them with previous algorithms, e.g., Q-learning \cite{watkins1992q} and MADDPG \cite{lowe2017multi}. The results show that our algorithms perform better in terms of both learning speed and final performance. Moreover, the algorithms can be executed in an off-policy manner, which greatly improves data efficiency compared with on-policy algorithms.
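As a rough illustration of the alternating policy evaluation and policy improvement described above, the following is a minimal Python sketch of a single agent in a discrete state-action MDP. It is not the paper's algorithm: the class name, the tabular critic, the softmax actor, and the hyperparameters (alpha, beta, gamma) are all illustrative assumptions; the communication graph and neighbor exchange are omitted.

\begin{verbatim}
import numpy as np

class DecentralizedACAgent:
    """Sketch of one agent with a private actor and a local critic (assumed design)."""

    def __init__(self, n_states, n_actions, alpha=0.1, beta=0.01, gamma=0.95):
        self.q = np.zeros((n_states, n_actions))      # local critic: action values
        self.theta = np.zeros((n_states, n_actions))  # private actor: softmax logits
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

    def policy(self, s):
        # Softmax policy kept private to this agent.
        logits = self.theta[s] - self.theta[s].max()
        p = np.exp(logits)
        return p / p.sum()

    def act(self, s, rng):
        p = self.policy(s)
        return rng.choice(len(p), p=p)

    def evaluate(self, s, a, r, s_next):
        # Policy evaluation: TD update of the local critic using the
        # globally observed state, action, and reward.
        v_next = self.policy(s_next) @ self.q[s_next]
        td_error = r + self.gamma * v_next - self.q[s, a]
        self.q[s, a] += self.alpha * td_error
        return td_error

    def improve(self, s, a, td_error):
        # Policy improvement: policy-gradient step on the private actor.
        grad_log = -self.policy(s)
        grad_log[a] += 1.0
        self.theta[s] += self.beta * td_error * grad_log
\end{verbatim}

In a training loop, each agent would call \texttt{evaluate} and then \texttt{improve} on every observed transition, mirroring the alternation between policy evaluation and policy improvement; in the decentralized setting, the actor parameters stay local while only permissible quantities are exchanged with neighbors.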