In this paper, we study the problem of networked multi-agent reinforcement learning (MARL), where a number of agents are deployed over a partially connected network and each agent interacts only with nearby agents. Networked MARL requires all agents to make decisions in a decentralized manner to optimize a global objective under restricted communication between neighbors over the network. Inspired by the fact that sharing plays a key role in humans' learning of cooperation, we propose LToS, a hierarchically decentralized MARL framework that enables agents to learn to dynamically share reward with neighbors and thereby encourages cooperation on the global objective through collectives. For each agent, the high-level policy learns how to share reward with neighbors so as to decompose the global objective, while the low-level policy learns to optimize the local objective induced by the high-level policies of its neighborhood. The two policies form a bi-level optimization and are learned alternately. We empirically demonstrate that LToS outperforms existing methods in both social dilemma and networked MARL scenarios across different scales.
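For concreteness, one way to write the induced local objective is the following sketch (the notation is ours and serves only as an illustrative assumption about the sharing mechanism, not as the paper's exact formulation): the high-level policy of agent $j$ outputs sharing weights $w_{jk}$ over its closed neighborhood $\mathcal{N}_j \cup \{j\}$ with $\sum_{k \in \mathcal{N}_j \cup \{j\}} w_{jk} = 1$, and the low-level policy of agent $i$ then optimizes the shaped reward
\[
\tilde{r}_i \;=\; \sum_{j \,:\, i \in \mathcal{N}_j \cup \{j\}} w_{ji}\, r_j,
\]
i.e., agent $i$ collects the shares of reward that its neighbors (and $i$ itself) allocate to it, so that locally optimizing $\tilde{r}_i$ contributes to the global objective.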