In this paper, we study the problem of networked multi-agent reinforcement learning (MARL), where a number of agents are deployed as a partially connected network and each agent interacts only with nearby agents. Networked MARL requires all agents to make decisions in a decentralized manner to optimize a global objective, with communication restricted to neighbors over the network. Inspired by the fact that \textit{sharing} plays a key role in how humans learn to cooperate, we propose LToS, a hierarchically decentralized MARL framework that enables agents to learn to dynamically share reward with neighbors so as to encourage cooperation on the global objective. For each agent, the high-level policy learns how to share reward with neighbors to decompose the global objective, while the low-level policy learns to optimize the local objective induced by the high-level policies in its neighborhood. The two policies form a bi-level optimization and are learned alternately. We empirically demonstrate that LToS outperforms existing methods in both social dilemma and networked MARL scenarios.
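As a minimal sketch of the kind of reward-sharing decomposition described above, with notation introduced purely for illustration (the weights $w_{ij}$ and neighborhoods $\mathcal{N}_i$ are our assumptions, not the paper's definitions): suppose agent $i$'s high-level policy outputs sharing weights over its closed neighborhood $\mathcal{N}_i \cup \{i\}$ satisfying $\sum_{j \in \mathcal{N}_i \cup \{i\}} w_{ij} = 1$. The induced local reward of agent $j$ would then be
\[
\tilde{r}_j \;=\; \sum_{i \,:\, j \in \mathcal{N}_i \cup \{i\}} w_{ij}\, r_i,
\qquad\text{so that}\qquad
\sum_j \tilde{r}_j \;=\; \sum_i r_i,
\]
where $r_i$ is agent $i$'s environment reward. Under this sketch, each low-level policy maximizes the return based on its shaped reward $\tilde{r}_j$, while the normalization constraint keeps the sum of shaped rewards equal to the global objective, so optimizing the induced local objectives remains consistent with the global one.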