Parameter sharing, where each agent independently learns a policy whose parameters are fully shared with all other agents' policies, is a popular baseline method for multi-agent deep reinforcement learning. Unfortunately, since all agents share the same policy network, they cannot learn different policies or tasks. This issue has been circumvented experimentally by adding an agent-specific indicator signal to each observation, which we term "agent indication." Agent indication is limited, however, in that without modification it does not allow parameter sharing to be applied to environments with heterogeneous action and/or observation spaces. This work formalizes the notion of agent indication and, for the first time, proves that it enables convergence to optimal policies. Next, we formally introduce methods that extend parameter sharing to learning in heterogeneous observation and action spaces, and prove that these methods also allow convergence to optimal policies. Finally, we experimentally confirm that the methods we introduce work in practice, and conduct a wide array of experiments studying the empirical efficacy of many different agent indication schemes for graphical observation spaces.
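To make the two ideas above concrete, the sketch below shows one plausible way to combine agent indication with observation padding so that agents with heterogeneous flat observation spaces can share a single policy network. The function names, the one-hot indicator encoding, and the zero-padding scheme are illustrative assumptions, not the exact constructions analyzed in the paper.

```python
import numpy as np


def pad_observation(obs, max_obs_dim):
    """Zero-pad a flat observation to a common length so agents with
    heterogeneous observation spaces can feed the same shared network."""
    padded = np.zeros(max_obs_dim, dtype=obs.dtype)
    padded[: obs.shape[0]] = obs
    return padded


def indicate_agent(obs, agent_idx, num_agents):
    """Append a one-hot agent indicator so the shared policy can
    condition its behavior on which agent the observation belongs to."""
    indicator = np.zeros(num_agents, dtype=obs.dtype)
    indicator[agent_idx] = 1.0
    return np.concatenate([obs, indicator])


# Example: two agents with 3- and 5-dimensional observations sharing one policy.
num_agents, max_obs_dim = 2, 5
obs_a = np.random.rand(3).astype(np.float32)  # agent 0
obs_b = np.random.rand(5).astype(np.float32)  # agent 1
x_a = indicate_agent(pad_observation(obs_a, max_obs_dim), 0, num_agents)
x_b = indicate_agent(pad_observation(obs_b, max_obs_dim), 1, num_agents)
assert x_a.shape == x_b.shape == (max_obs_dim + num_agents,)
```

After this preprocessing, every agent produces inputs of identical shape, so a single parameter-shared network can be trained on all of them while still distinguishing agents via the appended indicator.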