Solving tasks with sparse rewards is one of the most important challenges in reinforcement learning. In the single-agent setting, this challenge is addressed by introducing intrinsic rewards that motivate agents to explore unseen regions of their state spaces; however, applying these techniques naively to the multi-agent setting results in agents exploring independently, without any coordination among themselves. Exploration in cooperative multi-agent settings can be accelerated and improved if agents coordinate their exploration. In this paper, we introduce a framework for designing intrinsic rewards that take into account what other agents have explored, so that agents can coordinate their exploration. We then develop an approach for learning how to dynamically select among several exploration modalities in order to maximize extrinsic rewards. Concretely, we formulate the approach as a hierarchical policy in which a high-level controller selects among sets of policies trained on diverse intrinsic rewards, and the low-level controllers learn the action policies of all agents under each specific reward. We demonstrate the effectiveness of the proposed approach in cooperative domains with sparse rewards, where state-of-the-art methods fail, and on challenging multi-stage tasks that require changing modes of coordination.
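To make the hierarchical formulation concrete, the sketch below illustrates the general idea of a high-level controller choosing, per episode, which intrinsic-reward-specific set of low-level policies the agents follow, and updating its preference from the extrinsic return it observes. All names (e.g. HighLevelSelector, toy_episode), the epsilon-greedy update, and the toy environment are illustrative assumptions for exposition, not the paper's actual architecture or training procedure.

```python
# Minimal sketch (assumed, not the paper's implementation): a high-level
# selector over exploration modalities, updated from extrinsic returns.
import numpy as np


class HighLevelSelector:
    """Epsilon-greedy value estimates over exploration modalities (heads)."""

    def __init__(self, n_heads: int, epsilon: float = 0.1, lr: float = 0.05):
        self.q = np.zeros(n_heads)  # estimated extrinsic return per head
        self.epsilon = epsilon
        self.lr = lr

    def select(self, rng: np.random.Generator) -> int:
        if rng.random() < self.epsilon:
            return int(rng.integers(len(self.q)))
        return int(np.argmax(self.q))

    def update(self, head: int, extrinsic_return: float) -> None:
        # Move the chosen head's estimate toward the observed extrinsic return.
        self.q[head] += self.lr * (extrinsic_return - self.q[head])


def toy_episode(rng: np.random.Generator, head: int, n_agents: int = 2) -> float:
    """Stand-in for rolling out all agents under the head's low-level policies.

    In this toy setup, head 1 happens to coordinate the agents best and yields
    the highest expected extrinsic return; a real rollout would instead query
    the learned low-level action policies of each agent.
    """
    base = 1.0 if head == 1 else 0.2
    return base * n_agents + rng.normal(scale=0.1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    selector = HighLevelSelector(n_heads=3)
    for _ in range(500):
        head = selector.select(rng)
        ret = toy_episode(rng, head)
        selector.update(head, ret)
    # The selector learns to prefer the exploration modality whose
    # low-level policies yield the highest extrinsic return (head 1 here).
    print("estimated returns per head:", np.round(selector.q, 2))
```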